DST pitfalls

10 errors that cost time and produce uninformative or no error messages

Published

July 21, 2026

This page collects the errors that most frequently catch new users of DST registers. What they have in common: the error messages are either confusing, or there is no error message at all - the result is just silently wrong.

1. `dodsaars` vs `dodsaasg` - use the correct death register

There are two registers with similar names:

Register	Contains	Used for
`dodsaars`	Individual death registrations with precise date of death (`d_dodsdto`)	Censoring at death
`dodsaasg`	Cause-of-death classification	Only for analysis of cause of death

dodsaasg does not have the date of death in the correct format and is not the authoritative source for individual death dates.

Check dodsaars coverage in your project guide. dodsaars does not necessarily cover your entire study period - in project 708421 it covers only ~1970–2001 (as of June 2026), and post-2001 deaths require a separate extraction. Other projects may have different coverage.

# CORRECT - read the register by name (fastreg gets the path from your project config)
death <- read_register("dodsaars") %>%
  rename_with(tolower) # check coverage in your project guide
# Without fastreg: open_dataset("path/to/dodsaars/") %>% rename_with(tolower)
death_person <- death %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # only the cohort's pnr's
  select(pnr, death_date = d_dodsdto) %>% # d_dodsdto is the confirmed column
  collect()

# WRONG - do not use dodsaasg for censoring dates
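
A quick coverage check can catch a truncated death register before it distorts censoring. The sketch below is a minimal illustration in base R: the toy `death_person` table, the dates in it, and the `study_years` range are all hypothetical stand-ins, not values from any project.

```r
# Toy stand-in for the collected death_person table (hypothetical values)
death_person <- data.frame(
  pnr = c("p1", "p2", "p3"),
  death_date = as.Date(c("1985-03-02", "1999-11-30", "2001-06-15"))
)

# Hypothetical study period, in calendar years
study_years <- 1995:2010

# Last calendar year in which any death is observed
death_years <- as.integer(format(death_person$death_date, "%Y"))
last_death_year <- max(death_years, na.rm = TRUE)

# Study years after the last observed death: possibly not covered
uncovered <- study_years[study_years > last_death_year]
if (length(uncovered) > 0) {
  message("No deaths observed after ", last_death_year,
          " - check register coverage in your project guide")
}
```

In a large cohort, a run of trailing years without a single death suggests that the extract ends early. It does not prove it, so the project guide remains the reference.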

2. RAM is shared - clean up after large extractions

You are on a shared server with shared RAM. When the memory bar in RStudio turns red, everyone on the server experiences slowdowns - and DST automatically kills processes when RAM is close to full, so an oversized extraction can cost you your unsaved work.

# Filter early - never collect() first
lmdb <- read_register("lmdb") %>%
  rename_with(tolower) # lazy connection - no RAM used yet (fastreg gets the path)
# Without fastreg: open_dataset("path/to/lmdb/") %>% rename_with(tolower)

result <- lmdb %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # only the cohort's pnr's
  filter(substr(atc, 1, 4) == "N06D") %>% # filter before collect
  select(pnr, atc, eksd) %>%
  collect() # only now is data moved to R

# Free large objects when you are done with them
rm(lmdb) # delete the lazy connection - it does not use much, but it is good practice
gc() # return memory to the operating system

More practical habits (save often, partial loading, Task Manager) are described in Mind your RAM in the shared environment; DST’s official advice is in DST guide: Reducing RAM use in the shared environment (PDF, Danish).
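
To decide what to `rm()`, it helps to see which objects actually take up memory. The helper below is a small sketch in base R; `largest_objects()` is a hypothetical name introduced here, not a fastreg or DST function, and `object.size()` only gives an estimate.

```r
# List the n largest objects in an environment, in megabytes (estimate)
largest_objects <- function(env = globalenv(), n = 5) {
  object_names <- ls(env)
  size_mb <- vapply(
    object_names,
    function(nm) as.numeric(object.size(get(nm, envir = env))) / 1024^2,
    numeric(1)
  )
  head(sort(size_mb, decreasing = TRUE), n)
}

# Demonstration on a separate environment so the example is self-contained
demo_env <- new.env()
assign("big_table", numeric(1e6), envir = demo_env) # roughly 8 MB
assign("small_value", 1L, envir = demo_env)
largest_objects(demo_env, n = 2)
```

Calling `largest_objects()` without arguments inspects your own workspace; remove the large objects you no longer need and then run `gc()`.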

3. `rename_with(tolower)` must be called on each register

Raw column names vary by register and year: PNR, pnr, Pnr, V_CPR. If you forget it, semi_join(..., by = "pnr") fails with “Column pnr not found” - even though the column is there, just under a different casing.

The rule: every open_dataset() or read_register() call is followed immediately by %>% rename_with(tolower), as the first step in your pipe. See Extracting data step by step for explanation and example.

4. Date columns are not always in Date format

DST registers store dates in multiple formats - they look the same but behave differently.

Format	Example	What `class()` returns	What to do
Date	`2020-05-15`	`"Date"`	Nothing - can be used directly
Character	`"2020-05-15"`	`"character"`	`as.Date(column)`
Datetime	`"2020-05-15 14:32:00"`	`"POSIXct" "POSIXt"`	`as.Date(column)` to get only the date part
SAS integer	`21990`	`"numeric"`	`as.Date(column, origin = "1960-01-01")`

The rule: always check class() on a date column before using it in calculations.

class(lpr_a_kontakt$kont_starttidspunkt) # "POSIXct" "POSIXt" - datetime, not Date
# Fix:
lpr_a_kontakt <- lpr_a_kontakt %>%
  mutate(date = as.Date(kont_starttidspunkt)) # keep only the date part

class(bef$foed_dag) # "Date" - can be used directly
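
The four rows of the table can be folded into one helper that checks the class and converts accordingly. This is a sketch in base R under two assumptions that you should verify for your own data: that numeric date columns are SAS day counts (days since 1960-01-01), and that character dates are in ISO `YYYY-MM-DD` form. `to_date()` is a hypothetical name, not part of any package.

```r
# Convert a date-like column to Date, depending on its class
to_date <- function(x) {
  if (inherits(x, "Date")) return(x) # already a Date
  if (inherits(x, "POSIXct")) {
    # format() uses the time zone stored on the object, so the calendar date is kept
    return(as.Date(format(x, "%Y-%m-%d")))
  }
  if (is.character(x)) return(as.Date(x)) # assumes "YYYY-MM-DD"
  if (is.numeric(x)) return(as.Date(x, origin = "1960-01-01")) # assumes SAS day count
  stop("Unsupported date class: ", paste(class(x), collapse = "/"))
}

to_date(21990) # SAS integer -> 2020-03-16
to_date("2020-05-15") # character -> Date
to_date(as.POSIXct("2020-05-15 14:32:00", tz = "UTC")) # datetime -> date part only
```

Stopping on an unknown class is deliberate: a silent fallback would reintroduce the problem this pitfall describes.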

5. BEF is a status snapshot - not a live register

BEF is a status register: it records the composition of the population at a given reference time - not continuously. DST’s reference time is ultimo (typically 31 December for an annual snapshot). Since 2008, BEF is also delivered quarterly (March, June, September, December).

**“aar == 2020 = 1 January 2020” is a project convention.** In many projects BEF snapshots are renamed so that aar == 2020 conventionally refers to the population composition as of 1 January 2020 - but this does not follow from DST’s delivery naming. Confirm the convention in your project guide.

See DST’s official BEF documentation: statistikdokumentation/befolkningen →

Under the 1 January convention, a person who dies in June 2020 still appears in the 2020 BEF snapshot.

# ERROR: do not use BEF to check "alive on a specific date"
bef_2020 <- bef %>%
  filter(aar == 2020) # includes everyone in the 2020 snapshot
# - including those who die during 2020

# CORRECT: combine with dodsaars to exclude deaths
deaths <- read_register("dodsaars") %>% # without fastreg: open_dataset("path/to/dodsaars/")
  rename_with(tolower) %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, d_dodsdto) %>%
  collect()

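bef_alive <- bef_2020 %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # only the cohort's pnr's
  collect() %>% # deaths is a local table, so collect BEF before joining
  left_join(deaths, by = "pnr") %>%
  filter(is.na(d_dodsdto) | d_dodsdto > index_date) # alive at index_date (a Date you define)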

6. The “a” in `lpr_a_diagnose` does not mean A-type diagnoses

The table is called lpr_a_diagnose - the “a” refers to “analysis model” (the LPR_A series introduced in 2025). It does not mean the table only contains A-type (action) diagnoses.

The table contains all diagnosis types: A (action), B (secondary diagnosis) and G (underlying condition). You still need to filter on diag_kode_type:

lpr_a_diagnose %>%
  filter(diag_kode_type %in% c("A", "B")) %>% # still necessary
  ...

7. Categorical codes are not consistent across registers

The same variable can have different coding in different registers - different type (numeric vs. character), different values, or both.

In practice you extract demographic variables (sex, age) from BEF and rarely need to compare the same variable in another register. But if you do, always check with table() and class() before using the variable:

table(register_a$koen) # what are the actual values and types?
class(register_a$koen)
table(register_b$koen)
class(register_b$koen)
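
If the check reveals a type mismatch, harmonise the types before you compare or join. The sketch below uses two toy tables; the `koen` values and their coding are invented for illustration and do not describe any actual DST register.

```r
# Toy registers with the same variable stored as different types (hypothetical coding)
register_a <- data.frame(pnr = c("p1", "p2"), koen = c(1, 2))
register_b <- data.frame(pnr = c("p1", "p2"), koen = c("1", "2"),
                         stringsAsFactors = FALSE)

# Compare the types first
same_type <- identical(class(register_a$koen), class(register_b$koen)) # FALSE here

# Harmonise to character, then compare the actual values
register_a$koen <- as.character(register_a$koen)
values_agree <- identical(register_a$koen, register_b$koen)
```

Converting types only helps when the values mean the same thing. If one register codes the variable with different values altogether, you need an explicit recoding table from the register documentation.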

8. `!!` (bang-bang) forgotten in lazy evaluation

When filtering with a local R vector inside a DuckDB query, you must use !!. Without it, DuckDB looks for a column with that name - and fails silently or with a confusing message.

# Example: a year list against bef (the principle applies to any local R vector)
my_years <- c(2018, 2019, 2020) # local R vector (years, here as an example)

# WRONG - DuckDB looks for a column called "my_years"
bef %>% filter(aar %in% my_years) # error or wrong result

# CORRECT - !! tells DuckDB: "use the local R vector"
bef %>% filter(aar %in% !!my_years)

!! is necessary for all local R objects used inside filter(), mutate() etc. on lazy DuckDB connections - typically code or year lists (%in% !!codes, >= !!min_date). If instead you filter on pnr against the whole cohort, use semi_join(tibble(pnr = cohort_pnrs), by = "pnr"): it takes a local table directly and needs no !!. See Functions guide for full explanation.

9. `nmi_count` ≠ `nmi_score`

These two variables are not the same and are not interchangeable:

Variable	What it is	Source
`nmi_score`	Weighted comorbidity score - Nordic Multimorbidity Index (Kristensen et al., Clin Epidemiol 2022). 50 predictors with individual weights; for example, lung cancer counts 19 points and type 2 diabetes counts 2.	See NMI page
`nmi_count`	Simple count of the number of chronic conditions (out of 33 possible) a person has been diagnosed with	Calculated separately

If you use nmi_count in your regression model instead of nmi_score, you are adjusting for something different than you think - and you get no error message.
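
A toy calculation shows how far the two can diverge. The sketch uses only the two weights quoted in the table above (lung cancer 19, type 2 diabetes 2); the persons and their conditions are invented, and the real index has many more predictors.

```r
# Two of the NMI weights, as quoted in the table above
weights <- c(lung_cancer = 19, type2_diabetes = 2)

# Hypothetical persons and their conditions
conditions <- list(
  person_a = "lung_cancer",
  person_b = "type2_diabetes",
  person_c = c("lung_cancer", "type2_diabetes")
)

# Simple count: every condition counts as 1
nmi_count <- lengths(conditions)

# Weighted score: sum of the weights of each person's conditions
nmi_score <- vapply(conditions, function(x) sum(weights[x]), numeric(1))

data.frame(nmi_count, nmi_score)
```

Persons a and b have the same count but very different scores, which is exactly the difference a regression model would silently absorb.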

10. Immortal time bias - exposure defined using the future

No error message, no warning - just an effect estimate that looks too good. Immortal time bias arises when a person is given follow-up time during which they could not, by construction, have had the outcome. It is the classic register mistake, because register data let you define groups retrospectively, looking back at what eventually happened.

A concrete example. Question: does bariatric surgery lower mortality in people with type 2 diabetes? You take everyone diagnosed with T2D in 2010, split them into a surgery group (had surgery at some point during follow-up) and a no-surgery group, and start counting follow-up for everyone at the diagnosis date.

The trap: to land in the surgery group, a person had to survive long enough to be operated. Say the average wait from diagnosis to surgery is 2 years. Those 2 years are immortal: anyone who died in that window never reached surgery and so fell into the no-surgery group instead. You have handed the surgery group ~2 years of guaranteed-alive person-time and labelled it “surgery” time.

Group	Deaths	Person-years	Rate (per 1000 py)
Surgery (immortal time counted as surgery time)	30	12,000	2.5
Surgery (time correctly aligned)	30	8,000	3.8
No surgery	50	13,000	3.8

The correctly aligned rates are essentially identical (about 3.8 per 1000 person-years) - surgery does nothing. But counting the 4,000 immortal person-years as surgery time drops its rate to 2.5, giving a rate ratio of (30/12,000)/(50/13,000) = 0.65 and making surgery look about 35% protective. The “effect” is an artefact of the misaligned time zero, not of surgery.
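
The arithmetic behind the table can be reproduced in a few lines. This sketch only recomputes the illustrative numbers from the table; the group labels are names chosen here.

```r
# Deaths and person-years from the table above
groups <- data.frame(
  group = c("surgery_misaligned", "surgery_aligned", "no_surgery"),
  deaths = c(30, 30, 50),
  person_years = c(12000, 8000, 13000)
)

# Mortality rate per 1000 person-years
groups$rate_per_1000 <- 1000 * groups$deaths / groups$person_years

# Rate ratios against the no-surgery group
rr_misaligned <- groups$rate_per_1000[1] / groups$rate_per_1000[3]
rr_aligned <- groups$rate_per_1000[2] / groups$rate_per_1000[3]

groups
```

The deaths are the same in both surgery rows; only the denominator changes. The 4,000 extra person-years alone move the rate ratio from close to 1 to 0.65.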

The fix: align time zero. Eligibility, exposure assignment and start of follow-up must coincide.

Risk-set (incidence-density) matching: start each person’s follow-up at the moment they become exposed, and assign each comparator the same index date (the core rule in Comparison cohort).
Treat exposure as time-varying: the person contributes unexposed person-time until surgery, then exposed time after - never exposed time before they were exposed (see Time-varying variables).
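
The time-varying approach comes down to splitting each person's follow-up at the surgery date. The sketch below does this in base R for a toy cohort; all persons, dates, and column names are hypothetical, and a real analysis would typically pass the resulting intervals to a survival model rather than just summing days.

```r
# Toy cohort: follow-up starts at diagnosis and ends at death or end of study
cohort <- data.frame(
  pnr = c("p1", "p2"),
  diagnosis = as.Date(c("2010-01-01", "2010-01-01")),
  surgery = as.Date(c("2012-01-01", NA)), # NA = never operated
  end = as.Date(c("2015-01-01", "2011-06-30"))
)

# Date at which a person switches from unexposed to exposed;
# never-operated persons stay unexposed until the end of follow-up
switch_date <- pmin(cohort$surgery, cohort$end, na.rm = TRUE)

# Person-time before surgery is unexposed, person-time after is exposed
cohort$unexposed_days <- as.numeric(switch_date - cohort$diagnosis)
cohort$exposed_days <- as.numeric(cohort$end - switch_date)

cohort[, c("pnr", "unexposed_days", "exposed_days")]
```

Person p1 contributes the two years before surgery to the unexposed group instead of handing them to the surgery group, which removes the immortal time.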

Related: defining a baseline covariate from post-index information is the same error in disguise (e.g. the OSDC diabetes type, see OSDC). Whenever a variable is built from the future, ask whether you are conditioning on the person having survived to see it. Background: Hernán & Robins, What If, §3.6 (target trial, time zero).

11. The most common error messages and what they mean

R’s error messages are short and technical - here are the ones you most often encounter in a DST workflow, translated into what they actually mean:

Error message	Typical cause	Solution
`Error: Column 'pnr' not found`	`rename_with(tolower)` is missing	Add `%>% rename_with(tolower)` immediately after `read_register()` - see pitfall 3
`Error: object 'my_list' not found`	`!!` missing in `filter()` on a lazy connection	Write `filter(aar %in% !!my_list)` - see pitfall 8
`Error: could not find function "read_register"`	`library(fastreg)` missing	Add `library(fastreg)` at the top of the script
`non-numeric argument to binary operator`	Date column is `character`, not `Date`	`mutate(date = as.Date(date))` - see pitfall 4
`Error in filter.default(...)`	Filtering on a lazy object without `%>%`	Switch to `%>%` - see the pipe
`Error: Can't convert ... to ...`	Join on columns of different type (e.g. numeric vs. character)	Use `mutate(pnr = as.character(pnr))` to match types
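`object of type 'closure' is not subsettable`	You are subsetting a function: your object was never created (or was removed), so its name resolves to a built-in function (e.g. `data`, `df`, `c`)	Create the object first and use a unique variable name - avoid `data`, `df`, `c` as object names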

The fastest debugging flow - what to do step by step when you see a red error message - is described in Phase 7 - Seeing a red error message?.
