Outcomes
Extract outcome dates and censoring - filtered to your cohort
Under development. Structural outline - to be expanded. The pattern is the same as in Extract from LPR, now filtered to your cohort.
Your outcome is typically the date of the first event after index - e.g. a diagnosis (LPR). But to calculate follow-up time you also need the dates on which a person stops being at risk for the outcome: death (DODSAARS) and emigration (VNDS). You extract each date as its own table with pnr + date, filtered to the cohort.
Why dates other than the outcome itself?
In a time-to-event study you follow each person from the index date until the outcome occurs, or until the person can no longer have the outcome observed. A person who dies or emigrates disappears from the Danish registers, and you can no longer know whether they later developed the outcome. So you censor them on the death or emigration date. You extract these dates here so that in Phase 12 you can calculate correct follow-up time: the earliest of outcome, death, emigration and end of study.
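As a rough illustration of the rule above, the sketch below computes the end of follow-up as the earliest of outcome, death, emigration and end of study. The data are made up, and the column names (`index_date`, `outcome_date`, `death_date`, `emigration_date`) and the `study_end` value are assumptions for the example, not names from the registers.

```r
library(dplyr)
library(tibble)

study_end <- as.Date("2022-12-31")  # hypothetical end of study

# Toy cohort with dates already linked on; NA means the event never happened
cohort <- tibble(
  pnr             = c("A", "B", "C", "D"),
  index_date      = as.Date(c("2015-01-01", "2015-06-01", "2016-03-01", "2017-01-01")),
  outcome_date    = as.Date(c("2018-05-10", NA, NA, NA)),
  death_date      = as.Date(c(NA, "2019-02-01", NA, NA)),
  emigration_date = as.Date(c(NA, NA, "2020-07-15", NA))
)

followup <- cohort %>%
  mutate(
    # earliest of the four dates, ignoring missing ones
    end_date = pmin(outcome_date, death_date, emigration_date, study_end, na.rm = TRUE),
    # 1 if follow-up ended with the outcome, 0 if the person was censored
    event = as.integer(!is.na(outcome_date) & outcome_date == end_date),
    followup_days = as.numeric(end_date - index_date)
  )
```

Person A ends with the outcome, B and C are censored at death and emigration, and D is censored at the end of study.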
If a competing event (e.g. death) precludes your outcome rather than just ending follow-up, it is a competing risk. That is handled in the analysis, not in the extraction; see Competing risks.
The pattern
library(arrow)   # open_dataset()
library(dplyr)   # pipeline verbs
library(tibble)  # tibble()

# unique() removes duplicates, so you get one pnr per person - a clean list of the cohort's people
cohort_pnrs <- unique(readRDS("path/to/full_cohort.rds")$pnr) # your cohort from Phase 10

# Example: first dementia diagnosis after index (outcome)
outcome <- open_dataset("path/to/lpr_diag/") %>%
  rename_with(tolower) %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # ONLY your cohort
  # ... get the contact date from lpr_adm and filter on diagnosis codes - see Phase 9 ...
  collect() %>%
  group_by(pnr) %>% arrange(event_date) %>% slice(1) %>% ungroup() # keep only first event per person
saveRDS(outcome, "path/to/extract_outcome.rds")
group_by(pnr) + arrange() + slice(1) keeps only the first event per person: for an outcome definition you rarely need more than the first. The pattern is explained in Long ↔︎ wide format.
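A minimal sketch of the "first event after index" step on toy data, assuming the cohort carries a column called `index_date` (a hypothetical name) and that events on or before the index date should not count. Whether an event on the index date itself counts is a study-design decision.

```r
library(dplyr)
library(tibble)

# Toy index dates and events; all values are made up
index_dates <- tibble(
  pnr        = c("A", "B"),
  index_date = as.Date(c("2015-01-01", "2016-01-01"))
)
events <- tibble(
  pnr        = c("A", "A", "A", "B"),
  event_date = as.Date(c("2014-03-01", "2017-08-20", "2016-02-11", "2015-05-05"))
)

first_after_index <- events %>%
  inner_join(index_dates, by = "pnr") %>%
  filter(event_date > index_date) %>%          # drop events before (or on) index
  group_by(pnr) %>%
  arrange(event_date, .by_group = TRUE) %>%    # earliest event first within each person
  slice(1) %>%                                 # keep that first event
  ungroup() %>%
  select(pnr, event_date)
```

Here person A keeps the 2016 event (the 2014 one precedes index), and person B drops out of the outcome table because the only event is before index.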
Two kinds of joins - don’t confuse them. Some joins assemble one extract - e.g. LPR’s contact and diagnosis registers, joined to get pnr + date + diagnosis in one table (see Extract from LPR). That is not the same as linking the finished extract onto your cohort; that happens later, in Link your extracts, once all your extractions are done.
- Censoring: also extract date of death from DODSAARS (d_dodsdto) and emigration from VNDS (indud_kode == "U") - each as its own table, see Overview of registers.
- Save each extraction as its own .rds file. An .rds file can hold several columns (it does for covariates, for example) - for outcomes it is typically pnr + one date column. In Phase 12 you link them all onto the cohort and calculate follow-up time from the outcome, death and emigration dates.
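The two censoring extracts could look roughly like this. Small in-memory tibbles stand in for the registers so the sketch runs on its own; `d_dodsdto` and `indud_kode` come from the text above, while `haend_dato` as the VNDS date column and the output names `death_date` and `emigration_date` are assumptions to check against your own data.

```r
library(dplyr)
library(tibble)

cohort_tbl <- tibble(pnr = c("A", "B", "C"))  # toy cohort

# Toy stand-ins for DODSAARS and VNDS; person Z is not in the cohort
dodsaars <- tibble(
  pnr       = c("B", "Z"),
  d_dodsdto = as.Date(c("2019-02-01", "2018-01-01"))
)
vnds <- tibble(
  pnr        = c("C", "C", "A", "Z"),
  indud_kode = c("U", "I", "I", "U"),          # "U" marks emigration, as in the text
  haend_dato = as.Date(c("2020-07-15", "2021-01-10", "2010-05-01", "2017-03-03"))  # assumed column name
)

death <- dodsaars %>%
  semi_join(cohort_tbl, by = "pnr") %>%        # ONLY your cohort
  transmute(pnr, death_date = d_dodsdto)

emigration <- vnds %>%
  semi_join(cohort_tbl, by = "pnr") %>%
  filter(indud_kode == "U") %>%                # keep emigrations only
  transmute(pnr, emigration_date = haend_dato)

# Each table would then be saved on its own, e.g. with saveRDS()
```

A person can emigrate more than once, so this sketch keeps one row per emigration event; choosing the first emigration after index is left to the linking step.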
Your codes are not the truth - mind misclassification. A register diagnosis is a proxy for the real condition, with its own validity (sensitivity and positive predictive value, PPV). How wrong it is shapes your result:
- Non-differential misclassification: errors unrelated to exposure (the same code validity in both groups) usually bias the effect toward the null - a real effect looks weaker than it is.
- Differential misclassification: the outcome is captured more completely in one group (e.g. exposed people are in contact with the system more often and so get diagnosed more). This can bias the estimate in either direction and is the more dangerous kind.
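A small worked example of the non-differential case, with entirely hypothetical numbers: true risks of 10% and 5% (risk ratio 2), and an outcome code with 80% sensitivity and 95% specificity in both groups.

```r
# Hypothetical true risks in each group (true risk ratio = 2)
true_risk <- c(exposed = 0.10, unexposed = 0.05)

# Same code validity in both groups = non-differential misclassification
sens <- 0.80  # share of true cases that get the code
spec <- 0.95  # share of non-cases that correctly do not get the code

# Observed risk = true positives + false positives
observed_risk <- sens * true_risk + (1 - spec) * (1 - true_risk)

true_rr     <- true_risk[["exposed"]] / true_risk[["unexposed"]]
observed_rr <- observed_risk[["exposed"]] / observed_risk[["unexposed"]]

round(c(true = true_rr, observed = observed_rr), 2)
```

The observed risks become 12.5% and 8.75%, so the observed risk ratio is about 1.43 instead of 2: the effect is still there but looks weaker.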
Practical guards: use validated code definitions where they exist, look up the PPV of your codes, and run a sensitivity analysis with a stricter or looser definition. For Danish registers see the validity reviews, e.g. Schmidt et al. 2015, Clin Epidemiol (the National Patient Registry). Background: Hernán & Robins, What If, ch. 9 (measurement bias).
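One way to set up a sensitivity analysis with a stricter and a looser definition is sketched below on toy data. It assumes diagnosis codes in a column `c_diag` with a leading "D", and a column `c_diagtype` where "A" marks the primary diagnosis; check both against your own extract. The dementia pattern is only an illustration, not a validated definition.

```r
library(dplyr)
library(tibble)

# Toy diagnosis rows; values are made up
diag <- tibble(
  pnr        = c("A", "B", "C", "D"),
  c_diag     = c("DF001", "DG309", "DF03", "DI10"),
  c_diagtype = c("A", "B", "A", "A")
)

dementia_pattern <- "^D(F0[0-3]|G30)"  # illustrative codes only

# Looser definition: any diagnosis type
broad <- diag %>% filter(grepl(dementia_pattern, c_diag))

# Stricter definition: primary diagnoses only
strict <- broad %>% filter(c_diagtype == "A")

n_broad  <- n_distinct(broad$pnr)
n_strict <- n_distinct(strict$pnr)
```

Running the main analysis once per definition shows how much the estimate depends on the choice; in this toy data the looser definition finds three people and the stricter one two.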
Using a codelist (regex, %in%) is the easy part; defining which codes make up your condition is the hard part. Don’t reinvent it: use a published, ideally validated list, and document and share your own so others can reproduce it.
- ICD-10: e.g. mltc-codelists (ICD-10 codelists for many chronic conditions, University of Edinburgh) and the accompanying methodology paper (Communications Medicine 2025).
- ATC (medication): define drug groups from the WHO ATC/DDD index and published drug-class definitions. See Medication (ATC).
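Keeping the codelist as a small table with a description per code makes it easy to document and share. A minimal sketch, assuming three-character ICD-10 categories and register codes stored with a leading "D" in a column named `c_diag`; the list itself is only an example and has not been validated.

```r
library(dplyr)
library(tibble)

# Illustrative codelist; replace with a published, ideally validated one
codelist <- tribble(
  ~condition, ~icd10, ~description,
  "dementia", "F00",  "Dementia in Alzheimer disease",
  "dementia", "F01",  "Vascular dementia",
  "dementia", "F02",  "Dementia in other diseases classified elsewhere",
  "dementia", "F03",  "Unspecified dementia",
  "dementia", "G30",  "Alzheimer disease"
)

# Toy diagnosis rows
diag <- tibble(
  pnr    = c("A", "B", "C", "D"),
  c_diag = c("DF001", "DG309", "DI109", "DF03")
)

matched <- diag %>%
  # drop the assumed leading "D", then keep the three-character category
  mutate(icd10_3 = substr(sub("^D", "", c_diag), 1, 3)) %>%
  filter(icd10_3 %in% codelist$icd10)
```

Saving `codelist` next to the extraction script (for example as a CSV) lets others see exactly which codes defined the condition.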
See also
- Extract from LPR: the diagnosis pattern in detail
- Phase 12 - Assemble and prepare the dataset: join outcomes to the cohort