Extract variables

Extract outcomes and covariates - filtered to exactly your cohort

Published: July 2, 2026

You have built your cohort (Phase 10) - a table with pnr and index_date per person. Now you extract the variables you need: outcomes and covariates.

Note

Workflow: extract the variables you need one register at a time. Gather everything you need from BEF in one extraction, everything from LPR in another, and so on, and save each extraction as its own .rds file. Finally, in Phase 12 (Assemble and prepare the dataset), you link all the extractions into one finished analysis dataset.
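The workflow in this note can be sketched roughly as follows. This is a minimal illustration, not the Phase 12 procedure itself: the toy tables, the column names sex and first_diag_date, the file names, and the use of tempdir() are all hypothetical, and it assumes each extraction has already been reduced to one row per pnr.

```r
library(dplyr)

out_dir <- tempdir()  # hypothetical output folder; use your project folder instead

# Hypothetical cohort and two already-collected extractions (toy data)
kohort <- tibble(
  pnr        = c("001", "002", "003"),
  index_date = as.Date(c("2015-01-01", "2016-01-01", "2017-01-01"))
)
bef_extract <- tibble(pnr = c("001", "002", "003"), sex = c("F", "M", "F"))
lpr_extract <- tibble(
  pnr             = c("001", "003"),
  first_diag_date = as.Date(c("2015-06-01", "2018-02-10"))
)

# One .rds file per register extraction
saveRDS(bef_extract, file.path(out_dir, "bef_extract.rds"))
saveRDS(lpr_extract, file.path(out_dir, "lpr_extract.rds"))

# Later: read the extractions back and link them to the cohort on pnr
analysis <- kohort %>%
  left_join(readRDS(file.path(out_dir, "bef_extract.rds")), by = "pnr") %>%
  left_join(readRDS(file.path(out_dir, "lpr_extract.rds")), by = "pnr")
```

A left join from the cohort keeps every cohort member, so a person with no record in a register gets NA for that register's variables instead of dropping out of the dataset.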

Important

Always filter to your cohort before collect(). You only need data on the people in your cohort, so filter each extraction while it is still lazy (in Arrow/DuckDB). That way you never pull the entire population into R:

library(arrow)   # open_dataset()
library(dplyr)   # rename_with(), semi_join(), select(), collect(), tibble()

kohort      <- readRDS("path/to/full_cohort.rds")
cohort_pnrs <- unique(kohort$pnr)          # vector with ALL pnr (exposed + comparator cohort)

register <- open_dataset("path/to/register/") %>%
  rename_with(tolower) %>%                   # lower-case the column names so pnr matches
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%   # keep only the cohort's rows - pushed down into DuckDB, see Phase 5
  select(pnr, ...) %>%                       # select only the columns you need
  collect()                                  # ONLY now is data pulled into R

Filter and select() before collect() - the single most important rule for speed and memory (see Phase 5).


Which variables?

| Type | Page | Source |
|---|---|---|
| Outcomes (event date, censoring) | Outcomes | LPR (diagnoses), DODSAARS (death), VNDS (emigration) |
| Socioeconomics (education, income, employment) | Socioeconomic variables | UDDA, FAIK, AKM |
| Comorbidity (multimorbidity score) | Comorbidity | LPR + ready-made algorithm (NMI) |
| Medication (ATC exposure) | Medication (ATC) | LMDB |
| Demographics (age, sex) | covered in Phase 6 | BEF |

Each page shows the same pattern: open the register → filter to the cohort → select/derive the variable → collect() → save as .rds. Once all variables are extracted, you assemble them in Phase 12 - Assemble and prepare the dataset.
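The five steps of the pattern can be tried end to end on a toy register. This is a sketch under stated assumptions: the register is a small Parquet dataset written to a temporary folder, and the column names DIAG_CODE and ADMISSION_DATE, the derived variable, and the file name toy_extraction.rds are hypothetical placeholders, not real register fields.

```r
library(arrow)
library(dplyr)

# Hypothetical toy register written to disk so the sketch runs anywhere
register_dir <- file.path(tempdir(), "toy_register")
toy_register <- tibble(
  PNR            = c("001", "002", "003", "004"),
  DIAG_CODE      = c("A1", "B2", "A1", "C3"),
  ADMISSION_DATE = as.Date(c("2015-03-01", "2016-07-15", "2017-01-20", "2018-11-05"))
)
write_dataset(toy_register, register_dir, format = "parquet")

# Hypothetical cohort: two of the four people in the register
kohort      <- tibble(pnr = c("001", "003"))
cohort_pnrs <- unique(kohort$pnr)

extraction <- open_dataset(register_dir) %>%              # 1. open the register (lazy)
  rename_with(tolower) %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%    # 2. filter to the cohort
  select(pnr, diag_code, admission_date) %>%              # 3. select the variables
  collect() %>%                                           # 4. pull the result into R
  mutate(has_a1 = diag_code == "A1")                      #    derive a variable (here after collect())

out_file <- file.path(tempdir(), "toy_extraction.rds")
saveRDS(extraction, out_file)                             # 5. save as .rds
```

Only the two cohort members' rows reach R. The derivation is done after collect() here because the result is already small; on a real register, derive before collect() when the expression is supported by the lazy backend.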

