Extract variables
Extract outcomes and covariates - filtered to exactly your cohort
You have built your cohort (Phase 10) - a table with pnr and index_date per person. Now you extract the variables you need: outcomes and covariates.
How the workflow goes: You extract the variables register by register: for example, everything you need from BEF in one extraction, everything you need from LPR in another, and so on. Each extraction is saved as its own .rds file. Finally, in Phase 12 - Assemble and prepare the dataset, you link all the extractions together into one finished analysis dataset.
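To make the end state of this workflow concrete, here is a minimal sketch of how per-register .rds files might later be linked onto the cohort. It is an illustration only: the file names (bef.rds, lpr.rds), the columns sex and n_contacts, and the toy values are hypothetical, and it assumes each extract holds one row per pnr. Phase 12 describes the actual procedure.

```r
library(dplyr)

# Hypothetical stand-ins: a tiny cohort and two per-register extracts
extract_dir <- tempfile("extracts_")
dir.create(extract_dir)

cohort <- tibble(
  pnr        = c("001", "002", "003"),
  index_date = as.Date(c("2015-01-01", "2016-06-15", "2017-03-31"))
)

saveRDS(tibble(pnr = c("001", "002", "003"), sex = c(1L, 2L, 2L)),
        file.path(extract_dir, "bef.rds"))
saveRDS(tibble(pnr = c("001", "003"), n_contacts = c(4L, 1L)),
        file.path(extract_dir, "lpr.rds"))

# Start from the cohort and left-join each saved extract on pnr,
# so every cohort member is kept even when a register has no row for them
extract_files <- file.path(extract_dir, c("bef.rds", "lpr.rds"))
analysis <- Reduce(
  function(acc, f) left_join(acc, readRDS(f), by = "pnr"),
  extract_files,
  cohort
)
```

A left join from the cohort is used here so that people without a record in a given register appear with NA rather than being dropped; whether that is the right choice depends on the variable.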
Always filter to your cohort - before collect(). You only need data on the people in the cohort you have built. Filter each extraction to that cohort while the query is still lazy (in Arrow/DuckDB), so that you never pull the entire population into R:
library(arrow)
library(dplyr)
kohort <- readRDS("path/to/full_cohort.rds")
cohort_pnrs <- unique(kohort$pnr) # vector with ALL pnr (exposed + comparator cohort)
register <- open_dataset("path/to/register/") %>%
  rename_with(tolower) %>% # lower-case all column names
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # keep only the cohort's rows - evaluated lazily in Arrow/DuckDB, see Phase 5
  select(pnr, ...) %>% # select only the columns you need
  collect() # ONLY now is data pulled into R
Filter and select() before collect() - the single most important rule for speed and memory (see Phase 5).
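The effect of the rule can be checked on a toy example. The sketch below writes a small, entirely hypothetical register (the columns PNR, KOEN and FOED_DAG and their values are made up for illustration) to a temporary Parquet directory, builds the same pipeline, and inspects the object before and after collect(). It assumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Toy "register" with hypothetical columns, written to a temporary directory
register_dir <- tempfile("toy_register_")
dir.create(register_dir)
write_parquet(
  tibble(
    PNR      = c("001", "002", "003", "004"),
    KOEN     = c(1L, 2L, 2L, 1L),
    FOED_DAG = as.Date(c("1970-01-01", "1980-05-17", "1990-09-30", "2000-12-24"))
  ),
  file.path(register_dir, "part-0.parquet")
)

cohort_pnrs <- c("002", "004") # hypothetical cohort

# Build the query: nothing is read into R at this point
lazy_query <- open_dataset(register_dir) %>%
  rename_with(tolower) %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, koen)

# The result so far is a query object, not a data frame
is_lazy <- inherits(lazy_query, "arrow_dplyr_query")

# collect() runs the query and returns only the cohort's rows and chosen columns
extract <- collect(lazy_query)
```

In this sketch, extract contains two rows and two columns, whereas calling collect() first and filtering afterwards would have loaded all four rows and every column into memory before discarding most of them.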
Which variables?
| Type | Page | Source |
|---|---|---|
| Outcomes (event date, censoring) | Outcomes | LPR (diagnoses), DODSAARS (death), VNDS (emigration) |
| Socioeconomics (education, income, employment) | Socioeconomic variables | UDDA, FAIK, AKM |
| Comorbidity (multimorbidity score) | Comorbidity | LPR + ready-made algorithm (NMI) |
| Medication (ATC exposure) | Medication (ATC) | LMDB |
| Demographics (age, sex) | covered in Phase 6 | BEF |
Each page shows the same pattern: open the register → filter to the cohort → select/derive the variable → collect() → save as .rds. Once all variables are extracted, you assemble them in Phase 12 - Assemble and prepare the dataset.
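Because the pattern is the same for every register, it can be wrapped in a small function. The following is a sketch under stated assumptions, not a prescribed implementation: extract_register() is a hypothetical helper name, the toy register and its columns (PNR, YEAR, INCOME) are invented, and the derive step is left out for brevity.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: runs the pattern for one register and returns the .rds path
extract_register <- function(register_path, cohort_pnrs, columns, out_file) {
  extract <- open_dataset(register_path) %>%              # open the register
    rename_with(tolower) %>%
    semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%  # filter to the cohort
    select(all_of(columns)) %>%                           # select the variables
    collect()                                             # pull into R
  saveRDS(extract, out_file)                              # save as .rds
  invisible(out_file)
}

# Usage on a toy register with made-up columns
register_dir <- tempfile("toy_income_")
dir.create(register_dir)
write_parquet(
  tibble(
    PNR    = c("001", "002", "003"),
    YEAR   = c(2015L, 2015L, 2015L),
    INCOME = c(310000, 285000, 402000)
  ),
  file.path(register_dir, "part-0.parquet")
)

out <- extract_register(
  register_path = register_dir,
  cohort_pnrs   = c("002", "003"),
  columns       = c("pnr", "income"),
  out_file      = file.path(tempdir(), "toy_income.rds")
)
saved <- readRDS(out)
```

Keeping one call and one .rds file per register matches the workflow described above and makes it easy to rerun a single extraction without touching the others.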
See also
- Phase 10 - Build your study population: the cohort you filter to
- Phase 12 - Assemble and prepare the dataset: join all extracts into one dataset
- Algorithms & special packages: ready-made tools (OSDC, NMI) to derive variables