erDiagram
full_cohort ||--o| extract_dementia : pnr
full_cohort ||--o| extract_bef : pnr
full_cohort ||--o| extract_nmi : pnr
full_cohort {
string pnr PK
date index_date
int exposed
}
extract_dementia {
string pnr FK
date event_date
}
extract_bef {
string pnr FK
int age
int sex
}
extract_nmi {
string pnr FK
int nmi_score
}
Link your extracts
Joins, bind_rows() and calculating follow-up time - assemble your extracts into one dataset
Your extracts from Phase 6, Extract from LPR and Phase 11 are stored as separate RDS files. Once each extract is down to one row per person (Long ↔︎ wide format), you join them on pnr into one analysis-ready dataset. Finally, you calculate follow-up time and the event variable.
The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.
First get each extract down to one row per person. left_join() only behaves predictably when both tables have one row per pnr. Many register extracts have multiple rows per person (long format). Reduce them to one row per person with group_by() + slice() or pivot_wider() - explained in Long ↔︎ wide format. This page assumes your extracts are already in that shape.
Joins - link two tables
In register-based research you almost always work with data from two or more tables that need to be linked. All joins in dplyr take two tables and one key column they share (e.g. pnr).
When you join your finished person-level extracts onto the cohort, the key is usually pnr. Inside a register it is other keys: joining contact and diagnosis in LPR, for instance, happens on recnum or dw_ek_kontakt, not pnr (see the overview of LPR join keys). The ER diagram below shows the common join on pnr - extracts joined onto the cohort:
pnr is the primary key (PK) in the cohort and a foreign key (FK) in each extract. The symbol ||--o| is standard ER notation (“crow’s foot”): || = exactly one, o| = zero or one. So the line reads: each person in the cohort matches zero or one row in the extract.
This describes the shape you must ensure before the join - not something left_join() checks or fixes. If an extract has several rows per pnr, left_join() happily proceeds and duplicates the cohort rows (row explosion). That is why the callout above tells you to get each extract down to one row per person first, and why check 1 in Validate your join afterwards (did the row count grow?).
The joins you will use
library(dplyr) # left_join, inner_join, semi_join, anti_join
# left_join: keep ALL rows from x, add columns from y
# Persons in x without a match in y get NA.
result <- cohort %>%
left_join(outcome, by = "pnr") # all cohort members retained; outcome = NA if no event
# inner_join: keep only rows with a match in BOTH tables
result <- lpr_adm %>%
inner_join(lpr_diag, by = "recnum") # only contacts with at least one diagnosis
# right_join: keep all rows from y (rarely used)
# full_join: keep all rows from both (rarely used)Filter joins - change rows only, not columns
# semi_join: keep rows in x THAT HAVE a match in y
bef_in_cohort <- bef %>%
semi_join(cohort, by = "pnr") # only BEF rows for cohort members
# anti_join: keep rows in x THAT DO NOT HAVE a match in y
bef_not_in_cohort <- bef %>%
anti_join(cohort, by = "pnr") # only BEF rows for non-cohort members
# also used for diagnostics - see "Validate your join" belowOverview: all joins in one table
| Function | Keeps rows from | Used for |
|---|---|---|
left_join(x, y) |
All from x | Add outcomes/covariates to cohort - NA if no match |
inner_join(x, y) |
Only matches in both | Join contacts and diagnoses - only want rows with both |
right_join(x, y) |
All from y | Rarely |
full_join(x, y) |
All from both | Rarely |
semi_join(x, y) |
Only x with match in y | Filter a register to only cohort members |
anti_join(x, y) |
Only x without match in y | Find everyone in BEF NOT in the cohort |
Join with multiple keys
Multiple keys only become relevant when at least one of the tables has several rows per person - otherwise pnr alone is enough. With multiple keys all the listed columns must agree before two rows count as a match (pnr and aar, not pnr or aar).
The typical case is panel data: registers with one row per person per year. BEF, AKM, FAIK and so on have a pnr row for every calendar year. Join two of them on pnr alone and each of the person’s rows is paired with all of the person’s rows in the other table: with 5 years in BEF and 5 in AKM you get every combination, 5 × 5 = 25 rows instead of 5. Add the year as an extra key and those 25 are filtered down to the 5 where the year also matches - 2018 against 2018:
# Join on pnr AND year: a row matches only when BOTH columns are equal
bef_akm <- bef %>%
left_join(akm, by = c("pnr", "aar")) # 2018 BEF joins only to 2018 AKMIf instead you join two extracts that each have one row per person (e.g. finished person-level extracts onto the cohort), pnr alone is enough - there is no extra dimension to keep track of.
Validate your join
Joins rarely fail with an error message - but they can produce silently wrong results. Always check these three things after a left_join():
# Before join: what is nrow?
nrow(cohort) # e.g. 4,823
# Join:
cohort2 <- cohort %>% left_join(outcome, by = "pnr")
# Check 1: did the row count increase? Duplicated keys in outcome give extra rows.
nrow(cohort2) # should still be 4,823
# Check 2: how many got a match?
sum(!is.na(cohort2$event_date)) # number with outcome
sum(is.na(cohort2$event_date)) # number without - expected censored?
# Check 3: who did NOT match? (diagnostic - not necessarily an error)
missing <- anti_join(cohort, outcome, by = "pnr") # pnr's not found in outcome
nrow(missing) # 0 = all matched; > 0 = investigateanti_join() as diagnostics anti_join(cohort, outcome, by = "pnr") returns the cohort members that are not found in the outcome table. Use it for “who in my cohort has no death date record?” (all alive - expected) or “who is missing from register X?” (unexpected - investigate).
Stack tables vertically - bind_rows()
bind_rows() and left_join() solve two fundamentally different problems:
bind_rows() |
left_join() |
|
|---|---|---|
| What it does | Stacks tables vertically - adds rows | Combines tables horizontally - adds columns |
| Requires | That the columns have the same names | That the tables share one common key column |
| Result | More rows, same number of columns | Same number of rows, more columns |
| Matches on key? | No - all rows from both tables | Yes - match on pnr or other key |
| Typical use | LPR2 + LPR3 → one combined register | Link outcome or covariate to cohort table |
bind_rows() is not a join. It does not look at pnr and does not match. It simply places table 2’s rows below table 1’s rows - like pasting two Excel sheets together vertically.
# bind_rows: LPR2 and LPR3 have the same columns (pnr, date_contact, icd3)
# → stack them vertically into one combined diagnosis register
lpr2_dx # 45,000 rows - diagnoses up to March 2019
lpr3_dx # 32,000 rows - diagnoses from March 2019 onwards
alle_dx <- bind_rows(lpr2_dx, lpr3_dx) # 77,000 rows - both periods combinedColumns missing in one table (e.g. a column that only exists in LPR3) are automatically filled with NA for rows from the other table.
When do you use which? bind_rows() - when you have two versions of the same register (LPR2 + LPR3, exposed + comparators) and want to merge them into one. left_join() - when you want to attach a new variable (outcome, date of death, age) to your cohort table.
Practical examples: join date of death and emigration to the cohort
Date of death (DODSAARS)
#=====================================================
# Attach date of death (DODSAARS) to the cohort
#=====================================================
# Cohort data (one row per person):
kohort <- readRDS("sti/til/full_cohort.rds")
# Death dates from DODSAARS:
dodsaars <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
rename_with(tolower) %>% # standardise column names
semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
select(pnr, death_date = d_dodsdto) %>% # rename d_dodsdto to death_date
collect() # fetch into R
# Join: all cohort members retained; the living get death_date = NA
cohort_with_death <- kohort %>%
left_join(dodsaars, by = "pnr") # attach death date - NA = still aliveEmigration date (VNDS)
Emigration censors just like death - the person leaves the study on the day they emigrate. VNDS contains one row per migration event; use only "U" (emigration) and take the first date per person.
#=====================================================
# Attach emigration date (VNDS) to the cohort
#=====================================================
# Emigration dates from VNDS:
vnds <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
rename_with(tolower) %>% # standardise column names
semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
filter(indud_kode == "U") %>% # "U" = udvandring/emigration (VNDS holds both in- and out-migration)
select(pnr, emigration_date = haend_dato) %>% # rename haend_dato to emigration_date
collect() %>% # fetch into R
group_by(pnr) %>% # group to find first emigration
arrange(emigration_date) %>% # oldest date first
slice(1) %>% # first emigration per person
ungroup() # release grouping
# Join: all cohort members retained; non-emigrants get emigration_date = NA
cohort_with_emigration <- kohort %>%
left_join(vnds, by = "pnr") # NA = never emigrated during the study periodCalculate follow-up time and event variable
Before you can analyse, each cohort member needs a censoring date (when follow-up ends) and an event variable (did they experience the outcome?).
The censoring date is the earliest of: event date, date of death, emigration date and end of study period.
#=====================================================
# Calculate follow-up time and event variable
#=====================================================
study_end <- as.Date("2024-12-31") # replace with your actual study end date
# format: "yyyy-mm-dd" (ISO 8601 - R's standard)
kohort <- kohort %>%
mutate(
# Censoring date = the earliest of all possible stopping reasons
censor_date = pmin(event_date, death_date, emigration_date,
study_end, na.rm = TRUE),
# Follow-up time in years
followup_years = as.numeric(censor_date - index_date) / 365.25,
# Event variable: 1 = outcome occurred before censoring, 0 = censored
event = as.integer(!is.na(event_date) & event_date <= censor_date)
)pmin() compares vectors position by position and returns the smallest value per person - it is the vectorised version of min(). na.rm = TRUE ensures that a missing death date (= alive) does not make the censoring date NA.
Competing risks? If a competing event (typically death) must be handled separately (see Time-to-event), then instead of a binary event (0/1) build a status with three values - what happened first:
#=====================================================
# Competing risks: a status with three values
#=====================================================
kohort <- kohort %>%
mutate(
# status_code: which date "won" the pmin() above?
status_code = case_when(
!is.na(event_date) & event_date <= censor_date ~ 1, # outcome first
!is.na(death_date) & death_date <= censor_date ~ 2, # death first (competing risk)
TRUE ~ 0 # otherwise censored (emigration/study end)
)
)Assemble the final analysis dataset
By now your cohort is built in Phase 10: a table with one row per person, pnr and index_date (and exposed/case if you have a comparison group). Your outcomes, covariates and censoring variables are extracted as separate RDS files (Phase 9, Phase 11). This last step joins them onto the cohort with left_join(), calculates follow-up time, and keeps only the columns the analysis needs.
End with select() - keep only the columns the analysis requires.
cohort_final <- kohort %>%
select(pnr, index_date, censor_date, followup_years, event,
alder, koen, nmi_score, occupation_cat, education_cat, income_cat)
saveRDS(cohort_final, "sti/til/analysis_dataset.rds") # save the final datasetComplete recipe - from cohort to analysis-ready dataset
#=====================================================
# Complete recipe: cohort -> analysis-ready dataset
#=====================================================
#-----------------------------------------------------
# 0. Load your cohort (pnr + index_date, from Phase 6/9)
#-----------------------------------------------------
kohort <- readRDS("sti/til/full_cohort.rds") # one row per person
#-----------------------------------------------------
# 1. Exclude prevalent cases (see Phase 9)
#-----------------------------------------------------
# ... cohort_clean <- kohort %>% anti_join(prevalent, by = "pnr")
#-----------------------------------------------------
# 2. Link outcome (e.g. first dementia diagnosis after index)
#-----------------------------------------------------
outcome <- readRDS("sti/til/extract_dementia.rds") # pnr + event_date (NA = no event)
kohort <- kohort_clean %>% left_join(outcome, by = "pnr")
#-----------------------------------------------------
# 3. Link censoring: death date and emigration
#-----------------------------------------------------
deaths <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
rename_with(tolower) %>%
semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%
select(pnr, death_date = d_dodsdto) %>% collect()
vnds_data <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
rename_with(tolower) %>%
semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%
filter(indud_kode == "U") %>%
select(pnr, emigration_date = haend_dato) %>% collect() %>%
group_by(pnr) %>% arrange(emigration_date) %>% slice(1) %>% ungroup()
kohort <- kohort %>%
left_join(deaths, by = "pnr") %>%
left_join(vnds_data, by = "pnr")
#-----------------------------------------------------
# 4. Calculate follow-up time and event variable
#-----------------------------------------------------
study_end <- as.Date("2024-12-31")
kohort <- kohort %>%
mutate(
censor_date = pmin(event_date, death_date, emigration_date, study_end, na.rm = TRUE),
followup_years = as.numeric(censor_date - index_date) / 365.25,
event = as.integer(!is.na(event_date) & event_date <= censor_date)
)
#-----------------------------------------------------
# 5. Link covariates (demographic, SES, comorbidity)
#-----------------------------------------------------
bef_data <- readRDS("sti/til/extract_bef.rds") # age, sex from BEF
ses_data <- readRDS("sti/til/extract_ses.rds") # education, income, employment
nmi_data <- readRDS("sti/til/extract_nmi.rds") # NMI score
kohort <- kohort %>%
left_join(bef_data, by = "pnr") %>%
left_join(ses_data, by = "pnr") %>%
left_join(nmi_data, by = "pnr")
#-----------------------------------------------------
# 6. Keep ONLY the columns the analysis requires
#-----------------------------------------------------
cohort_final <- kohort %>%
select(pnr, index_date, censor_date, followup_years, event,
alder, koen, nmi_score, occupation_cat, education_cat, income_cat)
saveRDS(cohort_final, "sti/til/analysis_dataset.rds") # save the final dataset
nrow(cohort_final) # verify number of persons
names(cohort_final) # verify column namesNext steps
You now have one analysis-ready dataset with one row per person. The next step is the analysis:
See also
- Long ↔︎ wide format: get each extract down to one row per person
- Functions: overview: all functions for transforming data
- Extract from LPR: practical example of long register data and inner_join
Further depth (in English):
- Joining data in The Epidemiologist R Handbook.
- Joins in R for Data Science: keys, mutating vs. filtering joins and what happens when a key has duplicates.