Link your extracts

Joins, bind_rows() and calculating follow-up time - assemble your extracts into one dataset

Published

July 2, 2026

Your extracts from Phase 6, Extract from LPR and Phase 11 are stored as separate RDS files. Once each extract is down to one row per person (Long ↔︎ wide format), you join them on pnr into one analysis-ready dataset. Finally, you calculate follow-up time and the event variable.

Note

The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.

Important

First get each extract down to one row per person. left_join() only behaves predictably when both tables have one row per pnr. Many register extracts have multiple rows per person (long format). Reduce them to one row per person with group_by() + slice() or pivot_wider() - explained in Long ↔︎ wide format. This page assumes your extracts are already in that shape.


Validate your join

Joins rarely fail with an error message - but they can produce silently wrong results. Always check these three things after a left_join():

# Before join: what is nrow?
nrow(cohort)                                   # e.g. 4,823

# Join:
cohort2 <- cohort %>% left_join(outcome, by = "pnr")

# Check 1: did the row count increase? Duplicated keys in outcome give extra rows.
nrow(cohort2)                                  # should still be 4,823

# Check 2: how many got a match?
sum(!is.na(cohort2$event_date))                # number with outcome
sum(is.na(cohort2$event_date))                 # number without - expected censored?

# Check 3: who did NOT match? (diagnostic - not necessarily an error)
missing <- anti_join(cohort, outcome, by = "pnr")   # pnr's not found in outcome
nrow(missing)                                  # 0 = all matched; > 0 = investigate
Tip

anti_join() as diagnostics anti_join(cohort, outcome, by = "pnr") returns the cohort members that are not found in the outcome table. Use it for “who in my cohort has no death date record?” (all alive - expected) or “who is missing from register X?” (unexpected - investigate).


Stack tables vertically - bind_rows()

bind_rows() and left_join() solve two fundamentally different problems:

bind_rows() left_join()
What it does Stacks tables vertically - adds rows Combines tables horizontally - adds columns
Requires That the columns have the same names That the tables share one common key column
Result More rows, same number of columns Same number of rows, more columns
Matches on key? No - all rows from both tables Yes - match on pnr or other key
Typical use LPR2 + LPR3 → one combined register Link outcome or covariate to cohort table

bind_rows() is not a join. It does not look at pnr and does not match. It simply places table 2’s rows below table 1’s rows - like pasting two Excel sheets together vertically.

# bind_rows: LPR2 and LPR3 have the same columns (pnr, date_contact, icd3)
# → stack them vertically into one combined diagnosis register
lpr2_dx    # 45,000 rows - diagnoses up to March 2019
lpr3_dx    # 32,000 rows - diagnoses from March 2019 onwards

alle_dx <- bind_rows(lpr2_dx, lpr3_dx)   # 77,000 rows - both periods combined

Columns missing in one table (e.g. a column that only exists in LPR3) are automatically filled with NA for rows from the other table.

Note

When do you use which? bind_rows() - when you have two versions of the same register (LPR2 + LPR3, exposed + comparators) and want to merge them into one. left_join() - when you want to attach a new variable (outcome, date of death, age) to your cohort table.


Practical examples: join date of death and emigration to the cohort

Date of death (DODSAARS)

#=====================================================
# Attach date of death (DODSAARS) to the cohort
#=====================================================
# Cohort data (one row per person):
kohort <- readRDS("sti/til/full_cohort.rds")

# Death dates from DODSAARS:
dodsaars <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
  rename_with(tolower) %>%               # standardise column names
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  select(pnr, death_date = d_dodsdto) %>%   # rename d_dodsdto to death_date
  collect()                              # fetch into R

# Join: all cohort members retained; the living get death_date = NA
cohort_with_death <- kohort %>%
  left_join(dodsaars, by = "pnr")       # attach death date - NA = still alive

Emigration date (VNDS)

Emigration censors just like death - the person leaves the study on the day they emigrate. VNDS contains one row per migration event; use only "U" (emigration) and take the first date per person.

#=====================================================
# Attach emigration date (VNDS) to the cohort
#=====================================================
# Emigration dates from VNDS:
vnds <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
  rename_with(tolower) %>%                       # standardise column names
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(indud_kode == "U") %>%                  # "U" = udvandring/emigration (VNDS holds both in- and out-migration)
  select(pnr, emigration_date = haend_dato) %>%  # rename haend_dato to emigration_date
  collect() %>%                                  # fetch into R
  group_by(pnr) %>%                              # group to find first emigration
  arrange(emigration_date) %>%                   # oldest date first
  slice(1) %>%                                   # first emigration per person
  ungroup()                                      # release grouping

# Join: all cohort members retained; non-emigrants get emigration_date = NA
cohort_with_emigration <- kohort %>%
  left_join(vnds, by = "pnr")                    # NA = never emigrated during the study period

Calculate follow-up time and event variable

Before you can analyse, each cohort member needs a censoring date (when follow-up ends) and an event variable (did they experience the outcome?).

The censoring date is the earliest of: event date, date of death, emigration date and end of study period.

#=====================================================
# Calculate follow-up time and event variable
#=====================================================
study_end <- as.Date("2024-12-31")   # replace with your actual study end date
                                      # format: "yyyy-mm-dd" (ISO 8601 - R's standard)

kohort <- kohort %>%
  mutate(
    # Censoring date = the earliest of all possible stopping reasons
    censor_date = pmin(event_date, death_date, emigration_date,
                       study_end, na.rm = TRUE),

    # Follow-up time in years
    followup_years = as.numeric(censor_date - index_date) / 365.25,

    # Event variable: 1 = outcome occurred before censoring, 0 = censored
    event = as.integer(!is.na(event_date) & event_date <= censor_date)
  )
Note

pmin() compares vectors position by position and returns the smallest value per person - it is the vectorised version of min(). na.rm = TRUE ensures that a missing death date (= alive) does not make the censoring date NA.

Competing risks? If a competing event (typically death) must be handled separately (see Time-to-event), then instead of a binary event (0/1) build a status with three values - what happened first:

#=====================================================
# Competing risks: a status with three values
#=====================================================
kohort <- kohort %>%
  mutate(
    # status_code: which date "won" the pmin() above?
    status_code = case_when(
      !is.na(event_date) & event_date <= censor_date ~ 1,   # outcome first
      !is.na(death_date) & death_date <= censor_date ~ 2,   # death first (competing risk)
      TRUE                                           ~ 0     # otherwise censored (emigration/study end)
    )
  )

Assemble the final analysis dataset

By now your cohort is built in Phase 10: a table with one row per person, pnr and index_date (and exposed/case if you have a comparison group). Your outcomes, covariates and censoring variables are extracted as separate RDS files (Phase 9, Phase 11). This last step joins them onto the cohort with left_join(), calculates follow-up time, and keeps only the columns the analysis needs.

Note

Cohort construction - including the matching - belongs to Phase 10. Here we assume the cohort already exists; this page only assembles the analysis dataset around it. Design choices (comparator, immortal time, matching ratio) are decided back in Phase 1.

Important

End with select() - keep only the columns the analysis requires.

cohort_final <- kohort %>%
  select(pnr, index_date, censor_date, followup_years, event,
         alder, koen, nmi_score, occupation_cat, education_cat, income_cat)

saveRDS(cohort_final, "sti/til/analysis_dataset.rds")   # save the final dataset
Complete recipe - from cohort to analysis-ready dataset
#=====================================================
# Complete recipe: cohort -> analysis-ready dataset
#=====================================================

#-----------------------------------------------------
# 0. Load your cohort (pnr + index_date, from Phase 6/9)
#-----------------------------------------------------
kohort <- readRDS("sti/til/full_cohort.rds")   # one row per person

#-----------------------------------------------------
# 1. Exclude prevalent cases (see Phase 9)
#-----------------------------------------------------
# ... cohort_clean <- kohort %>% anti_join(prevalent, by = "pnr")

#-----------------------------------------------------
# 2. Link outcome (e.g. first dementia diagnosis after index)
#-----------------------------------------------------
outcome  <- readRDS("sti/til/extract_dementia.rds")   # pnr + event_date (NA = no event)
kohort <- kohort_clean %>% left_join(outcome, by = "pnr")

#-----------------------------------------------------
# 3. Link censoring: death date and emigration
#-----------------------------------------------------
deaths <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
  rename_with(tolower) %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%
  select(pnr, death_date = d_dodsdto) %>% collect()

vnds_data <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
  rename_with(tolower) %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%
  filter(indud_kode == "U") %>%
  select(pnr, emigration_date = haend_dato) %>% collect() %>%
  group_by(pnr) %>% arrange(emigration_date) %>% slice(1) %>% ungroup()

kohort <- kohort %>%
  left_join(deaths,    by = "pnr") %>%
  left_join(vnds_data, by = "pnr")

#-----------------------------------------------------
# 4. Calculate follow-up time and event variable
#-----------------------------------------------------
study_end <- as.Date("2024-12-31")
kohort <- kohort %>%
  mutate(
    censor_date    = pmin(event_date, death_date, emigration_date, study_end, na.rm = TRUE),
    followup_years = as.numeric(censor_date - index_date) / 365.25,
    event          = as.integer(!is.na(event_date) & event_date <= censor_date)
  )

#-----------------------------------------------------
# 5. Link covariates (demographic, SES, comorbidity)
#-----------------------------------------------------
bef_data <- readRDS("sti/til/extract_bef.rds")    # age, sex from BEF
ses_data <- readRDS("sti/til/extract_ses.rds")    # education, income, employment
nmi_data <- readRDS("sti/til/extract_nmi.rds")    # NMI score

kohort <- kohort %>%
  left_join(bef_data, by = "pnr") %>%
  left_join(ses_data, by = "pnr") %>%
  left_join(nmi_data, by = "pnr")

#-----------------------------------------------------
# 6. Keep ONLY the columns the analysis requires
#-----------------------------------------------------
cohort_final <- kohort %>%
  select(pnr, index_date, censor_date, followup_years, event,
         alder, koen, nmi_score, occupation_cat, education_cat, income_cat)

saveRDS(cohort_final, "sti/til/analysis_dataset.rds")   # save the final dataset
nrow(cohort_final)                                        # verify number of persons
names(cohort_final)                                       # verify column names

Next steps

You now have one analysis-ready dataset with one row per person. The next step is the analysis:

Phase 13 - Analysis

See also

TipFurther information

Further depth (in English):

  • Joining data in The Epidemiologist R Handbook.
  • Joins in R for Data Science: keys, mutating vs. filtering joins and what happens when a key has duplicates.
Back to top