Assemble and prepare the dataset
Joins, pivots and preparing your analysis dataset
You arrive at this phase with separate RDS files: your cohort from Phase 10 and the variables you extracted in Phase 11 - outcomes, socioeconomics and comorbidity (plus demographics from Phase 6 if relevant). You leave with one analysis-ready dataset ready for Phase 13 - Analysis.
In short: Reshape each extract to one row per person (with group_by() + slice() or pivot_wider()), then join them on pnr with left_join() into one analysis-ready dataset.
Before you can analyse, you need to:
- Shape each extract into one row per person (most registers are in long format with many rows per person)
- Assemble them into one analysis-ready dataset with
left_join()
That is what this phase is about. The technique for step 1 is either group_by() + slice() or pivot_wider() - the choice depends on what you want out of the register. Step 2 is always left_join().
The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.
What is an analysis-ready dataset?
An analysis-ready dataset has one row per person and one column per variable - what we call wide format:
pnr |
index_date |
censor_date |
followup_years |
event |
age |
sex |
nmi_score |
|---|---|---|---|---|---|---|---|
| 001 | 2015-03-01 | 2022-07-14 | 7.4 | 0 | 58 | 2 | 12 |
| 002 | 2016-11-20 | 2019-11-20 | 3.0 | 1 | 63 | 1 | 7 |
The example uses a cohort study with exposed, event and followup_years. For a prevalence study the column structure will look different - typically without follow-up time, but with a cross-sectional date and covariates measured at that point. The techniques in this phase (joins, pivots, group_by() + slice()) are the same regardless of study design.
The registers you extract from - LPR, LMDB, BEF - are almost always in long format with many rows per person. This phase shows how to reshape and assemble them.
What have you already saved?
Each extraction from Phases 6, 9 and 11 etc. should be saved as its own RDS file - one register, one topic. Example of what you typically have:
| File | Contents | Key columns |
|---|---|---|
full_cohort.rds |
Cohort with index date | pnr, index_date, exposed |
extract_dementia.rds |
First dementia diagnosis | pnr, event_date (NA = none) |
extract_deaths.rds |
Date of death | pnr, death_date (NA = alive) |
extract_emigration.rds |
First emigration | pnr, emigration_date (NA = not emigrated) |
extract_bef.rds |
Demographics at index | pnr, age, sex |
extract_nmi.rds |
Comorbidity at baseline | pnr, nmi_score |
extract_ses.rds |
Socioeconomic variables | pnr, education, income, occupation |
The goal of this phase is to assemble them into one dataset with left_join() - one new column per file. The cohort is the backbone: each extract is joined on via pnr and adds its columns - all cohort members are kept, and people with no match get NA.
Assemble only what you need. It is tempting to keep all columns “just in case”. Do not. Always end with select() and explicitly choose the columns the analysis requires. A dataset with only the necessary columns is far easier to debug and understand.
Two tools for shaping data
The two central operations in this phase are pivots and joins - they sound similar but do fundamentally different things:
| Pivots | Joins | |
|---|---|---|
| What it does | Reshapes one table - changes the form, not the content | Combines two tables - adds columns from another table |
| Input | One table | Two tables with a shared key (typically pnr) |
| Output | Same data, new shape | Wider table with columns from both |
| Used for | Long → wide format (pivot_wider) or wide → long (pivot_longer) |
Assemble cohort + outcomes + covariates into one dataset |
Rule of thumb: Use pivot when you want to change the shape of one table. Use join when you want to combine two tables.
Derived variables: create new columns
Besides reshaping and joining, you almost always need to compute new variables from the ones you have. You do this with mutate() (introduced in Phase 2): it adds a column computed from existing columns.
library(dplyr); library(lubridate)
df <- df %>%
mutate(
# Age at index: number of days between two dates / 365.25 (a year = 365.25 days, leap years)
age_at_index = as.numeric(index_date - foed_dag) / 365.25,
# BMI from height (m) and weight (kg)
bmi = weight_kg / (height_m^2),
# Binary event indicator: 1 if the outcome occurred before censoring, otherwise 0
event = if_else(!is.na(event_date) & event_date <= censor_date, 1L, 0L),
# Categorize a continuous variable into groups
age_group = case_when(
age_at_index < 50 ~ "<50",
age_at_index < 65 ~ "50-64",
TRUE ~ "65+" # everything else (here: 65 and over)
)
)A few habits:
- Compute a variable ONCE and reuse it - so e.g. age at index is the same in the matching and in Table 1.
if_else()for yes/no (two outcomes);case_when()for several categories. Use the same type in every branch (e.g.1L/0L, or text in all).- Follow-up time (
censor_date - index_date) is usually computed after the join, because it uses dates from several extracts - see Link your extracts.
mutate(), if_else() and case_when() are explained in the Function guide.
This phase’s pages
Two core pages take you from separate extracts to one analysis-ready dataset:
- Long ↔︎ wide format: get each extract down to one row per person. Understand the difference between long and wide format, and use either
group_by() + slice()(one representative value per person) orpivot_wider()(several time points side by side). - Link your extracts: join the extracts on
pnr(left_join(),inner_join(),semi_join(),anti_join()), stack two versions of the same register vertically (bind_rows()), validate the join, and calculate follow-up time and event variable. This is where the final dataset is assembled.
The order is deliberate: reshape first, then join. You cannot join predictably until each extract has one row per person.
Two optional add-on pages you only take if your project needs them:
- Missing data: how to understand and handle missing values, including multiple imputation.
- Time-varying variables: what to do when a variable changes during follow-up (and when you do not need to do anything).
Next steps
Start by getting each extract down to one row per person:
See also
- Functions: overview: all functions for transforming data
- Extract from LPR: practical example of long register data and inner_join
Further depth in The Epidemiologist R Handbook: