Long ↔︎ wide format
Get one extract down to one row per person - with group_by() + slice() or pivot
Before you can link your extracts together (Link your extracts), each extract needs to be in wide format with one row per person. The registers you extract from are almost always in long format with many rows per person. This page explains the difference and shows the two ways you reshape an extract: group_by() + slice() and pivot_wider() / pivot_longer().
The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.
Wide vs. long format
Imagine three persons with weight measurements at two time points.
Wide format - one row per person, one column per time point:
| pnr | weight_baseline | weight_6mo |
|---|---|---|
| 001 | 120 | 95 |
| 002 | 105 | 88 |
| 003 | 98 | 80 |
Long format - one row per observation, one column for time point and one for value:
| pnr | timepoint | weight |
|---|---|---|
| 001 | baseline | 120 |
| 001 | 6mo | 95 |
| 002 | baseline | 105 |
| 002 | 6mo | 88 |
| 003 | baseline | 98 |
| 003 | 6mo | 80 |
Both formats contain the same information - they are just arranged differently.
Which formats are where?
| Register | Format | Why |
|---|---|---|
| BEF | Long | One row per person per snapshot - quarterly since 2008, otherwise annual |
| LPR (lpr_adm + lpr_diag) | Long | One row per diagnosis per contact - one admission can have 10 diagnoses |
| LMDB | Long | One row per dispensed prescription - the same person can have hundreds |
| DODSAARS | Wide | One row per deceased person - already one-per-person |
| Your analysis dataset | Wide | One row per person - one column per covariate |
Registers are usually in long format. Your analysis dataset should typically be in wide format.
Two ways to get one row per person
The key choice is what you want out of the long register:
| You want… | Use | Result |
|---|---|---|
| One representative value per person (newest snapshot, first event, earliest date) | group_by() + slice() |
One row per person, same columns |
| Several time points side by side as separate columns | pivot_wider() |
One row per person, one new column per time point |
In register-based research, group_by() + slice() is by far the most common choice - you usually just want “the newest” or “the first” value per person. You use pivot_wider() when you explicitly want to compare several time points.
group_by() + slice() - one representative row
The pattern is always the same: group by pnr, sort so the desired row is on top, take the first row per group.
# BEF: keep only the newest record per person (BEF has one snapshot per year)
bef_one_per_person <- bef %>%
group_by(pnr) %>% # group: all rows with the same pnr are treated together
arrange(desc(aar)) %>% # sort within group: newest year first (desc = descending)
slice(1) %>% # take only the first row per group - i.e. the newest year
ungroup() # IMPORTANT: release grouping when doneAlways remember ungroup(). group_by() puts a “grouped_df” tag on your data.frame, which all subsequent operations inherit. That means: - mutate() calculates per group, not across the entire dataset - n() counts rows per group, not in total - summarize() gives one result per group
These errors are hard to detect because the code runs without error messages and produces seemingly reasonable numbers. ungroup() removes the tag and makes the data.frame normal again. Rule of thumb: always end a group_by() chain with ungroup().
# LPR / alle_dx: find first diagnosis per person (for outcome definition)
first_dementia <- alle_dx %>%
filter(icd3 %in% c("G30", "F00", "F01", "F02", "F03")) %>% # only dementia codes
group_by(pnr) %>% # group per person
arrange(date_contact) %>% # oldest date first (ascending - no desc)
slice(1) %>% # first diagnosis per person
ungroup() # release grouping
# → ready for join: cohort %>% left_join(first_dementia, by = "pnr")# VNDS: find first emigration per person (a person may have emigrated and returned)
first_emigration <- vnds %>%
filter(indud_kode == "U") %>% # only emigration events
group_by(pnr) %>%
arrange(haend_dato) %>%
slice(1) %>%
ungroup()Pivots - switch between long and wide
Long → wide with pivot_wider()
library(tidyr) # pivot_wider, pivot_longer
df_wide <- df_long %>%
pivot_wider(
names_from = "timepoint", # which column should become column names?
values_from = "weight" # which column contains the values?
)You CANNOT pivot_wider() an entire LPR table. LPR potentially has hundreds of unique ICD codes per person - pivot_wider() would give you one column per code, i.e. thousands of columns and a dataset that is impossible to work with.
Instead use the pattern from Extract from LPR: extract the first date for one specific diagnosis group (with group_by() + slice() above), save as extract_dementia.rds, and join it as one column with left_join().
Practical example: employment status from AKM - long to wide
AKM (the Labour Classification Module) stores one row per person per year - long format. If you want to compare employment status at baseline and follow-up side by side, you can pivot:
#=====================================================
# AKM: long -> wide (employment status per year)
#=====================================================
library(dplyr) # filter, mutate
library(tidyr) # pivot_wider
# AKM in long format - two years per person:
akm_long <- data.frame(
pnr = c("001", "001", "002", "002", "003", "003"),
aar = c(2015, 2020, 2015, 2020, 2015, 2020),
socio = c(110, 110, 210, 310, 150, 150) # employment categories from AKM (socio13)
)
# Convert to wide - one column per year:
akm_wide <- akm_long %>%
pivot_wider(
names_from = aar,
values_from = socio,
names_prefix = "socio_" # gives columns: socio_2015, socio_2020
)
# Result:
# pnr socio_2015 socio_2020
# 001 110 110
# 002 210 310
# 003 150 150In practice you would typically not pivot AKM - you filter to one specific year with filter(aar == index_year) and use group_by() + slice() above. Pivot is relevant when you explicitly want two time points side by side in the analysis.
Wide → long with pivot_longer()
df_long <- df_wide %>%
pivot_longer(
cols = c(weight_baseline, weight_6mo), # columns to "fold in"
names_to = "timepoint", # name for the new timepoint column
values_to = "weight" # name for the new value column
)Next steps
Once each extract is down to one row per person, you link them into one analysis-ready dataset:
See also
- Link your extracts: joins,
bind_rows()and calculating follow-up time - Functions: overview: all functions for transforming data
Further depth in The Epidemiologist R Handbook: