Long ↔︎ wide format

Get one extract down to one row per person - with group_by() + slice() or pivot

Published

July 2, 2026

Before you can link your extracts together (Link your extracts), each extract needs to be in wide format with one row per person. The registers you extract from are almost always in long format with many rows per person. This page explains the difference and shows the two ways you reshape an extract: group_by() + slice() and pivot_wider() / pivot_longer().

Note

The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.


Wide vs. long format

Imagine three persons with weight measurements at two time points.

Wide format - one row per person, one column per time point:

pnr weight_baseline weight_6mo
001 120 95
002 105 88
003 98 80

Long format - one row per observation, one column for time point and one for value:

pnr timepoint weight
001 baseline 120
001 6mo 95
002 baseline 105
002 6mo 88
003 baseline 98
003 6mo 80

Both formats contain the same information - they are just arranged differently.


Which formats are where?

Register Format Why
BEF Long One row per person per snapshot - quarterly since 2008, otherwise annual
LPR (lpr_adm + lpr_diag) Long One row per diagnosis per contact - one admission can have 10 diagnoses
LMDB Long One row per dispensed prescription - the same person can have hundreds
DODSAARS Wide One row per deceased person - already one-per-person
Your analysis dataset Wide One row per person - one column per covariate

Registers are usually in long format. Your analysis dataset should typically be in wide format.


Two ways to get one row per person

The key choice is what you want out of the long register:

You want… Use Result
One representative value per person (newest snapshot, first event, earliest date) group_by() + slice() One row per person, same columns
Several time points side by side as separate columns pivot_wider() One row per person, one new column per time point

In register-based research, group_by() + slice() is by far the most common choice - you usually just want “the newest” or “the first” value per person. You use pivot_wider() when you explicitly want to compare several time points.


group_by() + slice() - one representative row

The pattern is always the same: group by pnr, sort so the desired row is on top, take the first row per group.

# BEF: keep only the newest record per person (BEF has one snapshot per year)
bef_one_per_person <- bef %>%
  group_by(pnr) %>%          # group: all rows with the same pnr are treated together
  arrange(desc(aar)) %>%     # sort within group: newest year first (desc = descending)
  slice(1) %>%               # take only the first row per group - i.e. the newest year
  ungroup()                  # IMPORTANT: release grouping when done
Important

Always remember ungroup(). group_by() puts a “grouped_df” tag on your data.frame, which all subsequent operations inherit. That means: - mutate() calculates per group, not across the entire dataset - n() counts rows per group, not in total - summarize() gives one result per group

These errors are hard to detect because the code runs without error messages and produces seemingly reasonable numbers. ungroup() removes the tag and makes the data.frame normal again. Rule of thumb: always end a group_by() chain with ungroup().

# LPR / alle_dx: find first diagnosis per person (for outcome definition)
first_dementia <- alle_dx %>%
  filter(icd3 %in% c("G30", "F00", "F01", "F02", "F03")) %>%   # only dementia codes
  group_by(pnr) %>%              # group per person
  arrange(date_contact) %>%      # oldest date first (ascending - no desc)
  slice(1) %>%                   # first diagnosis per person
  ungroup()                      # release grouping
# → ready for join: cohort %>% left_join(first_dementia, by = "pnr")
# VNDS: find first emigration per person (a person may have emigrated and returned)
first_emigration <- vnds %>%
  filter(indud_kode == "U") %>%   # only emigration events
  group_by(pnr) %>%
  arrange(haend_dato) %>%
  slice(1) %>%
  ungroup()

Pivots - switch between long and wide

Long → wide with pivot_wider()

library(tidyr)   # pivot_wider, pivot_longer

df_wide <- df_long %>%
  pivot_wider(
    names_from  = "timepoint",  # which column should become column names?
    values_from = "weight"      # which column contains the values?
  )
Warning

You CANNOT pivot_wider() an entire LPR table. LPR potentially has hundreds of unique ICD codes per person - pivot_wider() would give you one column per code, i.e. thousands of columns and a dataset that is impossible to work with.

Instead use the pattern from Extract from LPR: extract the first date for one specific diagnosis group (with group_by() + slice() above), save as extract_dementia.rds, and join it as one column with left_join().

Practical example: employment status from AKM - long to wide

AKM (the Labour Classification Module) stores one row per person per year - long format. If you want to compare employment status at baseline and follow-up side by side, you can pivot:

#=====================================================
# AKM: long -> wide (employment status per year)
#=====================================================
library(dplyr)   # filter, mutate
library(tidyr)   # pivot_wider

# AKM in long format - two years per person:
akm_long <- data.frame(
  pnr   = c("001", "001", "002", "002", "003", "003"),
  aar   = c(2015, 2020, 2015, 2020, 2015, 2020),
  socio = c(110, 110, 210, 310, 150, 150)   # employment categories from AKM (socio13)
)

# Convert to wide - one column per year:
akm_wide <- akm_long %>%
  pivot_wider(
    names_from   = aar,
    values_from  = socio,
    names_prefix = "socio_"   # gives columns: socio_2015, socio_2020
  )

# Result:
#   pnr   socio_2015  socio_2020
#   001          110         110
#   002          210         310
#   003          150         150

In practice you would typically not pivot AKM - you filter to one specific year with filter(aar == index_year) and use group_by() + slice() above. Pivot is relevant when you explicitly want two time points side by side in the analysis.

Wide → long with pivot_longer()

df_long <- df_wide %>%
  pivot_longer(
    cols      = c(weight_baseline, weight_6mo),  # columns to "fold in"
    names_to  = "timepoint",                     # name for the new timepoint column
    values_to = "weight"                         # name for the new value column
  )

Next steps

Once each extract is down to one row per person, you link them into one analysis-ready dataset:

Link your extracts

See also

TipFurther information

Further depth in The Epidemiologist R Handbook:

Back to top