Long ↔︎ wide format

Get one extract down to one row per person - with group_by() + slice() or pivot

Published

July 21, 2026

Before you can link your extracts together (Link your extracts), each extract needs to be in wide format with one row per person. The registers you extract from are almost always in long format with many rows per person. This page explains the difference and shows the two ways you reshape an extract: group_by() + slice() and pivot_wider() / pivot_longer().

The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.

Wide vs. long format

Imagine three persons with weight measurements at two time points.

Wide format - one row per person, one column per time point:

pnr	weight_baseline	weight_6mo
001	120	95
002	105	88
003	98	80

Long format - one row per observation, one column for time point and one for value:

pnr	timepoint	weight
001	baseline	120
001	6mo	95
002	baseline	105
002	6mo	88
003	baseline	98
003	6mo	80

Both formats contain the same information - they are just arranged differently.

Which formats are where?

Register	Format	Why
BEF	Long	One row per person per snapshot - quarterly since 2008, otherwise annual
LPR (lpr_adm + lpr_diag)	Long	One row per diagnosis per contact - one admission can have 10 diagnoses
LMDB	Long	One row per dispensed prescription - the same person can have hundreds
DODSAARS	Wide	One row per deceased person - already one-per-person
Your analysis dataset	Wide	One row per person - one column per covariate

Registers are usually in long format. Your analysis dataset should typically be in wide format.

Two ways to get one row per person

The key choice is what you want out of the long register:

You want…	Use	Result
One representative value per person (newest snapshot, first event, earliest date)	`group_by() + slice()`	One row per person, same columns
Several time points side by side as separate columns	`pivot_wider()`	One row per person, one new column per time point

In register-based research, group_by() + slice() is by far the most common choice - you usually just want “the newest” or “the first” value per person. You use pivot_wider() when you explicitly want to compare several time points.

`group_by()` + `slice()` - one representative row

The pattern is always the same: group by pnr, sort so the desired row is on top, take the first row per group.

# BEF: keep only the newest record per person (BEF has one snapshot per year)
bef_one_per_person <- bef %>%
  group_by(pnr) %>% # group: all rows with the same pnr are treated together
  arrange(desc(aar)) %>% # sort within group: newest year first (desc = descending)
  slice(1) %>% # take only the first row per group - i.e. the newest year
  ungroup() # IMPORTANT: release grouping when done

Always remember ungroup(). group_by() puts a “grouped_df” tag on your data.frame, which all subsequent operations inherit. That means: - mutate() calculates per group, not across the entire dataset - n() counts rows per group, not in total - summarize() gives one result per group

These errors are hard to detect because the code runs without error messages and produces seemingly reasonable numbers. ungroup() removes the tag and makes the data.frame normal again. Rule of thumb: always end a group_by() chain with ungroup().

# LPR / alle_dx: find first diagnosis per person (for outcome definition)
first_dementia <- alle_dx %>%
  filter(icd3 %in% c("G30", "F00", "F01", "F02", "F03")) %>% # only dementia codes
  group_by(pnr) %>% # group per person
  arrange(date_contact) %>% # oldest date first (ascending - no desc)
  slice(1) %>% # first diagnosis per person
  ungroup() # release grouping
# → ready for join: cohort %>% left_join(first_dementia, by = "pnr")

# VNDS: find first emigration per person (a person may have emigrated and returned)
first_emigration <- vnds %>%
  filter(indud_kode == "U") %>% # only emigration events
  group_by(pnr) %>%
  arrange(haend_dato) %>%
  slice(1) %>%
  ungroup()

Pivots - switch between long and wide

Long → wide with `pivot_wider()`

library(tidyr) # pivot_wider, pivot_longer

df_wide <- df_long %>%
  pivot_wider(
    names_from = "timepoint", # which column should become column names?
    values_from = "weight" # which column contains the values?
  )

You CANNOT pivot_wider() an entire LPR table. LPR potentially has hundreds of unique ICD codes per person - pivot_wider() would give you one column per code, i.e. thousands of columns and a dataset that is impossible to work with.

Instead use the pattern from Extract from LPR: extract the first date for one specific diagnosis group (with group_by() + slice() above), save as extract_dementia.rds, and join it as one column with left_join().

Practical example: employment status from AKM - long to wide

AKM (the Labour Classification Module) stores one row per person per year - long format. If you want to compare employment status at baseline and follow-up side by side, you can pivot:

#=====================================================
# AKM: long -> wide (employment status per year)
#=====================================================
library(dplyr)   # filter, mutate
library(tidyr)   # pivot_wider

# AKM in long format - two years per person:
akm_long <- data.frame(
  pnr   = c("001", "001", "002", "002", "003", "003"),
  aar   = c(2015, 2020, 2015, 2020, 2015, 2020),
  socio = c(110, 110, 210, 310, 150, 150)   # employment categories from AKM (socio13)
)

# Convert to wide - one column per year:
akm_wide <- akm_long %>%
  pivot_wider(
    names_from   = aar,
    values_from  = socio,
    names_prefix = "socio_"   # gives columns: socio_2015, socio_2020
  )

# Result:
#   pnr   socio_2015  socio_2020
#   001          110         110
#   002          210         310
#   003          150         150

In practice you would typically not pivot AKM - you filter to one specific year with filter(aar == index_year) and use group_by() + slice() above. Pivot is relevant when you explicitly want two time points side by side in the analysis.

Wide → long with `pivot_longer()`

df_long <- df_wide %>%
  pivot_longer(
    cols = c(weight_baseline, weight_6mo), # columns to "fold in"
    names_to = "timepoint", # name for the new timepoint column
    values_to = "weight" # name for the new value column
  )

Next steps

Once each extract is down to one row per person, you link them into one analysis-ready dataset:

→ Link your extracts

Wide vs. long format

Which formats are where?

Two ways to get one row per person

group_by() + slice() - one representative row

Pivots - switch between long and wide

Long → wide with pivot_wider()

Wide → long with pivot_longer()

Next steps

See also

`group_by()` + `slice()` - one representative row

Long → wide with `pivot_wider()`

Wide → long with `pivot_longer()`