Socioeconomic variables

Education, income and employment following the SEPLINE approach

Published

July 2, 2026

You now know the pattern: open_dataset → filter → collect → left_join. That is exactly what you use here. Two things are new in this phase:

  1. “Fetch for the year before index date”: SES registers are annual snapshots. You cannot just filter on pnr; you must also match on which year is relevant per person (year(index_date) - 1). This is done by calculating the baseline year per person and using it as a join key.
  2. FAIK via familie_id: Income is linked to the household, not the person directly. You need BEF as a bridge: fetch familie_id from BEF for the baseline year, then join to FAIK on familie_id.

The rest is categorisation per SEPLINE guidelines.
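
The first point can be sketched on made-up data. The block below is a minimal illustration, not validated code: kohort_toy, register_toy and their values are invented for the example, and the register is assumed to have one row per person per year in a column called aar.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Made-up cohort: one row per person (hypothetical values)
kohort_toy <- tibble(
  pnr        = c("001", "002"),
  index_date = as.Date(c("2015-06-15", "2018-02-01"))
)

# Made-up annual register: one row per person per year (hypothetical values)
register_toy <- tibble(
  pnr   = rep(c("001", "002"), each = 3),
  aar   = rep(c(2014L, 2015L, 2017L), times = 2),
  value = c("a14", "a15", "a17", "b14", "b15", "b17")
)

# Baseline year per person, then used as the second join key
baseline_toy <- kohort_toy %>%
  mutate(aar_baseline = as.integer(year(index_date)) - 1L) %>%
  left_join(register_toy, by = c("pnr", "aar_baseline" = "aar"))
```

Each person ends up with exactly one row, taken from their own baseline year, even though the two persons have different index years.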


Socioeconomic position (SEP) is measured in register-based studies via three dimensions: education (UDDA), income (FAIK) and employment (AKM). This page shows how to extract and categorise them following the SEPLINE guideline.

SEPLINE article: Hjorth et al. Clinical Epidemiology 2025 - doi:10.2147/CLEP.S520772. See the article for full justification, recommended reference groups and categorisations.

Important

Under development - code examples, not validated code. The categorisations below are not validated and should not be used directly in analyses without review. They are shown as structural examples of how to code the variables - not as an approved implementation. If you have code that already works, or input on the categorisations, please get in touch: Sara Schwartz - saras@clin.au.dk

Note

There are three ways to access the registers:

  1. fastreg (recommended): Use read_register("akm") - reads by name, fastreg knows the path (see Phase 4).
  2. Parquet via open_dataset() (any project): Use open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/akm/") + rename_with(tolower). The examples below use this pattern; swap in read_register("akm") if you use fastreg.
  3. SAS files (if parquet is not ready): Use haven::read_sas("path/akm.sas7bdat") - but this reads the entire file into RAM. Recommended: convert to parquet once and work from that (see Phase 4 - Convert SAS to parquet).

Your columns may be named differently - check with names(your_data).
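
A quick way to do that check is sketched below, assuming a small made-up table your_data with upper-case column names. The list of needed columns is only an example; adjust it to the register you are reading.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for a register extract with upper-case column names
your_data <- tibble(
  PNR     = c("001", "002"),
  AAR     = c(2014L, 2014L),
  SOCIO13 = c(110L, 310L)
)

names(your_data)                                  # inspect the raw column names
your_data <- your_data %>% rename_with(tolower)   # lower-case them, as in the examples on this page

# Stop early if a column the later steps rely on is missing
needed       <- c("pnr", "aar", "socio13")
missing_cols <- setdiff(needed, names(your_data))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```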

Warning

Fetch the variable for the year BEFORE index - not the index year itself. All three registers are annual: each person has one value per calendar year. When in the year the value applies depends on the register - e.g. income (FAIK) is a sum over the whole calendar year, while education (UDDA) and employment (AKM) use a reference point during the year. Look it up for the specific register if the exact timing matters to you.

Regardless of the exact timing, the point holds: if you take the value from the index year, you risk measuring status that falls after your exposure. Example: index = surgery date 15 June 2015. The 2015 value may reflect status after the surgery and be affected by the exposure itself (e.g. job loss after illness) - that gives reverse causation, whereas you want a baseline value from before the exposure.

So you fetch the value for year(index_date) - 1 (here 2014). This is a safe, uniform rule for the whole cohort, regardless of when in the year index falls.


The three dimensions

Dimension    Register   Variable
Education    UDDA       hfaudd - highest completed education (ISCED code)
Income       FAIK       famaekvivadisp_13 - household-equivalised disposable income
Employment   AKM        socio13 - labour market classification

SEPLINE specifies both how these variables are categorised and when in the follow-up they are measured.


Employment - AKM (socio13)

library(arrow)       # open_dataset()
library(dplyr)       # filter, select, mutate, left_join, collect
library(lubridate)   # year() to extract year from dates

# Replace the path with your project's parquet path - DARTER: read_register("akm")
akm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/akm/") %>%
  rename_with(tolower)   # standardise column names

# Fetch employment status for the year before index date
# (assumes kohort has one row per person with columns pnr and index_date)
# Note: despite its name, index_year holds the cohort's baseline years (index year minus 1)
index_year <- unique(lubridate::year(kohort$index_date) - 1)

akm_data <- akm %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(aar %in% !!index_year) %>%                      # baseline year
  select(pnr, aar, socio13) %>%                               # only the columns we use
  collect()                                                   # fetch into R

# Attach to cohort with index year as join key
cohort_akm <- kohort %>%
  mutate(aar_baseline = lubridate::year(index_date) - 1) %>%        # calculate baseline year
  left_join(akm_data, by = c("pnr", "aar_baseline" = "aar"))        # join on pnr and year

# Categorise per SEPLINE
cohort_akm <- cohort_akm %>%
  mutate(occupation_cat = case_when(
    socio13 %in% c(110, 111, 112, 113, 114, 120, 131, 132, 133, 134, 135, 139) ~ "Employed",
    socio13 == 310                              ~ "Student",
    socio13 %in% c(210, 410)                   ~ "Unemployed",
    socio13 %in% c(220, 321, 330)              ~ "Outside labour market",   # sick pay, disability pension, flex job
    socio13 %in% c(322, 323)                   ~ "Retired",
    TRUE                                        ~ "Unknown"                  # 0, 420 or missing
  ))

Education - UDDA (hfaudd)

Categorised from the first two digits of the ISCED code in hfaudd: short (10/15), medium (20/30/35), long (40-80), unknown (90, missing or any other code).
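
The mapping can be tried on a handful of codes before it is applied to real data. This is a sketch under assumptions: categorise_education() is a hypothetical helper, the codes in toy_codes are made up, and the long category is read as levels 40 to 80 so that 90 falls through to unknown.

```r
library(dplyr)

# Hypothetical helper: maps hfaudd to the categories named above,
# using the first two digits of the code as the level
categorise_education <- function(hfaudd) {
  level <- suppressWarnings(as.integer(substr(as.character(hfaudd), 1, 2)))
  case_when(
    level %in% c(10L, 15L)      ~ "Short",
    level %in% c(20L, 30L, 35L) ~ "Medium",
    level >= 40L & level <= 80L ~ "Long",
    TRUE                        ~ "Unknown"   # 90, missing or any other code
  )
}

# Made-up codes, chosen only to hit each branch
toy_codes     <- c(1021, 3540, 4020, 8010, 9099, NA)
education_toy <- categorise_education(toy_codes)
```

Bounding the long category at 80 matters because case_when() returns the first matching branch: an open-ended "40 or above" test would also catch 90.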

# DARTER: read_register("udda")
udda <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/udda/") %>%
  rename_with(tolower)   # standardise column names

udda_data <- udda %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(aar %in% !!index_year) %>%                      # baseline year
  select(pnr, aar, hfaudd) %>%                               # only the columns we use
  collect()                                                   # fetch into R

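# Keep each person's own baseline-year record (index year minus 1).
# Taking the newest record across all years in index_year could pick a year
# after a person's index date when the cohort spans several index years.
udda_data <- udda_data %>%
  inner_join(
    kohort %>% transmute(pnr, aar_baseline = lubridate::year(index_date) - 1),
    by = c("pnr", "aar" = "aar_baseline")     # match on person and own baseline year
  ) %>%
  distinct(pnr, .keep_all = TRUE)             # one row per person if duplicates occur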

# Categorise per SEPLINE
# Level 90 must not fall into "Long": case_when() returns the first matching branch,
# so the long category is bounded at 80 and everything else ends up as "Unknown"
udda_data <- udda_data %>%
  mutate(
    udd_level = as.integer(substr(as.character(hfaudd), 1, 2)),   # first two digits of hfaudd
    education_cat = case_when(
      udd_level %in% c(10, 15)          ~ "Short",
      udd_level %in% c(20, 30, 35)      ~ "Medium",
      udd_level >= 40 & udd_level <= 80 ~ "Long",
      TRUE                              ~ "Unknown"   # 90, missing or any other code
    )
  )

Income - FAIK via BEF (famaekvivadisp_13)

Income is linked to the household, not the person. You need familie_id from BEF as a bridge. SEPLINE recommends a 3-year average divided into quintiles stratified by sex × 5-year age group × reference year.

# DARTER: read_register("bef") and read_register("faik")
bef  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/bef/")  %>% rename_with(tolower)
faik <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/faik/") %>% rename_with(tolower)

# Fetch familie_id from BEF for baseline year
bef_family <- bef %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(aar %in% !!index_year) %>%                      # baseline year
  select(pnr, aar, familie_id) %>%                            # familie_id is the bridge to FAIK
  collect()                                                   # fetch into R

# Fetch income from FAIK for baseline year
faik_data <- faik %>%
  filter(aar %in% !!index_year) %>%                           # only baseline year
  select(familie_id, aar, famaekvivadisp_13) %>%              # only the columns we use
  collect()                                                   # fetch into R

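# Join: pnr → familie_id → income
# Note: if the cohort spans several index years, bef_family holds one row per person
# per year in index_year; keep only each person's own baseline year before using income
income <- bef_family %>%
  left_join(faik_data, by = c("familie_id", "aar"))           # two-key join: household and year

3-year average and quintiles (SEPLINE recommendation)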

As noted above, SEPLINE recommends a 3-year average of income and quintiles stratified by sex × 5-year age group × reference year. Here is a simplified version that forms quintiles within the cohort, without stratification:

Note

What this code does not do: ntile(mean_income, 5) calculates quintile boundaries from the cohort’s own values. The correct SEPLINE approach uses cut-points (Q20/Q40/Q60/Q80) derived from the full BEF population for each reference year, stratified by sex × 5-year age group. This requires an additional BEF extraction without a pnr filter and is not implemented here.
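
The structure of the population-based approach could look roughly like the sketch below. It is an illustration under assumptions, not the SEPLINE implementation: population_toy is simulated and stands in for the full-population extraction, the column names sex, age_group and mean_income are invented for the example, and the choice of strict inequalities at the cut-points is arbitrary.

```r
library(dplyr)
library(tibble)

set.seed(1)

# Simulated reference population (stand-in for the full BEF + FAIK extraction)
population_toy <- tibble(
  sex       = rep(c("F", "M"), each = 500),
  age_group = rep(c("40-44", "45-49"), times = 500),
  aar       = 2014L,
  income    = round(rlnorm(1000, meanlog = 12, sdlog = 0.5))
)

# Cut-points (Q20/Q40/Q60/Q80) per sex x age group x reference year
cutpoints <- population_toy %>%
  group_by(sex, age_group, aar) %>%
  summarise(
    q20 = quantile(income, 0.20, na.rm = TRUE, names = FALSE),
    q40 = quantile(income, 0.40, na.rm = TRUE, names = FALSE),
    q60 = quantile(income, 0.60, na.rm = TRUE, names = FALSE),
    q80 = quantile(income, 0.80, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )

# Made-up cohort with the same stratification variables
cohort_toy <- tibble(
  pnr         = c("001", "002", "003"),
  sex         = c("F", "M", "F"),
  age_group   = c("40-44", "45-49", "45-49"),
  aar         = 2014L,
  mean_income = c(50000, 160000, 900000)
)

# Attach each person's stratum cut-points and count how many they exceed
cohort_quintile <- cohort_toy %>%
  left_join(cutpoints, by = c("sex", "age_group", "aar")) %>%
  mutate(
    income_cat = 1L + (mean_income > q20) + (mean_income > q40) +
      (mean_income > q60) + (mean_income > q80)   # 1 = lowest, 5 = highest
  ) %>%
  select(pnr, income_cat)
```

Unlike ntile(), the resulting groups need not be of equal size within the cohort, because the boundaries come from the reference population rather than from the cohort itself.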

library(dplyr)   # filter, select, left_join, group_by, summarise, mutate

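# Fetch 3 years: the baseline year (index year minus 1) and the two preceding
# index_year already holds baseline years, so aar_3 is the union of all persons' windows
aar_3 <- unique(c(index_year, index_year - 1, index_year - 2))   # 3-year window for average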

bef_3yr <- bef %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(aar %in% !!aar_3) %>%                           # the 3 years
  select(pnr, aar, familie_id) %>%                       # familie_id is the bridge to FAIK
  collect()                                              # fetch into R

faik_3yr <- faik %>%
  filter(aar %in% !!aar_3) %>%                           # only the 3 baseline years
  select(familie_id, aar, famaekvivadisp_13) %>%         # only the columns we use
  collect()                                              # fetch into R

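# Each person's own baseline year, used to restrict the window per person
baseline <- kohort %>%
  transmute(pnr, aar_baseline = lubridate::year(index_date) - 1)

# Calculate 3-year average per person
income_mean <- bef_3yr %>%
  inner_join(baseline, by = "pnr") %>%                        # attach own baseline year
  filter(aar >= aar_baseline - 2, aar <= aar_baseline) %>%    # only the person's own 3 years
  left_join(faik_3yr, by = c("familie_id", "aar")) %>%        # link income via household and year
  group_by(pnr) %>%                                           # group to calculate average per person
  summarise(
    mean_income = mean(famaekvivadisp_13, na.rm = TRUE),   # mean disposable income (NaN if no income found)
    .groups = "drop"                                        # release grouping automatically
  )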

# Divide into quintiles
income_quintile <- income_mean %>%
  mutate(income_cat = ntile(mean_income, 5))   # ntile(x, 5): 5 groups - 1 = lowest, 5 = highest

Assemble all SES variables onto the cohort

cohort_ses <- kohort %>%
  left_join(cohort_akm      %>% select(pnr, occupation_cat), by = "pnr") %>%   # attach employment
  left_join(udda_data       %>% select(pnr, education_cat),  by = "pnr") %>%   # attach education
  left_join(income_quintile %>% select(pnr, income_cat),     by = "pnr")        # attach income quintile
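
Because all three joins use pnr alone, a covariate table with more than one row per person silently duplicates cohort rows. The sketch below shows one way to check for that on made-up data; kohort_toy and education_toy are invented, and keeping the first row per person is an arbitrary choice for the illustration.

```r
library(dplyr)
library(tibble)

kohort_toy <- tibble(pnr = c("001", "002"))

# Made-up covariate table where "001" accidentally has two rows
education_toy <- tibble(
  pnr           = c("001", "001", "002"),
  education_cat = c("Short", "Medium", "Long")
)

# A plain left_join() repeats the cohort row for "001"
joined_raw <- kohort_toy %>% left_join(education_toy, by = "pnr")

# Find the persons that cause the duplication
dup_pnr <- education_toy %>% count(pnr) %>% filter(n > 1)

# Reduce to one row per person before joining (first row kept, arbitrary here)
education_one <- education_toy %>% distinct(pnr, .keep_all = TRUE)
joined_ok     <- kohort_toy %>% left_join(education_one, by = "pnr")

# The joined table should have exactly as many rows as the cohort
stopifnot(nrow(joined_ok) == nrow(kohort_toy))
```

The same row-count comparison can be run on cohort_ses against kohort after the assembly step.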

Next steps

You now have SES covariates. Next steps depend on what you still need:
