Socioeconomic variables

Education, income and employment following the SEPLINE approach

Published

July 21, 2026

You now know the pattern: read_register/open_dataset → filter → collect → left_join. That is exactly what you use here. Two things are new in this phase:

“Fetch for the year before index date”: SES registers are annual snapshots. You cannot just filter on pnr; you must also match on which year is relevant per person (year(index_date) - 1). This is done by calculating the baseline year per person and using it as a join key.
FAIK via familie_id: Income is linked to the household, not the person directly. You need BEF as a bridge: fetch familie_id from BEF for the baseline year, then join to FAIK on familie_id.

The rest is categorisation per SEPLINE guidelines.

Socioeconomic position (SEP) is measured in register-based studies via three dimensions: education (UDDA), income (FAIK) and employment (AKM). This page shows how to extract and categorise them following the SEPLINE guideline.

SEPLINE article: Hjorth et al. Clinical Epidemiology 2025 - doi:10.2147/CLEP.S520772. See the article for full justification, recommended reference groups and categorisations.

Under development - code examples, not validated code. The categorisations below are not validated and should not be used directly in analyses without review. They are shown as structural examples of how to code the variables - not as an approved implementation. If you have code that already works, or input on the categorisations, please get in touch: Sara Schwartz - saras@clin.au.dk

Accessing the registers - three ways:

fastreg (recommended): Use read_register("akm") - reads by name, fastreg knows the path (see Phase 4). The examples below use this.
Parquet via open_dataset() (any project): Use open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/akm/") + rename_with(tolower), if fastreg is not set up on your project.
SAS files (if parquet is not ready): Use haven::read_sas("path/akm.sas7bdat") - but this reads the entire file into RAM. Recommended: convert to parquet once and work from that (see Phase 4 - Convert SAS to parquet).

Your columns may be named differently - check with names(your_data).

Fetch the variable for the year BEFORE index - not the index year itself. All three registers are annual: each person has one value per calendar year. When in the year the value applies depends on the register - e.g. income (FAIK) is a sum over the whole calendar year, while education (UDDA) and employment (AKM) use a reference point during the year. Look it up for the specific register if the exact timing matters to you.

Regardless of the exact timing, the point holds: if you take the value from the index year, you risk measuring status that falls after your exposure. Example: index = surgery date 15 June 2015. The 2015 value may reflect status after the surgery and be affected by the exposure itself (e.g. job loss after illness) - that gives reverse causation, whereas you want a baseline value from before the exposure.

So you fetch the value for year(index_date) - 1 (here 2014). This is a safe, uniform rule for the whole cohort, regardless of when in the year index falls.
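
As a minimal sketch of this rule, assuming a toy cohort with the columns pnr and index_date (the values are made up for illustration):

```r
library(dplyr)     # tibble(), mutate()
library(lubridate) # year()

# Toy cohort (hypothetical values, for illustration only)
toy_cohort <- tibble(
  pnr        = c("A", "B", "C"),
  index_date = as.Date(c("2015-06-15", "2015-01-02", "2018-12-30"))
)

# Baseline year = calendar year before the index date, calculated per person
toy_cohort <- toy_cohort %>%
  mutate(aar_baseline = year(index_date) - 1)

toy_cohort$aar_baseline
```

Both the January and the June index date in 2015 map to baseline year 2014, so the rule does not depend on when in the year index falls.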

The three dimensions

Dimension	Register	Variable
Education	UDDA	`hfaudd` - highest completed education (ISCED code)
Income	FAIK	`famaekvivadisp_13` - household-equivalised disposable income
Employment	AKM	`socio13` - labour market classification

SEPLINE specifies both how these variables are categorised and when in the follow-up they are measured.

Employment - AKM (`socio13`)

library(fastreg) # read_register() - reads registers by name
library(arrow) # open_dataset() - fallback without fastreg
library(dplyr) # filter, select, mutate, left_join, collect
library(lubridate) # year() to extract year from dates

# 1. Open AKM via fastreg (read_register gets the path from your project config)
akm <- read_register("akm") %>%
  rename_with(tolower) # standardise column names
# Without fastreg: open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/akm/") %>% rename_with(tolower)

# 2. Fetch employment status for the year before index date
# (assumes kohort has columns pnr and index_date)
index_year <- unique(lubridate::year(kohort$index_date) - 1) # all baseline years in the cohort (index year minus 1); used to pre-filter

akm_data <- akm %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
  filter(aar %in% !!index_year) %>% # baseline year
  select(pnr, aar, socio13) %>% # only the columns we use
  collect() # fetch into R

# 3. Attach to cohort with index year as join key
cohort_akm <- kohort %>%
  mutate(aar_baseline = lubridate::year(index_date) - 1) %>% # calculate baseline year
  left_join(akm_data, by = c("pnr", "aar_baseline" = "aar")) # join on pnr and year

# 4. Categorise per SEPLINE
cohort_akm <- cohort_akm %>%
  mutate(
    occupation_cat = case_when(
      socio13 %in%
        c(
          110,
          111,
          112,
          113,
          114,
          120,
          131,
          132,
          133,
          134,
          135,
          139
        ) ~ "Employed",
      socio13 == 310 ~ "Student",
      socio13 %in% c(210, 410) ~ "Unemployed",
      socio13 %in% c(220, 321, 330) ~ "Outside labour market", # sick pay, disability pension, flex job
      socio13 %in% c(322, 323) ~ "Retired",
      TRUE ~ "Unknown" # 0, 420 or missing
    )
  )
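
Because step 3 uses a left_join, cohort members without an AKM record in their own baseline year get socio13 = NA and end up in "Unknown" together with the codes 0 and 420. It may be worth counting them separately before categorising. A minimal sketch with made-up toy data (toy_cohort and toy_akm are hypothetical stand-ins for kohort and akm_data):

```r
library(dplyr)

# Toy cohort with a per-person baseline year (hypothetical values)
toy_cohort <- tibble(
  pnr          = c("A", "B", "C"),
  aar_baseline = c(2014, 2014, 2017)
)

# Toy AKM extract: B has no record, C only has a record for the wrong year
toy_akm <- tibble(
  pnr     = c("A", "C"),
  aar     = c(2014, 2016),
  socio13 = c(131, 310)
)

toy_joined <- toy_cohort %>%
  left_join(toy_akm, by = c("pnr", "aar_baseline" = "aar"))

# Number of cohort members with no AKM record in their baseline year
n_no_record <- sum(is.na(toy_joined$socio13))
n_no_record
```

Here two of the three persons lack a baseline-year record, which is easy to miss once they are merged into "Unknown".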

Education - UDDA (`hfaudd`)

Categorised from the ISCED code in hfaudd: short (10/15), medium (20/30/35), long (40–80), unknown (90 or missing).

# 1. Open UDDA via fastreg (read_register gets the path from your project config)
udda <- read_register("udda") %>%
  rename_with(tolower) # standardise column names
# Without fastreg: open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/udda/") %>% rename_with(tolower)

# 2. Fetch education for the baseline year
udda_data <- udda %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
  filter(aar %in% !!index_year) %>% # baseline year
  select(pnr, aar, hfaudd) %>% # only the columns we use
  collect() # fetch into R

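# 3. Keep each person's own baseline-year record
# (index_year holds every baseline year in the cohort, so step 2 can return several years per person)
udda_data <- kohort %>%
  mutate(aar_baseline = lubridate::year(index_date) - 1) %>% # per-person baseline year
  select(pnr, aar_baseline) %>% # only the join keys
  inner_join(udda_data, by = c("pnr", "aar_baseline" = "aar")) %>% # drop records from other years
  distinct(pnr, .keep_all = TRUE) # one row per person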

# 4. Categorise per SEPLINE
udda_data <- udda_data %>%
  mutate(
    education_cat = case_when(
      is.na(hfaudd) | substr(as.character(hfaudd), 1, 2) == "90" ~ "Unknown", # must come before the >= 40 test
      substr(as.character(hfaudd), 1, 2) %in% c("10", "15") ~ "Short",
      substr(as.character(hfaudd), 1, 2) %in% c("20", "30", "35") ~ "Medium",
      as.numeric(substr(as.character(hfaudd), 1, 2)) >= 40 ~ "Long",
      TRUE ~ "Unknown"
    )
  )
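
case_when() takes the first matching branch, so the order of the conditions matters: a prefix of 90 also satisfies `>= 40`. A minimal sketch with made-up codes (not real hfaudd values) where the unknown check is placed first:

```r
library(dplyr)

# Made-up codes for illustration only; the last value is missing
toy_codes <- tibble(hfaudd = c("1021", "3010", "5090", "9099", NA))

toy_codes <- toy_codes %>%
  mutate(
    prefix = substr(hfaudd, 1, 2), # first two characters of the code
    education_cat = case_when(
      is.na(prefix) | prefix == "90" ~ "Unknown", # checked before the numeric test
      prefix %in% c("10", "15") ~ "Short",
      prefix %in% c("20", "30", "35") ~ "Medium",
      as.numeric(prefix) >= 40 ~ "Long",
      TRUE ~ "Unknown"
    )
  )

toy_codes$education_cat
```

With the numeric test placed before the unknown check, the code starting with 90 would be labelled "Long" instead.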

Income - FAIK via BEF (`famaekvivadisp_13`)

Income is linked to the household, not the person. You need familie_id from BEF as a bridge. SEPLINE recommends a 3-year average divided into quintiles stratified by sex × 5-year age group × reference year.

# 1. Open BEF + FAIK via fastreg (read_register gets the path from your project config)
bef <- read_register("bef") %>%
  rename_with(tolower)
faik <- read_register("faik") %>%
  rename_with(tolower)
# Without fastreg: open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/bef/") %>% rename_with(tolower)  (same for faik)

# 2. Fetch familie_id from BEF for baseline year
bef_family <- bef %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
  filter(aar %in% !!index_year) %>% # baseline year
  select(pnr, aar, familie_id) %>% # familie_id is the bridge to FAIK
  collect() # fetch into R

# 3. Fetch income from FAIK for baseline year
faik_data <- faik %>%
  filter(aar %in% !!index_year) %>% # only baseline year
  select(familie_id, aar, famaekvivadisp_13) %>% # only the columns we use
  collect() # fetch into R

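# 4. Join: pnr → familie_id → income, keeping each person's own baseline year
income <- kohort %>%
  mutate(aar = lubridate::year(index_date) - 1) %>% # per-person baseline year
  select(pnr, aar) %>% # only the join keys
  left_join(bef_family, by = c("pnr", "aar")) %>% # pnr → familie_id for that year
  left_join(faik_data, by = c("familie_id", "aar")) # two-key join: household and year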

3-year average and quintiles (SEPLINE recommendation)

SEPLINE recommends a 3-year average of income and quintiles stratified by sex × 5-year age group × year. Here is a simplified version with quintiles per year:

What this code does not do: ntile(mean_income, 5) calculates quintile boundaries from the cohort’s own values. The correct SEPLINE approach uses cut-points (Q20/Q40/Q60/Q80) derived from the full BEF population for each reference year, stratified by sex × 5-year age group. This requires an additional BEF extraction without a pnr filter and is not implemented here.

library(dplyr)   # filter, select, left_join, group_by, summarise, mutate

# Fetch 3 years: the baseline year and the two preceding
aar_3 <- unique(c(index_year, index_year - 1, index_year - 2))   # union of all 3-year windows in the cohort

bef_3yr <- bef %>%
  semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>%   # only the cohort's pnr's
  filter(aar %in% !!aar_3) %>%                           # the 3 years
  select(pnr, aar, familie_id) %>%                       # familie_id is the bridge to FAIK
  collect()                                              # fetch into R

faik_3yr <- faik %>%
  filter(aar %in% !!aar_3) %>%                           # only the 3 baseline years
  select(familie_id, aar, famaekvivadisp_13) %>%         # only the columns we use
  collect()                                              # fetch into R

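# Calculate 3-year average per person over that person's own 3-year window
income_mean <- kohort %>%
  mutate(aar_baseline = lubridate::year(index_date) - 1) %>% # per-person baseline year
  select(pnr, aar_baseline) %>%                            # only the join keys
  left_join(bef_3yr, by = "pnr") %>%                       # all fetched years per person
  filter(aar >= aar_baseline - 2, aar <= aar_baseline) %>% # keep baseline year and the two before
  left_join(faik_3yr, by = c("familie_id", "aar")) %>%     # link income via household and year
  group_by(pnr, aar_baseline) %>%                          # one average per person
  summarise(
    mean_income = mean(famaekvivadisp_13, na.rm = TRUE),   # mean disposable income
    .groups = "drop"                                        # release grouping automatically
  )

# Divide into quintiles within each baseline year
income_quintile <- income_mean %>%
  group_by(aar_baseline) %>%                               # quintiles per year
  mutate(income_cat = ntile(mean_income, 5)) %>%           # ntile(x, 5): 5 groups - 1 = lowest, 5 = highest
  ungroup()                                                # release grouping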
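
The population-based cut-points (Q20/Q40/Q60/Q80) that the simplified code leaves out could be sketched roughly as follows. This is an illustration under assumptions, not a validated SEPLINE implementation: toy_population is a small constructed table standing in for a full population extraction, and the column names sex, age_group and income are hypothetical.

```r
library(dplyr)

# Constructed reference population: one year, one age group, two sexes
toy_population <- tibble(
  aar       = 2014,
  sex       = rep(c(1, 2), each = 100),
  age_group = "40-44",
  income    = c(seq(100, 595, by = 5), seq(200, 695, by = 5))
)

# Cut-points per stratum (year x sex x age group), from the reference population
cut_points <- toy_population %>%
  group_by(aar, sex, age_group) %>%
  summarise(
    q20 = quantile(income, 0.2, names = FALSE),
    q40 = quantile(income, 0.4, names = FALSE),
    q60 = quantile(income, 0.6, names = FALSE),
    q80 = quantile(income, 0.8, names = FALSE),
    .groups = "drop"
  )

# Toy cohort with a 3-year mean income per person (hypothetical values)
toy_cohort <- tibble(
  pnr         = c("A", "B", "C"),
  aar         = 2014,
  sex         = c(1, 2, 1),
  age_group   = "40-44",
  mean_income = c(250, 250, 500)
)

# Assign quintile by counting how many stratum cut-points the income exceeds
toy_cohort <- toy_cohort %>%
  left_join(cut_points, by = c("aar", "sex", "age_group")) %>%
  mutate(
    income_cat = 1 + (mean_income > q20) + (mean_income > q40) +
      (mean_income > q60) + (mean_income > q80)
  ) %>%
  select(pnr, sex, mean_income, income_cat)

toy_cohort
```

Persons A and B have the same income but fall in different quintiles because their strata have different cut-points, which is the point of stratifying. How ties at a cut-point are handled (here `>`) is a choice that should be checked against the SEPLINE article.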

Assemble all SES variables onto the cohort

cohort_ses <- kohort %>%
  left_join(cohort_akm %>% select(pnr, occupation_cat), by = "pnr") %>% # attach employment
  left_join(udda_data %>% select(pnr, education_cat), by = "pnr") %>% # attach education
  left_join(income_quintile %>% select(pnr, income_cat), by = "pnr") # attach income quintile
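
A left_join adds rows when the right-hand table has more than one row per pnr, so it can be worth checking the row count after assembling. A minimal sketch with made-up toy data (toy_education is a hypothetical stand-in for one of the SES tables):

```r
library(dplyr)

toy_cohort <- tibble(pnr = c("A", "B", "C"))

# Toy SES table where B wrongly appears twice
toy_education <- tibble(
  pnr           = c("A", "B", "B"),
  education_cat = c("Short", "Medium", "Long")
)

# pnr values that occur more than once in the table to be joined
duplicated_pnr <- toy_education %>%
  count(pnr) %>%
  filter(n > 1) %>%
  pull(pnr)

toy_joined <- toy_cohort %>%
  left_join(toy_education, by = "pnr")

# TRUE only if the join kept exactly one row per cohort member
rows_unchanged <- nrow(toy_joined) == nrow(toy_cohort)
rows_unchanged
```

In this toy case the check fails because B is duplicated; the same comparison against nrow(kohort) can be run on cohort_ses.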

Next steps

You now have SES covariates. Next steps depend on what you still need:

Specialist packages (OSDC for diabetes classification, NMI for comorbidity score, analysis tools)? → Algorithms & special packages
Ready to export results? → Phase 14 - Export and repatriation
