Extract from LPR

Practical recipes - the two approaches, helper functions and integration with the cohort

Published

July 21, 2026

This page shows how to extract diagnoses from LPR with code. It builds on the structure from Understand LPR: the periods (LPR2/LPR3), the D-prefix, the diagnosis types (A/B/G) and the filter for retracted diagnoses. Read that page first if you haven’t already.

You use the same extraction pattern as in Phase 5 and Phase 6 - just applied to LPR’s two generations. This page is the most important and probably the most complex part of the guide.

Circular dependency with Phase 10 - deliberate. To extract diagnoses (this page) we assume you already have a cohort; but you build the cohort in Phase 10 using exactly the extraction pattern you learn here. Read the phases in order, and come back to the code when your cohort is ready. The code uses inner_join() and bind_rows(); if these are new, they are explained in detail in Phase 12.

LPR3 files and column names - check before you load.

Use the LPR_A files consistently. DST has switched LPR3 from the old LPR_F format (kontakter, diagnoser, forloeb) to the new LPR_A format (lpr_a_kontakt, lpr_a_diagnose). Both versions may sit in your folder and cover the same years - loading both (or mixing old and new) gives you duplicated rows. Use lpr_a_* as in the examples here, and leave the LPR_F files alone. lpr_a_kontakt also contains older data that is already in LPR2; filter it out with lprindberetningssystem == "LPR3". This is a general property of LPR_A - on DARTER it is spelled out in pitfall 5.
Verify column names. The switch to LPR_A changed a number of variable names (often opaque Danish abbreviations). The examples use the names the registers typically have, but always check before relying on a column name. On a lazy open_dataset() object, use arrow::schema(your_data) - it lists the column names and their types without reading data into RAM. colnames(your_data) also works, and is what you use for the DuckDB-backed objects that fastreg’s read_register() returns. Look names up in Overview of registers.
Mind the 2019-2020 diagnosis spike. The LPR3 contact model registers far more outpatient diagnoses than LPR2 did, producing a spike in diagnosis counts around 2019-2020 that matters for any count- or trend-based analysis. See the diagnosis spike.
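The column-name check described above can be made routine with a small guard that stops early when an expected column is missing. This is a minimal sketch under stated assumptions: check_columns() is a hypothetical helper (it is not part of fastreg or arrow), and the toy table with made-up values stands in for a register. It relies only on colnames(), which the text above says works on both kinds of objects.

```r
# Hypothetical helper: stop with a readable message if expected columns are missing.
check_columns <- function(data, expected) {
  present <- tolower(colnames(data))            # compare case-insensitively
  missing <- setdiff(tolower(expected), present)
  if (length(missing) > 0) {
    stop("Missing column(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for a diagnosis table (made-up values)
toy_diag <- data.frame(
  DW_EK_KONTAKT  = c(1, 2),
  DIAG_KODE      = c("DE119", "DI219"),
  DIAG_KODE_TYPE = c("A", "B")
)

# Passes silently: all three columns are present
check_columns(toy_diag, c("dw_ek_kontakt", "diag_kode", "diag_kode_type"))

# A missing column is reported by name instead of failing deep inside a join
res <- tryCatch(
  check_columns(toy_diag, c("diag_kode", "senere_afkraeftet")),
  error = function(e) conditionMessage(e)
)
res
```

Calling such a guard right after opening each register keeps a renamed column from surfacing later as an obscure join or filter error.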

Fetch diagnoses from LPR - choose your approach

Choose one of two approaches depending on your study:

	Approach 1 - direct extraction	Approach 2 - `alle_dx`
Best when	You have one or a few outcomes	You have multiple outcomes from LPR
Workflow	Fetch specific codes → Exclude → done	Fetch all → Exclude → filter per outcome
Advantage	Simpler and faster for single-outcome studies	LPR queried only once; reused for all outcomes

Approach 1 is best for a smaller number of outcomes. Filter on specific ICD codes directly in the filter() step before collect(). DuckDB/Arrow pushes the filter down to the storage layer - only matching rows are loaded into RAM.

Approach 2 is best when your study has multiple outcomes. You query LPR once and build alle_dx: a shared table with all A and B diagnoses. For each new outcome, filter alle_dx on the relevant codes - the only line you change is the code list.

The examples require parquet files and a completed study population. cohort is the data.frame with pnr and index_date per person - see Phase 10.

The examples use read_register("name") (fastreg); without fastreg use open_dataset("E:/workdata/[projectnumber]/.../<register>/") with your own path. Note: LPR is one of the largest registers, so the data must be in parquet. That lets Arrow/DuckDB filter before loading and fetch only the rows you ask for; reading LPR directly from e.g. SAS pulls the whole register into RAM. Convert SAS to parquet first: Phase 4 - SAS to Parquet.

Why semi_join(tibble(pnr = ...)) and not filter(pnr %in% ...)? Both keep only the cohort’s rows, but a semi_join against a small tibble of pnr’s pushes down into Arrow/DuckDB more efficiently and reliably. A large %in% filter with an R vector can get slow or be rejected outright (especially with an older duckplyr). It also frees you from !!: semi_join takes an ordinary local table as its argument. You still need !! when you pass a local R vector into a filter(), e.g. a code list (substr(c_diag, 2, 4) %in% !!CODES).
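To see that the two variants select the same rows, here is a minimal sketch on in-memory toy tables with made-up values. It only illustrates the equivalence of the result; the performance difference described above applies to lazy Arrow/DuckDB objects and is not visible here. On an in-memory tibble the %in% filter also needs no !!.

```r
library(dplyr)

# Toy stand-ins (made-up values): a contact table and a cohort of three pnr's
contacts <- tibble(
  pnr    = c("001", "001", "002", "003", "004"),
  recnum = 1:5
)
cohort_pnrs <- c("001", "003", "999")   # "999" has no contacts at all

# Variant A: semi_join against a one-column tibble
via_join <- contacts %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr")

# Variant B: %in% with a local vector
via_in <- contacts %>%
  filter(pnr %in% cohort_pnrs)

# Both variants keep the same contacts; persons without contacts add no rows
all.equal(via_join, via_in)
via_join$recnum
```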

Approach 1 - fetch specific diagnoses directly (start here for one outcome)

Filter on specific codes before collect(). The example fetches diabetes mellitus (E10–E14) - replace CODES_REGEX with your own codes.

#=====================================================
# Extract diabetes diagnoses from LPR (Approach 1)
#=====================================================
library(fastreg) # read_register - read a register by name
library(arrow)   # open_dataset - fallback without fastreg
library(dplyr)

cohort_pnrs <- unique(cohort$pnr)   # cohort = your study population from Phase 10
CODES_REGEX <- "^DE1[0-4]"   # diabetes mellitus (E10–E14) - with D-prefix

#-----------------------------------------------------
# LPR2 (somatic): lpr_adm + lpr_diag
#-----------------------------------------------------
lpr_adm  <- read_register("lpr_adm")  %>% rename_with(tolower)
lpr_diag <- read_register("lpr_diag") %>% rename_with(tolower)

# Here we combine LPR2's two tables: the contact register lpr_adm (pnr + dates) and the diagnosis
# register lpr_diag (the ICD code), filter on your codes/diagnosis types, and select the columns you need.
# You can pull as many codes at once as you like (via CODES_REGEX), and you choose the object name YOURSELF
# (here 'lpr2_dm' = LPR2 + diabetes mellitus, because the example is diabetes).
lpr2_dm <- lpr_adm %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%   # ONLY your cohort. OMIT this line if you want the WHOLE population
  select(pnr, recnum, date_contact = d_inddto) %>%  # pick columns from LPR2; d_inddto is RENAMED to date_contact (see note below)
  inner_join(
    lpr_diag %>%                                    # diagnosis register: has recnum + ICD code, but neither pnr nor date
      filter(c_diagtype %in% c("A", "B"),           # A + B = action/secondary diagnoses; add "G" (-> c("A","B","G")) if you also want underlying conditions
             grepl(CODES_REGEX, c_diag)) %>%         # keep only your codes - filter BEFORE collect (D-prefix in the regex)
      select(recnum, c_diag, c_diagtype),           # the diagnosis columns we need
    by = "recnum"                                   # recnum = the key linking contact and diagnosis in LPR2
  ) %>%
  collect() %>%                                      # ONLY here is data pulled into RAM (everything above runs in the database)
  mutate(icd3 = substr(c_diag, 2, 4))                # strip the D-prefix: "DE11" -> "E11"

#-----------------------------------------------------
# LPR3: lpr_a_kontakt + lpr_a_diagnose
#-----------------------------------------------------
lpr3_k <- read_register("lpr_a_kontakt")  %>% rename_with(tolower) %>%
  filter(lprindberetningssystem == "LPR3")   # keep only LPR3-system rows (avoid overlap - see warning at top)
lpr3_d <- read_register("lpr_a_diagnose") %>% rename_with(tolower)

lpr3_dm <- lpr3_k %>%                                 # same pattern as LPR2 - but LPR3 tables and different column names
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%   # ONLY your cohort. OMIT this line for the WHOLE population
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%  # LPR3's date column renamed to the SAME name: date_contact
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),         # A + B as in LPR2; "G" (underlying condition) is recorded in LPR2 only
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",  # drop withdrawn diagnoses (LPR3 only)
             grepl(CODES_REGEX, diag_kode)) %>%        # same codes; filter BEFORE collect
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),  # rename LPR3 names -> same names as LPR2
    by = "dw_ek_kontakt"                               # LPR3's contact key (NOT recnum as in LPR2)
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact),         # the LPR3 date is a datetime -> make it a plain date
         icd3 = substr(c_diag, 2, 4))                  # strip the D-prefix

#-----------------------------------------------------
# LPR2 psychiatric (up to March 2019) - OPTIONAL: only if your outcome is an F-code (dementia, depression, ...)
#-----------------------------------------------------
psyk_adm  <- read_register("t_psyk_adm")  %>%
  rename_with(tolower) %>% rename(pnr = v_cpr, recnum = k_recnum)
psyk_diag <- read_register("t_psyk_diag") %>%
  rename_with(tolower) %>% rename(recnum = v_recnum)

lpr2_psyk_dm <- psyk_adm %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    psyk_diag %>%
      filter(c_diagtype %in% c("A", "B"),
             grepl(CODES_REGEX, c_diag)) %>%          # same codes as above
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

#-----------------------------------------------------
# Combine LPR2 (somatic + psych) + LPR3 into one extract
#-----------------------------------------------------
# Works because we gave the extracts the SAME column names above (date_contact, c_diag, ...):
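dm_dx <- bind_rows(lpr2_dm, lpr2_psyk_dm, lpr3_dm) %>%   # drop lpr2_psyk_dm if you didn't pull psych
  select(pnr, date_contact, c_diag, c_diagtype, icd3)    # drop the join keys (recnum/dw_ek_kontakt)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3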

What is date_contact = d_inddto? (why is date_contact coloured?) Inside select(), new_name = old_name means you rename a column. d_inddto is LPR2’s actual date column, and date_contact is a name you choose - the editor colours it like an argument, but it’s just the new column name (not a function argument). We rename because LPR3 has a different date column (kont_starttidspunkt); giving both the name date_contact makes the two extracts share column names, so bind_rows() can stack them.

Must vs. choice: the register names (pnr, recnum, d_inddto, c_diag, c_diagtype, dw_ek_kontakt, kont_starttidspunkt, diag_kode …) must be spelled exactly as in the register. The new names (date_contact, icd3) and the object name (lpr2_dm) are yours to choose - just keep the harmonized names identical across LPR2 and LPR3.
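The renaming logic can be tried on two toy tables. This is a minimal sketch with made-up values; only the column names d_inddto, kont_starttidspunkt, c_diag and diag_kode come from the examples above. It shows that harmonised names let bind_rows() stack the rows, whereas unharmonised names leave the data spread over separate columns.

```r
library(dplyr)

# Toy LPR2-style extract (made-up values): the date is already a Date
toy_lpr2 <- tibble(
  pnr      = c("001", "002"),
  d_inddto = as.Date(c("2015-03-01", "2017-11-20")),
  c_diag   = c("DE119", "DE109")
)

# Toy LPR3-style extract: different column names, and the date is a datetime
toy_lpr3 <- tibble(
  pnr                 = c("001", "003"),
  kont_starttidspunkt = as.POSIXct(c("2020-05-04 09:30:00", "2021-01-15 14:00:00"), tz = "UTC"),
  diag_kode           = c("DE119", "DE139")
)

# Harmonise names (new_name = old_name) and types, then stack
lpr2_h <- toy_lpr2 %>% select(pnr, date_contact = d_inddto, c_diag)
lpr3_h <- toy_lpr3 %>%
  select(pnr, date_contact = kont_starttidspunkt, c_diag = diag_kode) %>%
  mutate(date_contact = as.Date(date_contact))

stacked <- bind_rows(lpr2_h, lpr3_h)
stacked

# Without the renaming, bind_rows() still runs, but the dates and codes end up
# in separate columns padded with NA
unharmonised <- bind_rows(toy_lpr2, toy_lpr3)
names(unharmonised)
```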

Many specific codes? Build the regex programmatically:

codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")

Do you have F-codes (e.g. dementia, depression)? Extend CODES_REGEX to include them, e.g. "^DE1[0-4]|^DF0[0-3]|^DG30", and include the optional LPR2 psychiatric block above (F-diagnoses before March 2019 live only there).
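Before running a pattern against the registers, it can help to test it on a handful of strings. A minimal sketch in base R, with made-up test codes; the pattern is built the same way as in the snippet above.

```r
# Build the pattern as in the snippet above, here with E-, F- and G-codes mixed
codes <- c("E10", "E11", "F00", "G30")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")
CODES_REGEX

# Made-up codes to test against; "E119" lacks the D-prefix on purpose
test_codes <- c("DE109", "DE119", "DF009", "DG309", "DE660", "E119", "DZ763")
hits <- grepl(CODES_REGEX, test_codes)
data.frame(code = test_codes, matched = hits)

# Why the parentheses matter: without them only the first alternative is anchored
unanchored <- grepl("^DE10|E11", "DXXE11")     # TRUE: "E11" is found anywhere in the string
anchored   <- grepl("^D(E10|E11)", "DXXE11")   # FALSE: the whole group is anchored at the start
```

The hand-written pattern in the paragraph above avoids the same problem by repeating ^ in front of each alternative.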

Alternative: compact extraction (single-table approach)

A colleague may have shown you this shorter approach:

lpr <- left_join(lpr_adm, lpr_diag, by = "RECNUM") %>%
  filter(C_DIAGTYPE == "A",
         grepl("^S72", C_DIAG)) %>%
  group_by(PNR) %>%
  filter(D_INDDTO == min(D_INDDTO)) %>%
  slice(1) %>%
  ungroup()

It is shorter but has three pitfalls on DST data:

D-prefix error: "^S72" does NOT match "DS72..." in DST data - returns zero rows with no error message. Use "^DS72" (with D) or strip the prefix first.
left_join instead of inner_join: Keeps all admissions from lpr_adm - including those with no matching diagnosis. Unnecessarily heavy on national registers.
No pnr filter: Loads the entire population’s data. Correct when building a cohort (Phase 10), not when extracting from an existing one.
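The first two pitfalls are easy to reproduce on toy tables. A minimal sketch with made-up values; the codes carry the D-prefix, as the text above says DST data does.

```r
library(dplyr)

# Toy stand-ins (made-up values)
toy_adm <- tibble(
  pnr      = c("001", "002", "003"),
  recnum   = c(1, 2, 3),
  d_inddto = as.Date(c("2016-02-01", "2016-06-10", "2018-09-30"))
)
toy_diag <- tibble(
  recnum     = c(1, 2),                 # contact 3 has no diagnosis row
  c_diag     = c("DS720", "DI219"),
  c_diagtype = c("A", "A")
)

# Pitfall 1: the pattern without "D" silently matches nothing
n_without_d <- toy_diag %>% filter(grepl("^S72", c_diag)) %>% nrow()
n_with_d    <- toy_diag %>% filter(grepl("^DS72", c_diag)) %>% nrow()
c(without_d = n_without_d, with_d = n_with_d)

# Pitfall 2: left_join keeps the contact that has no diagnosis (c_diag is NA)
n_left  <- left_join(toy_adm, toy_diag, by = "recnum") %>% nrow()
n_inner <- inner_join(toy_adm, toy_diag, by = "recnum") %>% nrow()
c(left_join = n_left, inner_join = n_inner)
```

In the compact version the later filter() on C_DIAG happens to drop the NA rows again, so the result is the same, but the join itself has already carried every contact along.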

Approach 2 - fetch all diagnoses + filter outcome (for multiple outcomes)

Part 1 - build alle_dx

#=====================================================
# Fetch ALL diagnoses from LPR (Approach 2)
#=====================================================
library(fastreg) # read_register - read a register by name
library(arrow)   # open_dataset - fallback without fastreg
library(dplyr)

cohort_pnrs <- unique(cohort$pnr)   # cohort = your study population from Phase 10

#-----------------------------------------------------
# LPR2 somatic (up to March 2019)
#-----------------------------------------------------
lpr_adm  <- read_register("lpr_adm")  %>% rename_with(tolower)
lpr_diag <- read_register("lpr_diag") %>% rename_with(tolower)

lpr2_dx <- lpr_adm %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B")) %>%
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

#-----------------------------------------------------
# LPR3 (March 2019 and onwards)
#-----------------------------------------------------
lpr3_k <- read_register("lpr_a_kontakt")  %>% rename_with(tolower) %>%
  filter(lprindberetningssystem == "LPR3")   # keep only LPR3-system rows (avoid overlap - see warning at top)
lpr3_d <- read_register("lpr_a_diagnose") %>% rename_with(tolower)

lpr3_dx <- lpr3_k %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
    by = "dw_ek_kontakt"
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))

#-----------------------------------------------------
# LPR2 psychiatric (up to March 2019) - only if you need F-codes
#-----------------------------------------------------
psyk_adm  <- read_register("t_psyk_adm")  %>%
  rename_with(tolower) %>% rename(pnr = v_cpr, recnum = k_recnum)
psyk_diag <- read_register("t_psyk_diag") %>%
  rename_with(tolower) %>% rename(recnum = v_recnum)

lpr2_psyk_dx <- psyk_adm %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    psyk_diag %>%
      filter(c_diagtype %in% c("A", "B")) %>%
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

#-----------------------------------------------------
# Combine LPR2 (somatic + psych) + LPR3 into one extract
#-----------------------------------------------------
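alle_dx <- bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx) %>%
  select(pnr, date_contact, c_diag, c_diagtype, icd3)   # drop the join keys (recnum/dw_ek_kontakt)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3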

Why is LPR2 psychiatric included? F-codes (dementia, depression, schizophrenia, alcohol/tobacco) recorded before March 2019 live in the separate psych registers t_psyk_adm/t_psyk_diag, which have different column names (v_cpr, k_recnum, v_recnum). They are included in the flow above. Omit them and your F-code outcomes miss everything before March 2019. If you don’t need F-codes at all, drop the lpr2_psyk_dx block and leave it out of bind_rows().

Using duckplyr? union_all() combines tables before collect() and requires identical column names and types. Rename LPR3 columns to match the LPR2 format before combining - see the onboarding document for an example.

Filter your extracted table for specific outcomes

CODES <- c("G30", "F00", "F01", "F02", "F03")   # dementia - change to your outcome

outcome <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%   # use cohort_clean after exclusion (Phase 10 Step 2)
  filter(date_contact > index_date) %>%   # post-index; use < for baseline covariate
  group_by(pnr) %>%
  arrange(date_contact) %>%
  slice(1) %>%
  ungroup() %>%
  select(pnr, event_date = date_contact)

# Join to cohort - NA = no event (censored at end of study)
result <- cohort %>%
  select(pnr) %>%
  left_join(outcome, by = "pnr")

saveRDS(result, "path/to/extract_dementia.rds")   # change path and filename for each new outcome

Exclusion of prevalent cases - persons who already had the diagnosis before index date - happens in Phase 10, Step 2. Use cohort_clean instead of cohort in the code above after completing that step.
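To check the outcome logic before running it on real data, the same steps can be traced on a toy cohort where the right answer is known in advance. A minimal sketch with made-up values; toy_cohort and toy_dx are hypothetical stand-ins for cohort and alle_dx, and slice_min() is used here as an alternative to arrange() + slice(1) with the same intent.

```r
library(dplyr)

# Toy stand-ins (made-up values)
toy_cohort <- tibble(
  pnr        = c("001", "002", "003"),
  index_date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01"))
)
toy_dx <- tibble(
  pnr          = c("001", "001", "001", "002", "003"),
  date_contact = as.Date(c("2014-06-01", "2016-03-10", "2018-01-05",
                           "2013-02-02", "2017-07-07")),
  icd3         = c("G30", "G30", "F03", "G30", "E11")
)
CODES <- c("G30", "F00", "F01", "F02", "F03")

toy_outcome <- toy_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(toy_cohort, by = "pnr") %>%
  filter(date_contact > index_date) %>%                   # post-index only
  group_by(pnr) %>%
  slice_min(date_contact, n = 1, with_ties = FALSE) %>%   # earliest row per person
  ungroup() %>%
  select(pnr, event_date = date_contact)

# 001: first post-index event; 002: only a pre-index diagnosis; 003: other code
toy_result <- toy_cohort %>% select(pnr) %>% left_join(toy_outcome, by = "pnr")
toy_result
```

Person 002 shows why the prevalent-case exclusion matters: the pre-index diagnosis leaves no event here, so without that exclusion the person would look event-free.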

Try it yourself - runnable example with synthetic data (Approach 1)

This example requires R and RStudio installed locally on your computer - not the DST server. The synthetic dataset (fakeregs) is not available on DST. Download R: cran.r-project.org · Download RStudio: posit.co/download/rstudio-desktop. The example uses open_dataset() on local synth_data/ folders; fastreg’s read_register() is for a configured DST project, not ad-hoc local folders.

The example extracts CVD diagnoses (ischaemic heart disease, ICD-10 I20–I25) from LPR2 and LPR3 combined - the complete pattern from the theory section above, but runnable locally with synthetic data. It follows Approach 1: only rows with the specific codes are kept, and that filter runs before collect().

The synthetic LPR data is generated with the fakeregs package, which you already know from Phase 6 - First extraction. If you have already generated and saved data there, synth_data/lpr_adm/ is ready and you can skip the preparation block.

Adapted from Anders Aasted Isaksen’s dev/common_tasks_datatable.qmd in fakeregs (MIT licence, Steno Diabetes Center Aarhus). Rewritten to dplyr + arrow and adapted to this guide’s pattern.

# Install fakeregs for the first time:
# install.packages("pak"); pak::pak("steno-aarhus/fakeregs")

library(fakeregs)   # synthetic DST register data
library(dplyr)      # filter, select, mutate, inner_join, bind_rows
library(arrow)      # open_dataset, write_parquet

#=====================================================
# Preparation: generate synthetic data (run only once)
#=====================================================
bp             <- generate_background_pop()
lpr_adm_synth  <- generate_lpr_adm(background_df = bp)
lpr_diag_synth <- generate_lpr_diag(background_df = lpr_adm_synth)
lpr_a_k_synth  <- generate_lpr_a_kontakt(background_df = bp)
lpr_a_d_synth  <- generate_lpr_a_diagnose(background_df = lpr_a_k_synth)

dir.create("synth_data/lpr_adm",        recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_diag",       recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_kontakt",  recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_diagnose", recursive = TRUE, showWarnings = FALSE)
write_parquet(lpr_adm_synth,  "synth_data/lpr_adm/lpr_adm.parquet")
write_parquet(lpr_diag_synth, "synth_data/lpr_diag/lpr_diag.parquet")
write_parquet(lpr_a_k_synth,  "synth_data/lpr_a_kontakt/lpr_a_kontakt.parquet")
write_parquet(lpr_a_d_synth,  "synth_data/lpr_a_diagnose/lpr_a_diagnose.parquet")

The path is relative to your working directory - check with getwd(). If you have already run the preparation block in Phase 6, synth_data/lpr_adm/ is already saved.

#=====================================================
# Extract CVD diagnoses (same pattern as Approach 1)
#=====================================================
# The ICD codes we are looking for - change these to your own outcome
CVD_CODES <- c("I20", "I21", "I22", "I23", "I24", "I25")   # ischaemic heart disease

#-----------------------------------------------------
# LPR2 somatic (up to March 2019)
#-----------------------------------------------------
lpr_adm  <- open_dataset("synth_data/lpr_adm/")  %>% rename_with(tolower)   # LPR2 contact table - synthetic
lpr_diag <- open_dataset("synth_data/lpr_diag/") %>% rename_with(tolower)   # LPR2 diagnosis table - synthetic

lpr2_cvd <- lpr_adm %>%
  select(pnr, recnum, date_contact = d_inddto) %>%           # select only necessary columns
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),                    # only action and secondary diagnoses
             substr(c_diag, 2, 4) %in% !!CVD_CODES) %>%       # !! injects the local R vector into the Arrow query
      select(recnum, c_diag),                    # only join key and diagnosis code
    by = "recnum"                                             # join key in LPR2
  ) %>%
  collect() %>%                                              # HERE data is fetched into R
  mutate(icd3 = substr(c_diag, 2, 4))                        # save cleaned code as new column

#-----------------------------------------------------
# LPR3 (March 2019 and onwards)
#-----------------------------------------------------
lpr3_k <- open_dataset("synth_data/lpr_a_kontakt/")  %>% rename_with(tolower)   # LPR3 contact table - synthetic
lpr3_d <- open_dataset("synth_data/lpr_a_diagnose/") %>% rename_with(tolower)   # LPR3 diagnosis table - synthetic

lpr3_cvd <- lpr3_k %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%   # dw_ek_kontakt is join key to lpr_a_diagnose
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",  # exclude retracted diagnoses
             substr(diag_kode, 2, 4) %in% !!CVD_CODES) %>%   # !! injects the local R vector into the Arrow query
      select(dw_ek_kontakt, c_diag = diag_kode),             # rename to c_diag for consistency with LPR2
    by = "dw_ek_kontakt"                                     # join key in LPR3
  ) %>%
  collect() %>%                                              # fetch into R
  mutate(
    date_contact = as.Date(date_contact),                    # datetime → date
    icd3         = substr(c_diag, 2, 4)                      # strip D-prefix: "DI21" → "I21"
  )

#-----------------------------------------------------
# Combine and save
#-----------------------------------------------------
alle_cvd <- bind_rows(lpr2_cvd, lpr3_cvd)                   # stack LPR2 and LPR3

nrow(alle_cvd)                                               # check: number of diagnosis rows
length(unique(alle_cvd$pnr))                                 # check: number of unique individuals
table(alle_cvd$icd3)                                         # distribution across codes

saveRDS(alle_cvd, "path/to/extract_cvd.rds")                # save - change path to your own folder

Wrap the pattern in a reusable function (for multiple outcomes)

If you extract diagnoses for several outcomes, it pays off to encapsulate the Approach 2 pattern in one reusable function rather than copying ~40 lines for each new outcome. Define it at the top of your script or in a separate functions.R file. The function keeps the diagnosis-type column (c_diagtype) in its output, so you can later restrict the case definition for a sensitivity analysis (see Diagnosis types) without re-querying LPR.

Advantages:
One place to fix if something changes (e.g. a new register or a new column).
The code block for each outcome shrinks from ~40 lines to one function call.
A mistake can only be made in one place instead of in each copy.

Name or path? The examples use read_register("<register>") (fastreg, by name). It requires fastreg set up with the path to your registers - see Phase 4 if you did not convert them from SAS yourself. Without fastreg, replace read_register("<register>") with open_dataset("E:/workdata/.../<register>/") (by path). On DARTER fastreg is already set up; see DARTER - overview and pipeline for the fully adapted variant - it is kept up to date with the current, confirmed register names (as of June 2026).

See the full get_lpr_diagnoses() function and usage

library(fastreg) # read_register - read a register by name
library(arrow)   # open_dataset - fallback without fastreg
library(dplyr)

#=====================================================
# Function: get_lpr_diagnoses() - reusable LPR extract
#=====================================================
get_lpr_diagnoses <- function(pnr_vector, icd_codes = NULL, diagtypes = c("A", "B")) {
  # Arguments:
  #   pnr_vector : your cohort's pnr, e.g. unique(cohort$pnr)
  #   icd_codes  : optional vector of ICD codes to pull, e.g. c("E11", "I50") or your
  #                own code list. Matches on the D-prefix, so both 3-char
  #                ("E11" -> all E11x) and 4-char ("I700") work.
  #                Default NULL = pull ALL diagnoses (run once, save, filter locally).
  #   diagtypes  : "A"=action, "B"=secondary, "G"=underlying condition (grundmorbus; LPR2 only)
  # read_register (fastreg) reads each register by name - without fastreg: open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/<register>/") %>% rename_with(tolower)

  # keep_codes() applies the icd_codes filter to EACH of the three register queries
  # below, BEFORE collect(), so only the wanted rows are pulled (that's why it appears
  # three times). icd_codes = NULL -> no filter (pull everything).
  keep_codes <- function(x) {
    if (is.null(icd_codes)) return(x)                                # NULL = pull everything
    pattern <- paste0("^D(", paste(icd_codes, collapse = "|"), ")")  # D-prefix + code (3 or 4 chars)
    x %>% filter(grepl(pattern, c_diag))
  }


  # Open registers
  lpr_adm   <- read_register("lpr_adm")   %>% rename_with(tolower)   # LPR2 somatic contacts
  lpr_diag  <- read_register("lpr_diag")  %>% rename_with(tolower)   # LPR2 somatic diagnoses
  psyk_adm  <- read_register("t_psyk_adm")  %>% rename_with(tolower) %>%
    rename(pnr = v_cpr, recnum = k_recnum)                            # LPR2 psychiatric contacts
  psyk_diag <- read_register("t_psyk_diag") %>% rename_with(tolower) %>%
    rename(recnum = v_recnum)                                          # LPR2 psychiatric diagnoses
  lpr3_k    <- read_register("lpr_a_kontakt")  %>% rename_with(tolower) %>%
    filter(lprindberetningssystem == "LPR3")
  lpr3_d    <- read_register("lpr_a_diagnose") %>% rename_with(tolower)  # LPR3 diagnoses

  # LPR2 somatic
  lpr2_dx <- lpr_adm %>%
    semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      lpr_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag, c_diagtype),
      by = "recnum"
    ) %>%
    keep_codes() %>% # filter to icd_codes (before collect)
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))                       # strip D-prefix

  # LPR2 psychiatric
  lpr2_psyk_dx <- psyk_adm %>%
    semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      psyk_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag, c_diagtype),
      by = "recnum"
    ) %>%
    keep_codes() %>% # filter to icd_codes (before collect)
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))

  # LPR3
  lpr3_dx <- lpr3_k %>%
    semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
    select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
    inner_join(
      lpr3_d %>%
        filter(diag_kode_type %in% !!diagtypes,
               is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
        select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
      by = "dw_ek_kontakt"
    ) %>%
    keep_codes() %>% # filter to icd_codes (before collect)
    collect() %>%
    mutate(date_contact = as.Date(date_contact),               # datetime → date
           icd3 = substr(c_diag, 2, 4))

  bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx) %>%              # combine the three sources
    select(pnr, date_contact, c_diag, c_diagtype, icd3)      # identical columns, drop join keys (recnum/dw_ek_kontakt)
}

Use the function - one call per extraction, only change CODES:

cohort    <- readRDS("path/to/full_cohort.rds")
pnr_list  <- unique(cohort$pnr)

# Fetch all diagnoses for the cohort (Phase 1 - see hospital contacts page)
alle_dx <- get_lpr_diagnoses(
  pnr_vector    = pnr_list,
  diagtypes     = c("A", "B")    # A=action, B=secondary. To also include underlying conditions (grundmorbus), add "G" (LPR2 only): c("A", "B", "G")
)
# Returns: pnr | date_contact | c_diag | c_diagtype | icd3

# Extract one outcome from alle_dx - repeat per outcome (only if you pulled ALL, i.e. no icd_codes)
CODES <- c("F00", "F01", "F02", "F03", "G30", "G31")   # dementia

dementia <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
  filter(date_contact > index_date) %>%
  group_by(pnr) %>% arrange(date_contact) %>% slice(1) %>% ungroup() %>%
  select(pnr, dementia_date = date_contact)

result <- cohort %>% select(pnr) %>% left_join(dementia, by = "pnr")
saveRDS(result, "path/to/extract_dementia.rds")

Two ways to call the function (RAM). For a few codes, pass them via icd_codes - they are filtered BEFORE collect(), so only those rows are pulled: get_lpr_diagnoses(pnr_list, icd_codes = c("E11", "I50")). For many different outcomes, pull everything once (omit icd_codes), save with saveRDS(), free memory with rm() + gc() (Phase 5), then filter locally with readRDS() %>% filter(icd3 %in% ...) - so you don’t re-scan the registers for each outcome.
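The second way (pull once, save, filter locally) can be sketched as follows. The table is a toy stand-in with made-up values for what get_lpr_diagnoses() returns, and tempfile() is used only so the sketch runs anywhere; in a project you would save to your own folder.

```r
library(dplyr)

# Toy stand-in for the full extract (made-up values)
toy_dx <- tibble(
  pnr          = c("001", "001", "002", "003"),
  date_contact = as.Date(c("2016-03-10", "2018-01-05", "2017-07-07", "2019-09-09")),
  c_diag       = c("DG309", "DE119", "DI500", "DE119"),
  icd3         = substr(c_diag, 2, 4)
)

# Save once
dx_file <- tempfile(fileext = ".rds")
saveRDS(toy_dx, dx_file)
rm(toy_dx)
invisible(gc())                       # release the memory held by the full table

# Later, per outcome: reload and filter locally, with no new register scan
diabetes      <- readRDS(dx_file) %>% filter(icd3 %in% c("E10", "E11"))
heart_failure <- readRDS(dx_file) %>% filter(icd3 == "I50")
nrow(diabetes)
nrow(heart_failure)
```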

From event log to one row per person

get_lpr_diagnoses() (and the extracts above) return an event log: many rows per person, one per diagnosis. That is not analysis-ready. Two patterns turn it into one row per person:

A) One outcome with a date (e.g. time-to-event): filter to the codes, and take each person’s first contact - the slice(1) pattern above.

B) Many binary variables at once (e.g. comorbidity flags): build the flags, collapse with any(), and merge onto the cohort. any() returns TRUE if the person has at least one matching diagnosis in the window (0/1, not a count):

library(dplyr)
library(stringr)
library(tidyr) # replace_na

# 1. Build your flags (one expression per variable)
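diagnosis_flags <- alle_dx %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>% # alle_dx has no index_date - add it from the cohort
  filter(date_contact < index_date) %>% # before index only (baseline covariate)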
  mutate(
    dx_diabetes = icd3 %in% c("E10", "E11"),
    dx_copd = str_detect(c_diag, "^DJ4[1-4]"),
    dx_heart_failure = str_detect(c_diag, "^DI50")
  )

# 2. Collapse to one row per person: any() = does the person have at least one?
diagnosis_flags_collapsed <- diagnosis_flags %>%
  group_by(pnr) %>%
  summarise(across(starts_with("dx_"), ~ as.integer(any(.))), .groups = "drop")

# 3. Merge onto the cohort; people with no diagnosis get NA -> set to 0
cohort_diagnoses <- cohort %>%
  left_join(diagnosis_flags_collapsed, by = "pnr") %>%
  mutate(across(starts_with("dx_"), ~ replace_na(., 0)))

It gives one 0/1 flag per comorbidity per person, ready to enter your model as covariates. (The same group_by(pnr) %>% summarise(any(.)) pattern is used on the NMI page, for example.)

Classify patient type: inpatient, outpatient or emergency

Many studies need to tell inpatient contacts apart from outpatient visits and emergency-room visits (e.g. only admissions as an outcome, or acute contacts as a marker). There is no single shared field across LPR2 and LPR3 - you derive the type.

Column names and codes must be verified against your own extract. The logic itself is durable (the codes ALCA00 and ATA1 are confirmed LPR3 values as of 2025), but the exact column names in LPR_A may differ from the example. Check with arrow::schema()/colnames() and look them up in Overview of registers. The pattern is adapted from the Plana-Ripoll group’s code on OSF.

LPR2 (up to March 2019). Patient type lives in c_pattype, but the emergency-room coding changed around 2014:

lpr2_type <- lpr_adm %>% # lpr_adm from the extract above
  mutate(
    patienttype = case_when(
      c_pattype %in% c("0", "1", "4", "5") ~ "inpatient", # full-day/part-day/day/night
      c_pattype == "3" ~ "emergency", # explicit ER (mostly pre-2014)
      c_pattype == "2" & c_indm == "1" ~ "emergency", # from ~2014: acute admission mode = ER
      c_pattype == "2" ~ "outpatient",
      TRUE ~ NA_character_
    )
  )

LPR3 (March 2019 and onwards). There is no direct equivalent of c_pattype. The kont_type code ALCA00 means physical attendance (not admission), and the prioritet code ATA1 means acute. A common research approach derives patient type from the contact’s duration plus an ER/acute marker:

lpr3_type <- lpr3_k %>% # lpr3_k from the extract above
  filter(kont_type == "ALCA00") %>% # keep physical attendances; drop phone/video
  mutate(
    duration_hours = as.numeric(
      difftime(kont_sluttidspunkt, kont_starttidspunkt, units = "hours") # verify the end-time column name
    ),
    patienttype = case_when(
      duration_hours >= 8 ~ "inpatient", # >= 8 hours ~ admission
      enhedstype_ans == "skadestue" & prioritet == "ATA1" ~ "emergency", # acute ER
      TRUE ~ "outpatient"
    )
  )

The 8-hour cut-off is a heuristic, not an official definition - LPR3 has no “inpatient” field. Pick and document your own threshold, and clarify with your data manager. The column names for start/end time and unit type vary between deliveries; verify them before use.
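One way to document the chosen threshold is to make it an argument and count how many contacts change class when it moves. A minimal sketch under stated assumptions: classify_lpr3() is a hypothetical helper, the contacts are made-up in-memory values, "other" is a placeholder value, and the column names follow the example above and must be verified against your own delivery.

```r
library(dplyr)

# Toy LPR3-style contacts (made-up values); durations are 1, 10, 12 and 2.5 hours
toy_k <- tibble(
  dw_ek_kontakt       = 1:4,
  kont_starttidspunkt = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-02 10:00:00",
                                     "2020-01-03 22:00:00", "2020-01-04 09:00:00"), tz = "UTC"),
  kont_sluttidspunkt  = as.POSIXct(c("2020-01-01 09:00:00", "2020-01-02 20:00:00",
                                     "2020-01-04 10:00:00", "2020-01-04 11:30:00"), tz = "UTC"),
  enhedstype_ans      = c("other", "other", "other", "skadestue"),
  prioritet           = c("other", "ATA1", "ATA1", "ATA1")
)

# Hypothetical helper: the cut-off is an argument so it can be varied and reported
classify_lpr3 <- function(data, inpatient_hours = 8) {
  data %>%
    mutate(
      duration_hours = as.numeric(
        difftime(kont_sluttidspunkt, kont_starttidspunkt, units = "hours")
      ),
      patienttype = case_when(
        duration_hours >= inpatient_hours ~ "inpatient",
        enhedstype_ans == "skadestue" & prioritet == "ATA1" ~ "emergency",
        TRUE ~ "outpatient"
      )
    )
}

# Cross-tabulate the classification at 8 hours against 12 hours
at_8  <- classify_lpr3(toy_k, 8)$patienttype
at_12 <- classify_lpr3(toy_k, 12)$patienttype
table(at_8, at_12)
```

In the toy data only the 10-hour contact changes class (inpatient at 8 hours, outpatient at 12); the same cross-table on real data gives a concrete number to report alongside the threshold.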

Remove unwanted diagnoses

A few codes are administrative artefacts rather than the patient’s own disease, and should typically be removed from both outcomes and comorbidity:

“Healthy companion” (someone admitted as a companion to another patient, e.g. a parent): ICD-10 DZ763. Related contact/observation codes with no disease: DZ032, DZ038, DZ039.
“Diagnosis not found”/unspecified from the ICD-8 era: Y719.

alle_dx <- alle_dx %>%
  filter(
    !substr(toupper(c_diag), 1, 5) %in% c("DZ763", "DZ032", "DZ038", "DZ039"),
    substr(toupper(c_diag), 1, 4) != "Y719"
  )

What to remove depends on project and question - DZ03* (observation for suspected disease) is for instance relevant in some studies and should be kept there. The pattern is adapted from the Plana-Ripoll group’s code on OSF; verify against your own data.
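Because the removal is a project decision, it can be useful to flag the rows first and count them before dropping anything. A minimal sketch on a toy extract with made-up values; ARTEFACT_D is a hypothetical name for the code list given in the text above.

```r
library(dplyr)

# Toy extract (made-up values); one code is lower-case to show why toupper() is used
toy_dx <- tibble(
  pnr    = c("001", "002", "003", "004", "005"),
  c_diag = c("DZ763", "DE119", "dz032", "Y719", "DI219")
)

ARTEFACT_D <- c("DZ763", "DZ032", "DZ038", "DZ039")   # code list from the text above

# Flag first, so the removal can be counted and reported
flagged <- toy_dx %>%
  mutate(
    artefact = substr(toupper(c_diag), 1, 5) %in% ARTEFACT_D |
               substr(toupper(c_diag), 1, 4) == "Y719"
  )

count(flagged, artefact)          # how many rows would be dropped

# Then drop the flagged rows
cleaned <- flagged %>% filter(!artefact) %>% select(-artefact)
cleaned$c_diag
```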

Next steps

You have now extracted diagnoses from two LPR generations. Next steps are to shape and combine your extracts:

→ Phase 12 - Assemble and prepare the dataset
