Extract from LPR
Practical recipes - the two approaches, helper functions and integration with the cohort
This page shows how to extract diagnoses from LPR with code. It builds on the structure from Understand LPR: the periods (LPR2/LPR3), the D-prefix, the diagnosis types (A/B/G) and the filter for retracted diagnoses. Read that page first if you haven’t already.
You use the same extraction pattern as in Phase 5 and Phase 6, applied to LPR’s two generations. This page is the most important and probably the most complex part of the guide.
Circular dependency with Phase 10 - deliberate. To extract diagnoses (this page) we assume you already have a cohort; but you build the cohort in Phase 10 using exactly the extraction pattern you learn here. Read the phases in order, and come back to the code when your cohort is ready. The code uses inner_join() and bind_rows(); if these are new, they are explained in detail in Phase 12.
LPR3 files and column names - check before you load.
- Use the LPR_A files consistently. DST has switched LPR3 from the old LPR_F format (kontakter, diagnoser, forloeb) to the new LPR_A format (lpr_a_kontakt, lpr_a_diagnose). Both versions may sit in your folder and cover the same years - loading both (or mixing old and new) gives you duplicated rows. Use lpr_a_* as in the examples here, and leave the LPR_F files alone. Some projects also have older data inside lpr_a_kontakt that is already in LPR2; keep only the LPR3 rows with filter(lprindberetningssystem == "LPR3"). This is project-specific - on DARTER it is handled in pitfall 5; otherwise check with your data manager.
- Verify column names. The switch to LPR_A changed a number of variable names (often opaque Danish abbreviations). The examples use the names the registers typically have, but always check before relying on a column name. On a lazy open_dataset() object use arrow::schema(your_data) - it lists the column names and their types without reading data into RAM; colnames(your_data) also works (and is what you use for the DuckDB-backed objects from read_register()). Look names up in Overview of registers.
- Mind the 2019-2020 diagnosis spike. The LPR3 contact model registers far more outpatient diagnoses than LPR2 did, producing a spike in diagnosis counts around 2019-2020 that matters for any count- or trend-based analysis. See the diagnosis spike.
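A minimal sketch of the "verify column names" step, assuming you want the script to stop early when a column is missing. The helper check_columns() and the toy data frame are hypothetical and not part of any register package; on a real lazy object you would pass colnames(your_data) as described above.

```r
# Hypothetical helper: stop early if an expected column is missing.
# Names are lower-cased first because the examples use rename_with(tolower).
check_columns <- function(available, required) {
  missing <- setdiff(required, tolower(available))
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for a register (made-up values)
toy_kontakt <- data.frame(
  PNR = "1",
  DW_EK_KONTAKT = "k1",
  KONT_STARTTIDSPUNKT = "2020-01-01"
)

# All three columns exist, so this passes silently
check_columns(colnames(toy_kontakt), c("pnr", "dw_ek_kontakt", "kont_starttidspunkt"))

# A renamed or misspelled column is reported by name
res <- tryCatch(
  check_columns(colnames(toy_kontakt), c("pnr", "kont_type")),
  error = function(e) conditionMessage(e)
)
res
# → "Missing columns: kont_type"
```

Running a check like this right after opening a register reports a renamed column by name, rather than as a less obvious error deep inside a join.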
Fetch diagnoses from LPR - choose your approach
Choose one of two approaches depending on your study:
| | Approach 1 - direct extraction | Approach 2 - alle_dx |
|---|---|---|
| Best when | You have one or a few outcomes | You have multiple outcomes from LPR |
| Workflow | Fetch specific codes → Exclude → done | Fetch all → Exclude → filter per outcome |
| Advantage | Simpler and faster for single-outcome studies | LPR queried only once; reused for all outcomes |
Approach 1 is best for a smaller number of outcomes. Filter on specific ICD codes directly in the filter() step before collect(). DuckDB/Arrow pushes the filter down to the storage layer - only matching rows are loaded into RAM.
Approach 2 is best when your study has multiple outcomes. You query LPR once and build alle_dx: a shared table with all A and B diagnoses. For each new outcome, filter alle_dx on the relevant codes - the only line you change is the code list.
The examples require parquet files and a completed study population. kohort is the data.frame with pnr and index_date per person (the outcome examples further down call the same object cohort) - see Phase 10.
Adapt the path in open_dataset(...) to your project’s data folder (the examples’ E:/workdata/[projectnumber]/... is project-specific). Note: LPR is one of the largest registers, so the data must be in parquet. That lets Arrow/DuckDB filter before loading and fetch only the rows you ask for; reading LPR directly from e.g. SAS pulls the whole register into RAM. Convert SAS to parquet first: Phase 4 - SAS to Parquet.
Why semi_join(tibble(pnr = ...)) and not filter(pnr %in% ...)? Both keep only the cohort’s rows, but a semi_join against a small tibble of pnr’s pushes down into Arrow/DuckDB more efficiently and reliably. A large %in% filter with an R vector can get slow or be rejected outright (especially with an older duckplyr). It also frees you from !!: semi_join takes an ordinary local table as its argument. You still need !! when you pass a local R vector into a filter(), e.g. a code list (substr(c_diag, 2, 4) %in% !!CODES).
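A small sketch on in-memory toy data, to show that the two forms select the same rows. The tables and values are made up for illustration. On ordinary tibbles the two are interchangeable; the efficiency difference described above only applies to lazy Arrow/DuckDB objects.

```r
library(dplyr)

# Toy stand-ins (made up): a contact table and the cohort's pnr vector
contacts <- tibble(
  pnr    = c("1", "1", "2", "3", "4"),
  recnum = c("a", "b", "c", "d", "e")
)
cohort_pnrs <- c("1", "3")

# Variant 1: semi_join against a one-column tibble (the form used on this page)
via_join <- contacts %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr")

# Variant 2: filter with %in% on a local vector
via_in <- contacts %>%
  filter(pnr %in% cohort_pnrs)

via_join  # rows a, b (pnr 1) and d (pnr 3)
via_in    # the same three rows
```

semi_join() only filters: it never adds columns from the right-hand table and never duplicates rows, even if a pnr appears twice in the lookup tibble.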
Approach 1 - fetch specific diagnoses directly (start here for one outcome)
Filter on specific codes before collect(). The example fetches diabetes mellitus (E10–E14) - replace CODES_REGEX with your own codes.
#=====================================================
# Extract diabetes diagnoses from LPR (Approach 1)
#=====================================================
library(arrow)
library(dplyr)
cohort_pnrs <- unique(kohort$pnr)
CODES_REGEX <- "^DE1[0-4]" # diabetes mellitus (E10–E14) - with D-prefix
#-----------------------------------------------------
# LPR2 (somatic): lpr_adm + lpr_diag
#-----------------------------------------------------
lpr_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)
# Here we combine LPR2's two tables: the contact register lpr_adm (pnr + dates) and the diagnosis
# register lpr_diag (the ICD code), filter on your codes/diagnosis types, and select the columns you need.
# You can pull as many codes at once as you like (via CODES_REGEX), and you choose the object name YOURSELF
# (here 'lpr2_dm' = LPR2 + diabetes mellitus, because the example is diabetes).
lpr2_dm <- lpr_adm %>%
semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # ONLY your cohort. OMIT this line if you want the WHOLE population
select(pnr, recnum, date_contact = d_inddto) %>% # pick columns from LPR2; d_inddto is RENAMED to date_contact (see note below)
inner_join(
lpr_diag %>% # diagnosis register: has recnum + ICD code, but neither pnr nor date
filter(c_diagtype %in% c("A", "B"), # A + B = action/secondary diagnoses; add "G" (-> c("A","B","G")) if you also want underlying conditions
grepl(CODES_REGEX, c_diag)) %>% # keep only your codes - filter BEFORE collect (D-prefix in the regex)
select(recnum, c_diag, c_diagtype), # the diagnosis columns we need
by = "recnum" # recnum = the key linking contact and diagnosis in LPR2
) %>%
collect() %>% # ONLY here is data pulled into RAM (everything above runs in the database)
mutate(icd3 = substr(c_diag, 2, 4)) # strip the D-prefix: "DE11" -> "E11"
#-----------------------------------------------------
# LPR3: lpr_a_kontakt + lpr_a_diagnose
#-----------------------------------------------------
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)
lpr3_dm <- lpr3_k %>% # same pattern as LPR2 - but LPR3 tables and different column names
semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # ONLY your cohort. OMIT this line for the WHOLE population
select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>% # LPR3's date column renamed to the SAME name: date_contact
inner_join(
lpr3_d %>%
filter(diag_kode_type %in% c("A", "B"), # as in LPR2: add "G" if you also want underlying conditions
is.na(senere_afkraeftet) | senere_afkraeftet != "Ja", # drop withdrawn diagnoses (LPR3 only)
grepl(CODES_REGEX, diag_kode)) %>% # same codes; filter BEFORE collect
select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type), # rename LPR3 names -> same names as LPR2
by = "dw_ek_kontakt" # LPR3's contact key (NOT recnum as in LPR2)
) %>%
collect() %>%
mutate(date_contact = as.Date(date_contact), # the LPR3 date is a datetime -> make it a plain date
icd3 = substr(c_diag, 2, 4)) # strip the D-prefix
#-----------------------------------------------------
# Combine LPR2 + LPR3 into one extract
#-----------------------------------------------------
# Works because we gave the two extracts the SAME column names above (date_contact, c_diag, ...):
dm_dx <- bind_rows(lpr2_dm, lpr3_dm) # one combined extract (LPR2 + LPR3)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus the join keys recnum and dw_ek_kontakt, NA where they do not apply)
What is date_contact = d_inddto? (why is date_contact coloured?) Inside select(), new_name = old_name means you rename a column. d_inddto is LPR2’s actual date column, and date_contact is a name you choose - the editor colours it like an argument, but it’s just the new column name (not a function argument). We rename because LPR3 has a different date column (kont_starttidspunkt); giving both the name date_contact makes the two extracts share column names, so bind_rows() can stack them.
Must vs. choice: the register names (pnr, recnum, d_inddto, c_diag, c_diagtype, dw_ek_kontakt, kont_starttidspunkt, diag_kode …) must be spelled exactly as in the register. The new names (date_contact, icd3) and the object name (lpr2_dm) are yours to choose - just keep the harmonized names identical across LPR2 and LPR3.
Many specific codes? Build the regex programmatically:
codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")
Do you have F-codes (e.g. dementia, depression)? Extend the regex to include them, e.g. "^DE1[0-4]|^DF0[0-3]|^DG30", and add psychiatric LPR2 - see Approach 2 below for the code.
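As a quick check of what a programmatically built pattern matches, the sketch below runs it against a few toy codes. The code values are made up for illustration; only the leading "D" mirrors the DST format described on this page.

```r
# Build the pattern from a plain list of ICD-10 codes
codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")
CODES_REGEX
# → "^D(E10|E11|E12|E13|E14)"

# Toy codes in DST format (leading "D"); values are made up
c_diag <- c("DE109", "DE119", "DE149", "DE660", "DI219")

with_prefix <- grepl(CODES_REGEX, c_diag)
with_prefix
# → TRUE TRUE TRUE FALSE FALSE

# The same alternatives WITHOUT the D match nothing, and R gives no warning
without_prefix <- grepl("^(E10|E11|E12|E13|E14)", c_diag)
without_prefix
# → FALSE FALSE FALSE FALSE FALSE
```

A check like this on a handful of known codes is a cheap way to catch a missing prefix before running the pattern against the full register.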
Alternative: compact extraction (single-table approach)
A colleague may have shown you this shorter approach:
lpr <- left_join(lpr_adm, lpr_diag, by = "RECNUM") %>%
filter(C_DIAGTYPE == "A",
grepl("^S72", C_DIAG)) %>%
group_by(PNR) %>%
filter(D_INDDTO == min(D_INDDTO)) %>%
slice(1) %>%
ungroup()
It is shorter but has three pitfalls on DST data:
- D-prefix error: "^S72" does NOT match "DS72..." in DST data - returns zero rows with no error message. Use "^DS72" (with D) or strip the prefix first.
- left_join instead of inner_join: Keeps all admissions from lpr_adm - including those with no matching diagnosis. Unnecessarily heavy on national registers.
- No pnr filter: Loads the entire population’s data. Correct when building a cohort (Phase 10), not when extracting from an existing one.
Approach 2 - fetch all diagnoses + filter outcome (for multiple outcomes)
Part 1 - build alle_dx
#=====================================================
# Fetch ALL diagnoses from LPR (Approach 2)
#=====================================================
library(arrow)
library(dplyr)
cohort_pnrs <- unique(kohort$pnr)
#-----------------------------------------------------
# LPR2 somatic (up to March 2019)
#-----------------------------------------------------
lpr_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)
lpr2_dx <- lpr_adm %>%
semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
select(pnr, recnum, date_contact = d_inddto) %>%
inner_join(
lpr_diag %>%
filter(c_diagtype %in% c("A", "B")) %>%
select(recnum, c_diag, c_diagtype),
by = "recnum"
) %>%
collect() %>%
mutate(icd3 = substr(c_diag, 2, 4))
#-----------------------------------------------------
# LPR3 (March 2019 and onwards)
#-----------------------------------------------------
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)
lpr3_dx <- lpr3_k %>%
semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
inner_join(
lpr3_d %>%
filter(diag_kode_type %in% c("A", "B"),
is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
by = "dw_ek_kontakt"
) %>%
collect() %>%
mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))
#-----------------------------------------------------
# Combine LPR2 + LPR3 into one extract
#-----------------------------------------------------
alle_dx <- bind_rows(lpr2_dx, lpr3_dx)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus the join keys recnum and dw_ek_kontakt, NA where they do not apply)
Do you have F-codes (e.g. dementia, depression)? Psychiatric diagnoses recorded before March 2019 are in separate registers. Add them before bind_rows():
psyk_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_adm/") %>%
rename_with(tolower) %>% rename(pnr = v_cpr, recnum = k_recnum)
psyk_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_diag/") %>%
rename_with(tolower) %>% rename(recnum = v_recnum)
lpr2_psyk_dx <- psyk_adm %>%
semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>%
select(pnr, recnum, date_contact = d_inddto) %>%
inner_join(psyk_diag %>% filter(c_diagtype %in% c("A", "B")) %>%
select(recnum, c_diag, c_diagtype), by = "recnum") %>%
collect() %>% mutate(icd3 = substr(c_diag, 2, 4))
alle_dx <- bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx)
Using duckplyr? union_all() combines tables before collect() and requires identical column names and types. Rename LPR3 columns to match the LPR2 format before combining - see the onboarding document for an example.
Filter your extracted table for specific outcomes
CODES <- c("G30", "F00", "F01", "F02", "F03") # dementia - change to your outcome
outcome <- alle_dx %>%
filter(icd3 %in% CODES) %>%
inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>% # use cohort_clean after exclusion (Phase 10 Step 2)
filter(date_contact > index_date) %>% # post-index; use < for baseline covariate
group_by(pnr) %>%
arrange(date_contact) %>%
slice(1) %>%
ungroup() %>%
select(pnr, event_date = date_contact)
# Join to cohort - NA = no event (censored at end of study)
result <- cohort %>%
select(pnr) %>%
left_join(outcome, by = "pnr")
saveRDS(result, "sti/til/extract_dementia.rds") # change filename for each new outcome
Exclusion of prevalent cases - persons who already had the diagnosis before index date - happens in Phase 10, Step 2. Use cohort_clean instead of cohort in the code above after completing that step.
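To see what the outcome step returns, here is a minimal sketch on toy data. The helper first_event_after_index() and all values are hypothetical. It uses slice_min() in place of arrange() + slice(1), which picks the same earliest row per person; with_ties = FALSE guarantees one row even if two contacts share a date.

```r
library(dplyr)

# Toy data (made up); column names follow the examples above
toy_cohort <- tibble(
  pnr        = c("1", "2", "3"),
  index_date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01"))
)
toy_dx <- tibble(
  pnr          = c("1", "1", "2", "3"),
  date_contact = as.Date(c("2016-05-01", "2018-02-01", "2014-03-01", "2017-01-01")),
  icd3         = c("G30", "F00", "G30", "I21")
)

# Hypothetical helper: first contact with one of `codes` strictly after index_date
first_event_after_index <- function(dx, cohort, codes) {
  events <- dx %>%
    filter(icd3 %in% codes) %>%
    inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
    filter(date_contact > index_date) %>%
    group_by(pnr) %>%
    slice_min(date_contact, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(pnr, event_date = date_contact)

  # One row per cohort member; NA = no event after index
  cohort %>% select(pnr) %>% left_join(events, by = "pnr")
}

result <- first_event_after_index(toy_dx, toy_cohort, c("G30", "F00", "F01", "F02", "F03"))
result
# pnr 1: 2016-05-01 (earliest of two dementia contacts after index)
# pnr 2: NA (the only G30 contact is before index, i.e. a prevalent case)
# pnr 3: NA (no dementia code at all)
```

Person 2 shows why the exclusion step matters: the NA here means "no event after index", not "never diagnosed".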
Try it yourself - runnable example with synthetic data (Approach 1)
This example requires RStudio installed locally on your computer - not the DST server. The synthetic dataset (fakeregs) is not available on DST. Download R: cran.r-project.org · Download RStudio: posit.co/download/rstudio-desktop. The example uses open_dataset() on local synth_data/ folders; fastreg’s read_register() is for a configured DST project, not ad-hoc local folders.
The example extracts CVD diagnoses (ischaemic heart disease, ICD-10 I20–I25) from LPR2 and LPR3 combined - the complete pattern from the theory section above, but runnable locally with synthetic data. It follows Approach 1: the filter on specific codes is applied before collect().
The synthetic LPR data is generated with the fakeregs package, which you already know from Phase 6 - First extraction. If you have already generated and saved data there, synth_data/lpr_adm/ is ready and you can skip the preparation block.
Adapted from Anders Aasted Isaksen’s dev/common_tasks_datatable.qmd in fakeregs (MIT licence, Steno Diabetes Center Aarhus). Rewritten to dplyr + arrow and adapted to this guide’s pattern.
# Install fakeregs for the first time:
# install.packages("pak"); pak::pak("steno-aarhus/fakeregs")
library(fakeregs) # synthetic DST register data
library(dplyr) # filter, select, mutate, inner_join, bind_rows
library(arrow) # open_dataset, write_parquet
#=====================================================
# Preparation: generate synthetic data (run only once)
#=====================================================
bp <- generate_background_pop()
lpr_adm_synth <- generate_lpr_adm(background_df = bp)
lpr_diag_synth <- generate_lpr_diag(background_df = lpr_adm_synth)
lpr_a_k_synth <- generate_lpr_a_kontakt(background_df = bp)
lpr_a_d_synth <- generate_lpr_a_diagnose(background_df = lpr_a_k_synth)
dir.create("synth_data/lpr_adm", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_diag", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_kontakt", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_diagnose", recursive = TRUE, showWarnings = FALSE)
write_parquet(lpr_adm_synth, "synth_data/lpr_adm/lpr_adm.parquet")
write_parquet(lpr_diag_synth, "synth_data/lpr_diag/lpr_diag.parquet")
write_parquet(lpr_a_k_synth, "synth_data/lpr_a_kontakt/lpr_a_kontakt.parquet")
write_parquet(lpr_a_d_synth, "synth_data/lpr_a_diagnose/lpr_a_diagnose.parquet")
The path is relative to your working directory - check with getwd(). If you have already run the preparation block in Phase 6, synth_data/lpr_adm/ is already saved.
#=====================================================
# Extract CVD diagnoses (same pattern as Approach 1)
#=====================================================
# The ICD codes we are looking for - change these to your own outcome
CVD_CODES <- c("I20", "I21", "I22", "I23", "I24", "I25") # ischaemic heart disease
#-----------------------------------------------------
# LPR2 somatic (up to March 2019)
#-----------------------------------------------------
lpr_adm <- open_dataset("synth_data/lpr_adm/") %>% rename_with(tolower) # LPR2 contact table - synthetic
lpr_diag <- open_dataset("synth_data/lpr_diag/") %>% rename_with(tolower) # LPR2 diagnosis table - synthetic
lpr2_cvd <- lpr_adm %>%
select(pnr, recnum, date_contact = d_inddto) %>% # select only necessary columns
inner_join(
lpr_diag %>%
filter(c_diagtype %in% c("A", "B"), # only action and secondary diagnoses
substr(c_diag, 2, 4) %in% !!CVD_CODES) %>% # !! sends the local R vector to DuckDB
select(recnum, c_diag), # only join key and diagnosis code
by = "recnum" # join key in LPR2
) %>%
collect() %>% # HERE data is fetched into R
mutate(icd3 = substr(c_diag, 2, 4)) # save cleaned code as new column
#-----------------------------------------------------
# LPR3 (March 2019 and onwards)
#-----------------------------------------------------
lpr3_k <- open_dataset("synth_data/lpr_a_kontakt/") %>% rename_with(tolower) # LPR3 contact table - synthetic
lpr3_d <- open_dataset("synth_data/lpr_a_diagnose/") %>% rename_with(tolower) # LPR3 diagnosis table - synthetic
lpr3_cvd <- lpr3_k %>%
select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>% # dw_ek_kontakt is join key to lpr_a_diagnose
inner_join(
lpr3_d %>%
filter(diag_kode_type %in% c("A", "B"),
is.na(senere_afkraeftet) | senere_afkraeftet != "Ja", # exclude retracted diagnoses
substr(diag_kode, 2, 4) %in% !!CVD_CODES) %>% # !! sends the local R vector to DuckDB
select(dw_ek_kontakt, c_diag = diag_kode), # rename to c_diag for consistency with LPR2
by = "dw_ek_kontakt" # join key in LPR3
) %>%
collect() %>% # fetch into R
mutate(
date_contact = as.Date(date_contact), # datetime → date
icd3 = substr(c_diag, 2, 4) # strip D-prefix: "DI21" → "I21"
)
#-----------------------------------------------------
# Combine and save
#-----------------------------------------------------
alle_cvd <- bind_rows(lpr2_cvd, lpr3_cvd) # stack LPR2 and LPR3
nrow(alle_cvd) # check: number of diagnosis rows
length(unique(alle_cvd$pnr)) # check: number of unique individuals
table(alle_cvd$icd3) # distribution across codes
saveRDS(alle_cvd, "sti/til/extract_cvd.rds") # save - change path to your own folder
Wrap the pattern in a reusable function (for multiple outcomes)
If you extract diagnoses for several outcomes, it pays off to encapsulate the Approach 2 pattern in one reusable function rather than copying ~40 lines for each new outcome. Define it at the top of your script or in a separate functions.R file. The function keeps the diagnosis-type column (c_diagtype) in its output, so you can later restrict the case definition for a sensitivity analysis (see Diagnosis types) without re-querying LPR.
Advantages:
- One place to fix if something changes (e.g. a new register or a new column)
- The code block for each outcome is reduced from ~40 lines to one function call
- A mistake can only be made in one place, not separately in each copy
Using fastreg, or working on DARTER? In both cases, replace open_dataset("E:/workdata/.../<register>/") with fastreg’s read_register("<register>"), which opens a register by name. See DARTER - overview and pipeline for the fully adapted variant - it is kept up to date with the current, confirmed register names (as of June 2026).
See the full get_lpr_diagnoses() function and usage
library(arrow)
library(dplyr)
#=====================================================
# Function: get_lpr_diagnoses() - reusable LPR extract
#=====================================================
get_lpr_diagnoses <- function(pnr_vector, diagtypes = c("A", "B"), inpatient_only = FALSE) {
base <- "E:/workdata/[projectnumber]/cleaned-data/parquet-registers/"
# Open registers
lpr_adm <- open_dataset(paste0(base, "lpr_adm/")) %>% rename_with(tolower) # LPR2 somatic contacts
lpr_diag <- open_dataset(paste0(base, "lpr_diag/")) %>% rename_with(tolower) # LPR2 somatic diagnoses
psyk_adm <- open_dataset(paste0(base, "t_psyk_adm/")) %>% rename_with(tolower) %>%
rename(pnr = v_cpr, recnum = k_recnum) # LPR2 psychiatric contacts
psyk_diag <- open_dataset(paste0(base, "t_psyk_diag/")) %>% rename_with(tolower) %>%
rename(recnum = v_recnum) # LPR2 psychiatric diagnoses
lpr3_k <- open_dataset(paste0(base, "lpr_a_kontakt/")) %>% rename_with(tolower) %>%
filter(lprindberetningssystem == "LPR3") # CRITICAL (DARTER): keep only rows from the LPR3 system - avoid overlapping rows
lpr3_d <- open_dataset(paste0(base, "lpr_a_diagnose/")) %>% rename_with(tolower) # LPR3 diagnoses
# Filter on admission type if desired
if (inpatient_only) {
lpr_adm <- lpr_adm %>% filter(c_pattype == "0") # "0" = inpatient (full-day) in LPR2
lpr3_k <- lpr3_k %>% filter(kont_type == "ALCA00") # ALCA00 = physical attendance (NOT = inpatient) - see "Classify patient type" below
}
# LPR2 somatic
lpr2_dx <- lpr_adm %>%
semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
select(pnr, recnum, date_contact = d_inddto) %>%
inner_join(
lpr_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag, c_diagtype),
by = "recnum"
) %>%
collect() %>%
mutate(icd3 = substr(c_diag, 2, 4)) # strip D-prefix
# LPR2 psychiatric
lpr2_psyk_dx <- psyk_adm %>%
semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
select(pnr, recnum, date_contact = d_inddto) %>%
inner_join(
psyk_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag, c_diagtype),
by = "recnum"
) %>%
collect() %>%
mutate(icd3 = substr(c_diag, 2, 4))
# LPR3
lpr3_dx <- lpr3_k %>%
semi_join(tibble(pnr = pnr_vector), by = "pnr") %>%
select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
inner_join(
lpr3_d %>%
filter(diag_kode_type %in% !!diagtypes,
is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
by = "dw_ek_kontakt"
) %>%
collect() %>%
mutate(date_contact = as.Date(date_contact), # datetime → date
icd3 = substr(c_diag, 2, 4))
bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx) # return combined table
}
Use the function - one call per extraction, only change CODES:
cohort <- readRDS("sti/til/full_cohort.rds")
pnr_list <- unique(cohort$pnr)
# Fetch all diagnoses for the cohort (Phase 1 - see hospital contacts page)
alle_dx <- get_lpr_diagnoses(
pnr_vector = pnr_list,
diagtypes = c("A", "B"),
inpatient_only = FALSE
)
# Returns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus the join keys recnum and dw_ek_kontakt, NA where they do not apply)
# Extract one outcome - only change CODES (Phase 2)
CODES <- c("F00", "F01", "F02", "F03", "G30", "G31") # dementia
dementia <- alle_dx %>%
filter(icd3 %in% CODES) %>%
inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
filter(date_contact > index_date) %>%
group_by(pnr) %>% arrange(date_contact) %>% slice(1) %>% ungroup() %>%
select(pnr, dementia_date = date_contact)
result <- cohort %>% select(pnr) %>% left_join(dementia, by = "pnr")
saveRDS(result, "sti/til/extract_dementia.rds")
Classify patient type: inpatient, outpatient or emergency
Many studies need to tell inpatient contacts apart from outpatient visits and emergency-room visits (e.g. only admissions as an outcome, or acute contacts as a marker). There is no single shared field across LPR2 and LPR3 - you derive the type.
Column names and codes must be verified against your own extract. The logic itself is durable (the codes ALCA00 and ATA1 are confirmed LPR3 values as of 2025), but the exact column names in LPR_A may differ from the example. Check with arrow::schema()/colnames() and look them up in Overview of registers. The pattern is adapted from the Plana-Ripoll group’s code on OSF.
LPR2 (up to March 2019). Patient type lives in c_pattype, but the emergency-room coding changed around 2014:
lpr2_type <- lpr_adm %>% # lpr_adm from the extract above
mutate(patienttype = case_when(
c_pattype %in% c("0", "1", "4", "5") ~ "inpatient", # full-day/part-day/day/night
c_pattype == "3" ~ "emergency", # explicit ER (mostly pre-2014)
c_pattype == "2" & c_indm == "1" ~ "emergency", # from ~2014: acute admission mode = ER
c_pattype == "2" ~ "outpatient",
TRUE ~ NA_character_
))
LPR3 (March 2019 and onwards). There is no direct c_pattype. The kont_type code ALCA00 means physical attendance (not admission), and the prioritet code ATA1 means acute. A common research approach derives patient type from the contact’s duration plus an ER/acute marker:
lpr3_type <- lpr3_k %>% # lpr3_k from the extract above
filter(kont_type == "ALCA00") %>% # keep physical attendances; drop phone/video
mutate(
duration_hours = as.numeric(
difftime(kont_sluttidspunkt, kont_starttidspunkt, units = "hours") # verify the end-time column name
),
patienttype = case_when(
duration_hours >= 8 ~ "inpatient", # >= 8 hours ~ admission
enhedstype_ans == "skadestue" & prioritet == "ATA1" ~ "emergency", # acute ER
TRUE ~ "outpatient"
)
)
The 8-hour cut-off is a heuristic, not an official definition - LPR3 has no “inpatient” field. Pick and document your own threshold, and clarify with your data manager. The column names for start/end time and unit type vary between deliveries; verify them before use.
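Because the threshold is a choice, it can help to make it an explicit argument and see how the classification shifts. The sketch below is one way to do that on toy data. The helper classify_lpr3() is hypothetical, the column names are copied from the example above, and all values other than "skadestue" and "ATA1" are placeholders, so verify everything against your own extract.

```r
library(dplyr)

# Toy contacts (made up): 50 hours, 10 hours and 1.5 hours long
toy_k <- tibble(
  kont_starttidspunkt = as.POSIXct(
    c("2020-01-01 08:00", "2020-01-02 09:00", "2020-01-03 22:00"), tz = "UTC"),
  kont_sluttidspunkt  = as.POSIXct(
    c("2020-01-03 10:00", "2020-01-02 19:00", "2020-01-03 23:30"), tz = "UTC"),
  enhedstype_ans      = c("afdeling", "ambulatorium", "skadestue"),  # placeholder values
  prioritet           = c("ATA1", "ATA3", "ATA1")                    # "ATA3" is a placeholder
)

# Hypothetical helper: the cut-off is an argument so the choice is explicit
classify_lpr3 <- function(k, inpatient_hours = 8) {
  k %>%
    mutate(
      duration_hours = as.numeric(
        difftime(kont_sluttidspunkt, kont_starttidspunkt, units = "hours")
      ),
      patienttype = case_when(
        duration_hours >= inpatient_hours ~ "inpatient",
        enhedstype_ans == "skadestue" & prioritet == "ATA1" ~ "emergency",
        TRUE ~ "outpatient"
      )
    )
}

classify_lpr3(toy_k)$patienttype
# → "inpatient" "inpatient" "emergency"

classify_lpr3(toy_k, inpatient_hours = 12)$patienttype
# → "inpatient" "outpatient" "emergency"
```

The 10-hour contact changes class when the cut-off moves from 8 to 12 hours, which is the kind of sensitivity worth reporting. Note also that case_when() takes the first matching rule, so a long emergency-room stay is classified as inpatient here.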
Remove unwanted diagnoses
A few codes are administrative artefacts rather than the patient’s own disease, and should typically be removed from both outcomes and comorbidity:
- “Healthy companion” (someone admitted as a companion to another patient, e.g. a parent): ICD-10 DZ763. Related contact/observation codes with no disease: DZ032, DZ038, DZ039.
- “Diagnosis not found”/unspecified from the ICD-8 era: Y719.
alle_dx <- alle_dx %>%
filter(
!substr(toupper(c_diag), 1, 5) %in% c("DZ763", "DZ032", "DZ038", "DZ039"),
substr(toupper(c_diag), 1, 4) != "Y719"
)
What to remove depends on project and question - DZ03* (observation for suspected disease) is for instance relevant in some studies and should be kept there. The pattern is adapted from the Plana-Ripoll group’s code on OSF; verify against your own data.
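Since the removal list differs between projects, one option is to pass it as an argument rather than hard-coding it. The sketch below uses a hypothetical helper drop_codes() and toy data. It assumes the prefixes contain only letters and digits, so they can go into a regular expression unescaped.

```r
library(dplyr)

# Toy diagnoses (made up)
toy_dx <- tibble(
  pnr    = c("1", "2", "3", "4"),
  c_diag = c("DZ763", "DZ032", "DI219", "Y719")
)

# Hypothetical helper: drop rows whose code starts with any of the given prefixes
drop_codes <- function(dx, prefixes) {
  pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
  dx %>% filter(!grepl(pattern, toupper(c_diag)))
}

# The full list from the text above
strict <- drop_codes(toy_dx, c("DZ763", "DZ032", "DZ038", "DZ039", "Y719"))
strict$pnr
# → "3"

# A study where DZ03* is relevant keeps those rows
keep_dz03 <- drop_codes(toy_dx, c("DZ763", "Y719"))
keep_dz03$pnr
# → "2" "3"
```

Keeping the list in one named vector at the top of the script also documents the choice for reviewers.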
Next steps
You have now extracted diagnoses from two LPR generations. The next step is to shape and combine your extracts.
See also
- Understand LPR: structure, periods, D-prefix and diagnosis types
- Phase 6 - First extraction: step-by-step introduction to open_dataset, collect and saveRDS
- Overview of registers: confirmed column names for all LPR registers
- DST pitfalls: known issues with LPR on DST