Build your study population

Identify your cases/exposed - then choose your study design

Published

July 2, 2026

The later phases - extracting variables (Phase 11), assembling the dataset (Phase 12) and analysis (Phase 13) - assume you already have a cohort: a table with pnr and index_date per person. This page shows how to identify your study population and give each person an index date; after that you choose a study design.

Tip

In short: first you identify your cases/exposed and give each an index date. After that, the path depends on your design; the options are listed under "What now - choose your design" at the end of this page.

What is the index date?

The index date is the point in time that marks the start of follow-up for a given person.

  • Exposed: the date the person received the exposure (e.g. surgery date, diagnosis date, first prescription dispensed)
  • Comparator cohort: the index date assigned from the matched exposed person

Everything that follows - outcome date, covariates at baseline, follow-up time - is calculated relative to the index date. The definition of the index date is therefore crucial for study validity: it fixes what counts as baseline and what counts as follow-up.
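
As a minimal sketch of what "relative to the index date" means in practice, the toy example below derives a baseline window, a prevalence flag and follow-up time from index_date. The data are simulated, and STUDY_END and the one-year lookback are hypothetical choices made for illustration, not values from this guide.

```r
library(dplyr)

# Simulated toy data: one row per person
cohort_toy <- tibble(
  pnr          = c("001", "002", "003"),
  index_date   = as.Date(c("2015-03-01", "2016-07-15", "2018-01-10")),
  outcome_date = as.Date(c("2017-03-01", NA, "2017-12-01"))
)

STUDY_END <- as.Date("2020-12-31")   # hypothetical end of follow-up

cohort_toy <- cohort_toy %>%
  mutate(
    baseline_start = index_date - 365,                                 # hypothetical 1-year lookback
    outcome_before = !is.na(outcome_date) & outcome_date < index_date, # outcome before index
    end_date       = pmin(outcome_date, STUDY_END, na.rm = TRUE),      # outcome or study end
    followup_days  = as.integer(end_date - index_date)                 # days from index to end
  )

print(cohort_toy)
```

Person 003 has an outcome before the index date, so follow-up time comes out negative. That is the situation Step 2 below removes.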

Warning

This page is still under development. The code needs further review and testing before being used directly. Use it as structural guidance and adapt to your own project.

Note

Functions used here. inner_join() and bind_rows() are already shown in Extract from LPR and explained in detail in Link your extracts. group_by() + slice() is explained in Long ↔︎ wide format. New in Step 2: anti_join(), distinct(), pull() plus nrow(), cat() and stopifnot() - see Function guide.


Step 1 - Identify the exposed (or cases)

You scan the register that defines your exposure. You do not yet have a cohort_pnrs list - you query the entire register and filter on the exposure criterion.

The exposure can be defined in many ways:

Type                              | Example                          | Register
Surgery / procedure (SKS code)    | Bariatric surgery KJDF10/KJDF11  | lpr_sksopr [a] (LPR2), procedurer_kirurgi [a] (LPR3)
Hospital diagnosis (ICD code)     | Type 2 diabetes E11              | lpr_diag + lpr_adm, lpr_a_diagnose + lpr_a_kontakt
Medication exposure (ATC code)    | Metformin A10BA02                | LMDB
Clinical measurement / biomarker  | BMI > 35, HbA1c > 75 mmol/mol    | Project-specific data / OSDC / DBSO

[a] lpr_sksopr and procedurer_kirurgi are the names on the DARTER project (708421) - see Overview of registers. Names may vary on other projects.

Example A: SKS codes (surgery/procedure)

lpr_sksopr holds the procedure code (c_opr) + recnum, but not pnr or date; those live in lpr_adm. So you join the two on recnum to get person + date + procedure together - exactly the same pattern as diagnoses (contact + diagnosis register), see Understand LPR.

#=====================================================
# Step 1A: identify the exposed via SKS codes
#=====================================================
library(arrow)    # open_dataset
library(dplyr)    # filter, select, group_by, slice, ungroup, bind_rows, mutate

# Adapt these codes to your study
RYGB     <- c("KJDF10", "KJDF11")                     # Roux-en-Y gastric bypass
SG       <- c("KJDF40", "KJDF41", "KJDF96", "KJDF97") # sleeve gastrectomy
BS_CODES <- c(RYGB, SG)                               # combined vector

#-----------------------------------------------------
# LPR2: procedures up to 2018/2019
#-----------------------------------------------------
lpr_sksopr <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_sksopr/") %>%
  rename_with(tolower)
lpr_adm    <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>%
  rename_with(tolower)

exp_lpr2 <- lpr_sksopr %>%
  filter(c_opr %in% !!BS_CODES) %>%                        # only bariatric procedures
  select(recnum, sks_code = c_opr) %>%                      # recnum is the join key to lpr_adm
  inner_join(
    lpr_adm %>% select(pnr, recnum, index_date = d_inddto), # attach pnr and date
    by = "recnum"
  ) %>%
  collect()

#-----------------------------------------------------
# LPR3: procedures from 2019 onwards
#-----------------------------------------------------
proc_kir <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/procedurer_kirurgi/") %>%
  rename_with(tolower)
lpr_a_k  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>%
  rename_with(tolower)

exp_lpr3 <- proc_kir %>%
  filter(procedurekode %in% !!BS_CODES) %>%                 # only bariatric procedures
  select(dw_ek_forloeb, sks_code = procedurekode) %>%
  inner_join(
    lpr_a_k %>% select(pnr, dw_ek_forloeb, index_date = kont_starttidspunkt),
    by = "dw_ek_forloeb"
  ) %>%
  collect() %>%
  mutate(index_date = as.Date(index_date))                  # datetime → date (as.Date() uses UTC unless tz = is given)

#-----------------------------------------------------
# Combine and take one procedure per person (the first)
#-----------------------------------------------------
# The result is called "exposed" - only the exposed group, NOT the full cohort yet
exposed <- bind_rows(exp_lpr2, exp_lpr3) %>%
  group_by(pnr) %>%                                        # group per person
  arrange(index_date) %>%                                   # oldest date first
  slice(1) %>%                                              # one procedure per person (the first)
  ungroup() %>%                                             # release grouping (see Phase 12)
  mutate(exposed = 1L)                                      # mark as exposed (1 = yes)

nrow(exposed)                                              # number of unique operated individuals

Tip

exposed contains only the operated individuals. The full cohort (exposed + comparator cohort) is built on Comparison cohort and saved as cohort. It is cohort - not exposed - that you use as cohort_pnrs in the other phases.
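
Before moving on, it can be worth checking that the de-duplication did what you expect. The sketch below runs on simulated rows that stand in for the combined LPR2 + LPR3 result; the object names (raw_toy, exposed_toy, checks) are hypothetical. It drops rows without a date, keeps the first procedure per person and verifies that there is one row per person.

```r
library(dplyr)

# Simulated stand-in for bind_rows(exp_lpr2, exp_lpr3) before de-duplication
raw_toy <- tibble(
  pnr        = c("001", "001", "002", "003"),
  sks_code   = c("KJDF10", "KJDF11", "KJDF40", "KJDF41"),
  index_date = as.Date(c("2012-05-01", "2014-02-01", "2019-09-10", NA))
)

exposed_toy <- raw_toy %>%
  filter(!is.na(index_date)) %>%              # a missing date cannot serve as index date
  group_by(pnr) %>%
  arrange(index_date, .by_group = TRUE) %>%   # sort within each person
  slice(1) %>%                                # first procedure per person
  ungroup()

checks <- exposed_toy %>%
  summarise(
    n_rows      = n(),
    n_persons   = n_distinct(pnr),
    first_index = min(index_date),
    last_index  = max(index_date)
  )

stopifnot(checks$n_rows == checks$n_persons)  # one row per person
print(checks)
```

The range of index dates is a quick way to spot a register that was left out, for example no dates after 2019 if the LPR3 part is missing.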

Example B: ICD diagnosis as exposure criterion

#=====================================================
# Step 1B: identify the exposed via ICD diagnosis
#=====================================================
# Same LPR pattern as Phase 9 - but without semi_join(tibble(pnr = cohort_pnrs), by = "pnr"),
# as the cohort does not yet exist. You query the full population to identify the exposed.
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>%
  rename_with(tolower)
lpr_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>%
  rename_with(tolower)

exposed <- lpr_adm %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),
             substr(c_diag, 2, 4) == "E11") %>%             # T2D: "DE11" → strip D-prefix
      select(recnum, c_diag),
    by = "recnum"
  ) %>%
  select(pnr, index_date = d_inddto) %>%
  collect() %>%
  group_by(pnr) %>%
  arrange(index_date) %>%
  slice(1) %>%                                              # first diagnosis per person
  ungroup() %>%
  mutate(exposed = 1L)

Note

This example pulls from LPR2 only. Always consider which registers your outcome/exposure needs - LPR2 (possibly + psychiatry) and LPR3 - and combine them as in 9b: Extract from LPR. If you leave out a register, you miss the cases that only appear there.
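
A minimal sketch of the combining step, assuming each register extract has already been reduced to pnr + index_date. The two toy tables are simulated and their names are hypothetical; the real LPR3 extract follows the pattern in 9b: Extract from LPR.

```r
library(dplyr)

# Simulated first-diagnosis extracts, one per register generation
exp_lpr2_toy <- tibble(
  pnr        = c("001", "002"),
  index_date = as.Date(c("2010-04-01", "2016-11-20"))
)
exp_lpr3_toy <- tibble(
  pnr        = c("002", "003"),
  index_date = as.Date(c("2019-06-05", "2020-01-15"))
)

exposed_toy <- bind_rows(lpr2 = exp_lpr2_toy, lpr3 = exp_lpr3_toy, .id = "source") %>%
  group_by(pnr) %>%
  slice_min(index_date, n = 1, with_ties = FALSE) %>%   # earliest date across both registers
  ungroup() %>%
  mutate(exposed = 1L)

print(count(exposed_toy, source))   # how many index dates come from each register
```

Person 003 appears only in the LPR3 extract and would be missed with LPR2 alone. Person 002 appears in both and keeps the earlier LPR2 date.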


Step 2 - Exclude prevalent cases

People who already had your outcome before the index date must be excluded - otherwise they count as new cases even though they are not. This applies to all designs, including a prevalence study without a comparison group, so it belongs here where you build the exposed cohort.

Count how many drop out at each exclusion step, so you can describe the attrition and later draw a STROBE flow diagram (also called a participant or attrition diagram; see the STROBE checklist). The generic N-counting template is in Phase 6.
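
One way to keep the counts together is a small attrition table that grows by one row per step. This is only a sketch on simulated data: add_step() is a hypothetical helper written for this example, not the template from Phase 6.

```r
library(dplyr)

# Hypothetical helper: append the number of unique persons after a step
add_step <- function(log, step, data) {
  bind_rows(log, tibble(step = step, n = n_distinct(data$pnr)))
}

# Simulated exposed group with a flag for a prior outcome
exposed_toy <- tibble(
  pnr           = sprintf("%03d", 1:10),
  prior_outcome = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

attrition <- tibble(step = character(), n = integer())
attrition <- add_step(attrition, "Step 1: exposed", exposed_toy)

exposed_toy <- exposed_toy %>% filter(!prior_outcome)
attrition   <- add_step(attrition, "Step 2: no prior outcome", exposed_toy)

# Number excluded at each step = previous n minus current n
attrition <- attrition %>% mutate(excluded = dplyr::lag(n) - n)
print(attrition)
```

The resulting table has one row per box in the flow diagram, so the numbers in the figure can be copied from a single object.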

Helper function for excluding prevalent cases

If the same check is needed more than once (e.g. also on a comparison group later, see Comparison cohort), wrap the logic in a function so the two checks do not drift apart. See Good coding practice.

Show the code: exclude prevalent cases
#=====================================================
# Step 2: exclude prevalent cases
#=====================================================
# diagnoses = your LPR extract from Phase 9 (columns: pnr, date_contact, icd3)
OUTCOME_CODES <- c("G30", "F00", "F01", "F02", "F03")   # ICD-10 for your outcome - adapt

# We make our OWN function (a reusable block of code). It takes three arguments:
#   diagnoses = your LPR diagnosis extract (columns: pnr, date_contact, icd3)
#   codes     = the ICD codes that define your outcome (here OUTCOME_CODES)
#   cohort    = the group to check, with pnr + index_date
# It returns the pnr that had the outcome BEFORE their own index date.
prior_outcome_pnrs <- function(diagnoses, codes, cohort) {
  diagnoses %>%
    filter(icd3 %in% codes) %>%
    inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%   # attach each person's index
    filter(date_contact < index_date) %>%                            # only contacts BEFORE index
    distinct(pnr) %>%                                                 # one row per person
    pull(pnr)                                                         # vector of pnr's to exclude
}

prevalent <- prior_outcome_pnrs(diagnoses, OUTCOME_CODES, exposed)

n_before <- nrow(exposed)                     # nrow() = number of rows (here: number of persons)
exposed <- exposed %>%
  filter(!pnr %in% prevalent)                 # keep only those WITHOUT a prevalent outcome

# cat() simply prints a readable line in the console so you can follow the attrition.
cat("After prevalence exclusion:", nrow(exposed),
    "| excluded:", n_before - nrow(exposed), "\n")
# If the message is only for yourself, you can settle for:  nrow(exposed)   # number remaining

stopifnot(n_distinct(exposed$pnr) == nrow(exposed))   # check: one row per person

Tip

nrow(), cat() and stopifnot() are explained in the Function guide.
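
The exclusion above uses filter(!pnr %in% prevalent). The same result can be obtained with anti_join() if the prevalent persons are kept as a one-column table instead of a vector. The sketch below shows both on simulated data; all object names ending in _toy are hypothetical.

```r
library(dplyr)

diagnoses_toy <- tibble(
  pnr          = c("001", "002", "002", "003"),
  date_contact = as.Date(c("2010-01-05", "2016-03-01", "2012-08-20", "2015-06-01")),
  icd3         = c("G30", "F03", "I10", "G30")
)
exposed_toy <- tibble(
  pnr        = c("001", "002", "003", "004"),
  index_date = as.Date(c("2012-01-01", "2015-01-01", "2015-06-01", "2018-02-02"))
)
codes_toy <- c("G30", "F03")

# Persons with an outcome contact strictly BEFORE their own index date
prevalent_toy <- diagnoses_toy %>%
  filter(icd3 %in% codes_toy) %>%
  inner_join(exposed_toy, by = "pnr") %>%
  filter(date_contact < index_date) %>%
  distinct(pnr)                                   # kept as a table, not pulled to a vector

kept_filter <- exposed_toy %>% filter(!(pnr %in% prevalent_toy$pnr))
kept_anti   <- exposed_toy %>% anti_join(prevalent_toy, by = "pnr")

stopifnot(identical(kept_filter$pnr, kept_anti$pnr))   # both approaches agree
print(kept_anti)
```

Only person 001 is excluded. Person 003 has the outcome on the index date itself and is kept because the comparison is strict (<); whether same-day contacts should count as prevalent depends on your own definition.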

Tip

Planning a comparison cohort? Then extract the outcome (and exposure) dates for the whole population at this stage - not just for the exposed - and reuse them: restricted to the exposed in Step 2 (as above), and later for the comparison pool (see Comparison cohort). The outcome extract on the diagnosis codes returns everyone with the diagnosis anyway, so you save an extra query against the registers.
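
A sketch of the "extract once, reuse" idea on simulated data. The first outcome date per person is computed once for everyone, and a hypothetical helper, flag_prevalent(), is then applied to any group that has pnr + index_date. The index dates in pool_toy are made up; in a real study they come from the matching step.

```r
library(dplyr)

# Simulated outcome contacts for the whole population
outcome_all_toy <- tibble(
  pnr          = c("001", "003", "005", "005"),
  date_contact = as.Date(c("2010-01-05", "2019-06-01", "2011-02-02", "2013-03-03"))
)

# Computed once: first outcome date per person
first_outcome_toy <- outcome_all_toy %>%
  group_by(pnr) %>%
  summarise(outcome_date = min(date_contact), .groups = "drop")

# Hypothetical helper: flag persons whose first outcome lies before their index date
flag_prevalent <- function(group, first_outcome) {
  group %>%
    left_join(first_outcome, by = "pnr") %>%
    mutate(prevalent = !is.na(outcome_date) & outcome_date < index_date)
}

exposed_toy <- tibble(
  pnr        = c("001", "002"),
  index_date = as.Date(c("2012-01-01", "2015-01-01"))
)
pool_toy <- tibble(
  pnr        = c("003", "004", "005"),
  index_date = as.Date(rep("2012-01-01", 3))
)

exposed_flagged <- flag_prevalent(exposed_toy, first_outcome_toy)   # reuse 1: the exposed
pool_flagged    <- flag_prevalent(pool_toy, first_outcome_toy)      # reuse 2: the comparison pool
print(exposed_flagged)
print(pool_flagged)
```

The first outcome date is enough for the prevalence check: if any outcome contact lies before the index date, the first one does too.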


Participant flow diagram (STROBE)

The N counts you made in Step 1 and Step 2 come together in a STROBE flow diagram: a figure showing how many people were included at the start, how many were excluded at each step and why, and how many ended up in the study population. It is expected in almost every observational study. The diagram below also shows where the data comes from at each step (the numbers are simulated).

flowchart TD
    A[("Source registers - whole population<br>LPR2: lpr_sksopr + lpr_adm<br>LPR3: procedurer_kirurgi + lpr_a_kontakt")]:::store
    B["Step 1: filter on exposure codes<br>exposed with pnr + index_date<br>n = 12,300"]:::step
    E1["Excluded in Step 2:<br>prevalent outcome before index<br>(from LPR diagnoses)<br>n = 1,150"]:::excl
    C["Exposed cohort<br>n = 11,150"]:::step
    F["Optional (Phase 10a):<br>+ matched comparison cohort"]:::optional
    D["Final study population<br>pnr + index_date<br>(→ Phase 11)"]:::result

    A --> B
    B --> C
    B -.->|attrition| E1
    C -.-> F
    C --> D
    F -.-> D

    classDef store fill:#eef0f2,stroke:#8a94a6,color:#1f2733;
    classDef step fill:#eaf2fb,stroke:#4a78b5,color:#173a5e;
    classDef excl fill:#fdecea,stroke:#d9534f,color:#7a1f1a;
    classDef optional fill:#fff3e0,stroke:#e69500,color:#7a4f00;
    classDef result fill:#e9f7ef,stroke:#3fae6b,color:#14532d;

How each step maps to the data:

  • Source registers: the whole population’s LPR data (you have no cohort list yet). LPR2 and LPR3 are opened lazily with open_dataset() - see Step 1 above.
  • Step 1: you filter on your exposure codes and attach pnr + index_date. The result is exposed.
  • Step 2 (exclusion): prevalent cases - people with the outcome before their index date - are found in the LPR diagnoses and removed. That is the n_before - nrow(exposed) shown in the exclusion box.
  • Final study population: pnr + index_date per person, ready for Phase 11 - Extract variables. If you build a comparison cohort, Phase 10a is inserted before the last step.

Warning

Output control: the exclusion boxes show raw counts. Small numbers (e.g. an exclusion group with very few people) can be disclosive - round or combine them before the diagram leaves DST. See Phase 14 - Export and repatriation.
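
A minimal sketch of masking small counts before they go into the diagram. MIN_CELL is a hypothetical threshold and mask_small() a hypothetical helper; check the disclosure rules that apply to your own project before relying on either.

```r
# Hypothetical threshold - NOT an official rule, adapt to your project's requirements
MIN_CELL <- 5L

# Hypothetical helper: replace counts between 1 and MIN_CELL - 1 with "<MIN_CELL"
mask_small <- function(n, min_cell = MIN_CELL) {
  ifelse(n > 0 & n < min_cell,
         paste0("<", min_cell),
         format(n, big.mark = ",", trim = TRUE))
}

flow_counts <- c(start = 12300L, excluded = 1150L, rare_exclusion = 3L)  # simulated
masked <- mask_small(flow_counts)
print(masked)
```

Masking a single box may not be enough if the hidden number can be recovered by subtracting the other boxes from the total, which is one reason to round or combine groups instead.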


Tip

What now - choose your design:

  • Prevalence study / cohort without a comparison group: you are done after this page, since you now have your study population with an index date. Continue to Phase 11 - Extract variables and Phase 12 - Assemble and prepare the dataset.
  • Cohort study with a comparison cohort: after this page continue to → Comparison cohort, where you build a match pool and risk-set match a comparison group.
  • (Nested) case-control: after this page continue to → Case-control, where your “exposed” group above is instead your cases, and you sample controls who were at risk when the case had the outcome.
