Case-control

Nested case-control - sample controls for your cases

Published

July 2, 2026

This page continues from Build your study population. In a nested case-control study, the group you defined on the landing page becomes your cases (those who developed the outcome), and each case’s index date is the outcome date. For each case you sample a number of controls: people who were still at risk (outcome-free and under follow-up) on the date the case had the outcome.

Warning

Still under development. Structural guidance - adapt to your own project, and confirm package names (heaven::incidenceMatch, Epi::ccwc) on your project before use.

Note

Functions used here. left_join() (attach a variable) and transmute() (create new columns, keep only those) are explained in Comparison cohort and Link your extracts. pmin(), if_else() and as.factor() are in the Function guide. New here: incidenceMatch() / Epi::ccwc() for sampling and survival::clogit() for the analysis.

Note

Cohort or case-control? Both designs can answer the same question; case-control is especially efficient when the outcome is rare, or when a covariate is expensive to extract for the whole cohort. See Phase 1 - what type of study for the design choice.

Case-control vs. nested case-control - what is the difference?
  • Classic case-control: you find cases (with the outcome) and select controls (without the outcome) from the population, often without a defined cohort with follow-up behind it. Exposure is typically ascertained retrospectively (e.g. interview) - vulnerable to recall and selection bias, and choosing controls is hard.
  • Nested case-control: the case-control study sits inside a defined cohort (e.g. a registry cohort followed over time). Cases are those who develop the outcome during follow-up; controls are sampled from those who were at risk (outcome-free) exactly when the case had the outcome (risk-set / incidence-density sampling). Exposure comes from registry data collected before the outcome → less recall bias.

In register research at DST you almost always use the nested variant, because the cohort already exists in the registers. It has two advantages: exposure is recorded before the outcome, and the odds ratio from incidence-density sampling estimates the rate/hazard ratio without a rare-outcome assumption. The nested variant is what this page shows.
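Why no rare-outcome assumption is needed can be sketched in two lines. The notation is introduced here for illustration only and is not part of the original page: at a case’s outcome time t, N_1(t) exposed and N_0(t) unexposed persons are at risk, with hazards λ_1(t) and λ_0(t). The sketch treats the risk set as large, so removing the case itself does not change the exposed/unexposed proportions.

```latex
\[
\underbrace{\frac{\Pr(\text{case exposed})}{\Pr(\text{case unexposed})}}_{\text{exposure odds among cases}}
  = \frac{\lambda_1(t)\,N_1(t)}{\lambda_0(t)\,N_0(t)},
\qquad
\underbrace{\frac{\Pr(\text{control exposed})}{\Pr(\text{control unexposed})}}_{\text{exposure odds among controls}}
  = \frac{N_1(t)}{N_0(t)}
\]
\[
\mathrm{OR}(t)
  = \frac{\lambda_1(t)\,N_1(t) \,/\, \bigl(\lambda_0(t)\,N_0(t)\bigr)}{N_1(t) \,/\, N_0(t)}
  = \frac{\lambda_1(t)}{\lambda_0(t)}
\]
```

The risk-set sizes cancel, so the exposure odds ratio equals the hazard ratio at t regardless of how common the outcome is.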


The principle: incidence-density sampling

Controls are selected by risk-set / incidence-density sampling: for each case you draw controls who, on the case’s index date (= the outcome date), (1) have not yet developed the outcome, (2) are still under follow-up (alive, not emigrated) and (3) meet the inclusion criteria. A person can be a control for one case and later become a case themselves - that is correct in this design, because the risk set on any date contains everyone still at risk on that date, future cases included.

It is the same mechanism as the matching in Comparison cohort - only the “event” has changed: here it is the outcome (case), not an exposure. This page therefore reuses the person table from there.
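To make the risk-set idea concrete, here is a minimal base-R sketch on invented toy data. It is not how incidenceMatch() or ccwc() are implemented: sample_risk_set() is a hypothetical helper, time is in plain days, and there is no matching on sex or birth year.

```r
# Toy cohort (hypothetical data): everyone enters on day 0.
# exit = day the person leaves the risk set; case = 1 if they leave because of the outcome.
cohort <- data.frame(
  pnr  = 1:8,
  exit = c(10, 15, 20, 30, 40, 50, 60, 70),
  case = c( 0,  0,  1,  0,  1,  0,  0,  0)
)

# sample_risk_set() is a hypothetical helper, not part of any package.
sample_risk_set <- function(cohort, n_controls = 2, seed = 1) {
  set.seed(seed)                                   # reproducible draws
  cases <- cohort[cohort$case == 1, ]
  sets <- lapply(seq_len(nrow(cases)), function(i) {
    t_case <- cases$exit[i]
    # Risk set: everyone still under follow-up at t_case, except the case itself.
    # People who become cases later (exit > t_case) are deliberately included.
    eligible <- cohort$pnr[cohort$exit >= t_case & cohort$pnr != cases$pnr[i]]
    k <- min(n_controls, length(eligible))
    controls <- eligible[sample.int(length(eligible), k)]
    data.frame(set = i, pnr = c(cases$pnr[i], controls),
               case = c(1L, rep(0L, k)), index = t_case)
  })
  do.call(rbind, sets)
}

matched_toy <- sample_risk_set(cohort)
matched_toy   # one row per person per set; 'index' is the case's outcome day
```

In this toy cohort person 5 (a case on day 40) is eligible as a control for person 3 (a case on day 20), which is the "control first, case later" situation described above.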


Step 3 - Build the dataset for matching

(Steps 1 and 2 - identify cases and exclude prevalent cases - are on the landing page.)

You need one row per person with matching variables and the dates eligibility depends on. Build it exactly like the pool in Comparison cohort, Steps 3-4 (demographics + earliest_bef + death date + emigration date + lookback). For case-control, though:

  • Exposure is not an eligibility criterion here (unlike the comparison cohort, where a control must be exposure-free at index). In case-control, exposure is exactly what you study: you measure it for both cases and controls afterwards and do not use it to decide who can be a control. A control just needs to be outcome-free (and meet your other criteria) at the case’s index.
  • You need to know who is a case (developed the outcome) and when (outcome date = index).

Show the code: build the dataset for matching
#=====================================================
# Step 3: build the dataset for matching
#=====================================================
library(dplyr)   # %>%, left_join(), transmute(), if_else()

# 'population' = one row per person from Comparison cohort (pnr, sex, birth_year,
#                death_date, emigration_date, earliest_bef_year). See Comparison cohort, Steps 3-4.
STUDY_END <- as.Date("2022-12-31")                # we choose: last day of the study
outcome <- readRDS("path/to/outcome_dates.rds")   # pnr, first_outcome_date (FIRST outcome; NA = never)

all_persons <- population %>%
  left_join(outcome, by = "pnr") %>%              # add outcome date (NA = never outcome)
  transmute(                                       # keep ONLY the columns we name
    pnr,
    sex        = as.factor(sex),                   # matching variable MUST be factor/character
    birth_year = as.factor(birth_year),            # matching variable
    # end_fu = when the person leaves the risk set: the EARLIEST of death, emigration,
    #          own outcome (they become a case then) or study end. pmin = earliest.
    end_fu     = pmin(death_date, emigration_date, first_outcome_date, STUDY_END, na.rm = TRUE),
    # 1 = case, 0 = potential control. An outcome only counts if it happens while the
    # person is still under follow-up, i.e. the outcome date IS the end of follow-up.
    case       = if_else(!is.na(first_outcome_date) & first_outcome_date == end_fu, 1L, 0L),
    case_index = if_else(case == 1L, first_outcome_date, as.Date(NA))  # outcome date (NA if not a case)
  )
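
A quick way to convince yourself that end_fu behaves as intended is to run the same pmin() logic on a few invented rows. This is a base-R sketch with hypothetical dates; the column outcome_in_fu is a name made up for this illustration.

```r
# Hypothetical toy rows; all dates are invented for illustration only.
STUDY_END <- as.Date("2022-12-31")
toy <- data.frame(
  pnr                = 1:4,
  death_date         = as.Date(c(NA, "2019-05-01", NA, NA)),
  emigration_date    = as.Date(c(NA, NA, "2015-03-10", NA)),
  first_outcome_date = as.Date(c("2018-06-15", NA, "2020-01-01", NA))
)

# Earliest of death, emigration, own outcome and study end; NA dates are ignored.
toy$end_fu <- pmin(toy$death_date, toy$emigration_date,
                   toy$first_outcome_date, STUDY_END, na.rm = TRUE)

# TRUE only when the outcome is what ended follow-up.
toy$outcome_in_fu <- !is.na(toy$first_outcome_date) &
                     toy$first_outcome_date == toy$end_fu

toy[, c("pnr", "end_fu", "outcome_in_fu")]
```

Person 3 is the instructive row: the recorded outcome (2020) falls after emigration (2015), so follow-up ends at emigration and the outcome did not occur while the person was at risk in the cohort.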

Step 4 - Sample controls

Two routes. Both do incidence-density sampling; pick one.

Route A - heaven::incidenceMatch() (sibling of exposureMatch() from Comparison cohort, but the “event” is the outcome).

Show the code (Route A): sample with incidenceMatch()
#=====================================================
# Step 4 (Route A): sample with incidenceMatch()
#=====================================================
library(heaven)   # incidenceMatch

# Argument (left of =) <- value we set (usually a column name in quotes)
matched <- incidenceMatch(
  ptid         = "pnr",                  # the person-id column
  event        = "case",                 # the column that is 1 for case (outcome), 0 for potential control
  terms        = c("sex", "birth_year"), # matching variables (exact match; factor/character)
  data         = all_persons,            # the dataset from Step 3
  n.controls   = 5,                      # controls per case
  case.index   = "case_index",           # the case's index date (the outcome date)
  end.followup = "end_fu",               # when a person stops being eligible as a control
  seed         = 20260620                # reproducibility
)
# Output: data.table with 'case.id' identifying each matched set (1 case + its controls)

Route B - Epi::ccwc() (risk-set / incidence-density sampling from entry and exit times). Confirm arguments with ?Epi::ccwc.

Show the code (Route B): sample with Epi::ccwc()
#=====================================================
# Step 4 (Route B): sample with Epi::ccwc()
#=====================================================
library(dplyr) # %>%, mutate()
library(Epi)   # ccwc - case-control within cohort

cohort <- all_persons %>%
  mutate(
    entry = as.Date("2010-01-01"),                # study start (adapt; e.g. 18th birthday or first eligible date)
    exit  = end_fu,                               # exit time = end of risk time (from Step 3)
    # 1 only if the person's outcome date IS the exit date (a case at exit), otherwise 0
    fail  = as.integer(case == 1L & !is.na(case_index) & case_index == end_fu)
  )

set.seed(20260620)                                # ccwc() draws controls at random; set the seed first

ncc <- ccwc(
  entry    = entry, exit = exit, fail = fail,     # entry time, exit time, failure at exit
  controls = 5,                                   # controls per case
  match    = list(sex, birth_year),               # matching variables
  include  = list(pnr),                           # keep pnr in the output
  data     = cohort
)
# Output: the matched-set id is called 'Set' here (not 'case.id');
#         check ?Epi::ccwc for the remaining output columns before relying on them

Note

Same rules as in Comparison cohort: apply inclusion/exclusion criteria BEFORE sampling and to everyone (cases + pool); set a seed; and decide whether controls are sampled with or without replacement across sets (typically a person may serve as a control for several cases), then account for that choice in the analysis. incidenceMatch() itself handles the risk-set time dimension via case.index + end.followup.
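
After sampling it is worth checking what you actually got. The sketch below uses a hypothetical helper, check_matched_sets(), on invented data; it assumes a long table with one row per person per set, a set-id column, a pnr column and a 0/1 case column. The real column names in your output may differ, so confirm them first.

```r
# check_matched_sets() is a hypothetical helper, not part of heaven or Epi.
check_matched_sets <- function(d, set_col, id_col = "pnr", case_col = "case") {
  is_case <- d[[case_col]] == 1
  list(
    # distribution of the number of cases per set (should be all 1)
    cases_per_set     = table(tapply(is_case, d[[set_col]], sum)),
    # distribution of the number of controls per set (fewer than requested = sparse risk set)
    controls_per_set  = table(tapply(!is_case, d[[set_col]], sum)),
    # persons used as a control in more than one set
    reused_controls   = sum(table(d[[id_col]][!is_case]) > 1),
    # persons who appear as a control in one set and as a case in another
    control_then_case = length(intersect(d[[id_col]][is_case], d[[id_col]][!is_case]))
  )
}

# Invented example: 3 sets; person 22 is reused, person 23 is a control and later a case.
toy_matched <- data.frame(
  case.id = c(1, 1, 1, 2, 2, 2, 3, 3),
  pnr     = c(11, 21, 22, 12, 22, 23, 23, 24),
  case    = c(1, 0, 0, 1, 0, 0, 1, 0)
)
res <- check_matched_sets(toy_matched, set_col = "case.id")
res
```

For Route B the same helper could be called with set_col = "Set" and the name of the case indicator column in the ccwc() output (check ?Epi::ccwc for that name). Sets with fewer controls than requested and heavily reused controls are both worth reporting.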


Analysis (brief)

The analysis itself belongs to Regression (Phase 13) - here only briefly, so you know where this leads. Matched case-control data can be analyzed with conditional logistic regression (survival::clogit() with a strata() term per matched set), so the matching is respected; the odds ratio from an incidence-density design estimates the rate/hazard ratio. This is the standard approach for matched case-control studies. Before the analysis you attach exposure and covariates to the matched persons (from Phase 11/Phase 12). The matched-set id is called case.id (Route A) or Set (Route B).
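
As a pointer to what that analysis looks like, here is a minimal sketch of survival::clogit() on invented data. The data frame toy_cc and its columns (set, case, exposed) are hypothetical; in your project the strata variable is case.id or Set, and the model will normally include covariates.

```r
library(survival)  # clogit(), strata()

# Hypothetical toy data: 8 matched sets, each with 1 case (first row) + 2 controls.
toy_cc <- data.frame(
  set     = rep(1:8, each = 3),
  case    = rep(c(1, 0, 0), times = 8),
  exposed = c(1, 0, 0,   1, 1, 0,   1, 0, 0,   0, 1, 0,
              1, 0, 1,   0, 0, 0,   1, 1, 1,   0, 0, 1)
)

# Conditional logistic regression: one stratum per matched set.
fit <- clogit(case ~ exposed + strata(set), data = toy_cc)

or <- exp(coef(fit))      # odds ratio for exposure (about 2.73 in this toy data)
ci <- exp(confint(fit))   # 95% confidence interval on the OR scale
or
ci
```

Sets in which the case and all controls share the same exposure status (sets 6 and 7 here) contribute nothing to the estimate; only discordant sets are informative.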


What now?

You have a matched dataset (cases + controls) with a matched-set id (case.id or Set). Use pnr to extract exposure and covariates for everyone in the dataset:

cc_pnrs <- unique(matched$pnr)   # all pnr's (cases + controls) for the other extractions
                                 # (Route B: unique(ncc$pnr))

Extraction                            Phase
Exposure + covariates                 Phase 11
Comorbidity (NMI)                     NMI
Assemble into one analysis dataset    Phase 12
Conditional logistic regression       Phase 13
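
When you attach exposure, remember that a control takes the index date of its case, and that exposure must be measured before that date. The base-R sketch below uses invented data; the column names index and first_exposure_date and the "strictly before index" rule are assumptions for illustration, so adapt them to your own exposure definition.

```r
# Hypothetical matched data: every row carries the index date of its set's case.
matched_toy <- data.frame(
  case.id = c(1, 1, 1, 2, 2, 2),
  pnr     = c(11, 21, 22, 12, 22, 23),
  case    = c(1, 0, 0, 1, 0, 0),
  index   = as.Date(c(rep("2015-04-01", 3), rep("2018-09-15", 3)))
)

# Hypothetical exposure extract: first exposure date per person (no row = never exposed).
exposure <- data.frame(
  pnr                 = c(11, 22, 23),
  first_exposure_date = as.Date(c("2014-01-10", "2016-06-01", "2019-02-01"))
)

# Left join on pnr: keep every matched row, add the exposure date where one exists.
cc <- merge(matched_toy, exposure, by = "pnr", all.x = TRUE)

# Exposed only if the first exposure lies strictly before the set's index date.
cc$exposed <- as.integer(!is.na(cc$first_exposure_date) &
                         cc$first_exposure_date < cc$index)

cc[order(cc$case.id, -cc$case), ]
```

Person 22 appears in both sets and is unexposed in set 1 (index 2015) but exposed in set 2 (index 2018), which is why exposure has to be evaluated per row and not once per person.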
