Case-control
Nested case-control - sample controls for your cases
This page continues from Build your study population. In a nested case-control study, the group you identified on the landing page becomes your cases (those who developed the outcome), and their index date is the outcome date. For each case you sample a number of controls: people without the outcome who were at risk exactly when the case had the outcome.
Still under development. Structural guidance - adapt to your own project, and confirm package names (heaven::incidenceMatch, Epi::ccwc) on your project before use.
Functions used here. left_join() (attach a variable) and transmute() (create new columns, keep only those) are explained in Comparison cohort and Link your extracts. pmin(), if_else() and as.factor() are in the Function guide. New here: incidenceMatch() / Epi::ccwc() for sampling and survival::clogit() for the analysis.
Cohort or case-control? Both designs can answer the same question; case-control is especially efficient when the outcome is rare, or when a covariate is expensive to extract for the whole cohort. See Phase 1 - what type of study for the design choice.
Case-control vs. nested case-control - what is the difference?
- Classic case-control: you find cases (with the outcome) and select controls (without the outcome) from the population, often without a defined cohort with follow-up behind it. Exposure is typically ascertained retrospectively (e.g. interview) - vulnerable to recall and selection bias, and choosing controls is hard.
- Nested case-control: the case-control study sits inside a defined cohort (e.g. a registry cohort followed over time). Cases are those who develop the outcome during follow-up; controls are sampled from those who were at risk (outcome-free) exactly when the case had the outcome (risk-set / incidence-density sampling). Exposure comes from registry data collected before the outcome → less recall bias.
In register research at DST you almost always do the nested variant: the cohort already exists in the registers. Advantages: exposure is measured beforehand, and the odds ratio from incidence-density sampling estimates the rate/hazard ratio without a rare-outcome assumption. That is what this page shows.
The principle: incidence-density sampling
Controls are selected by risk-set / incidence-density sampling: for each case you draw controls who, on the case’s index date (= the outcome date), (1) have not yet developed the outcome and (2) meet the inclusion criteria. A person can be a control for one case and later become a case themselves - that is correct in this design (and is exactly what the risk set captures).
It is the same mechanic as the matching in Comparison cohort - only the “event” has changed: here it is the outcome (case), not an exposure. So this page also reuses the person table from there.
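To make the principle concrete, here is a minimal base-R sketch of a risk set for a single case. The toy cohort, its column names, and the helper risk_set() are hypothetical and only illustrate the idea; in practice the sampling functions in Step 4 do this for you.

```r
# Toy cohort (made-up data): one row per person
cohort <- data.frame(
  pnr          = 1:8,
  sex          = c("F", "F", "F", "F", "M", "M", "M", "M"),
  outcome_date = as.Date(c("2015-06-01", NA, "2018-03-01", NA, NA, NA, NA, NA)),
  end_fu       = as.Date(c("2015-06-01", "2014-01-01", "2018-03-01", "2022-12-31",
                           "2022-12-31", "2022-12-31", "2022-12-31", "2022-12-31"))
)

# Risk set for one case (hypothetical helper): same sex, still under follow-up
# on the case's index date, and no outcome on or before that date
risk_set <- function(case_pnr, data) {
  case <- data[data$pnr == case_pnr, ]
  at_risk <- data$pnr != case_pnr &
    data$sex == case$sex &
    data$end_fu >= case$outcome_date &
    (is.na(data$outcome_date) | data$outcome_date > case$outcome_date)
  data$pnr[at_risk]
}

rs_case1 <- risk_set(1, cohort) # person 3 is eligible: their own outcome comes later
rs_case3 <- risk_set(3, cohort) # person 1 is not: already a case by then

# Draw up to 2 controls for case 1 (index into the risk set to avoid the
# sample() pitfall when the risk set has length 1)
set.seed(1)
controls_case1 <- rs_case1[sample.int(length(rs_case1), size = min(2, length(rs_case1)))]
```

Person 3 appears in the risk set of case 1 and is later a case in their own right, which is the behaviour described above.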
Step 3 - Build the dataset for matching
(Steps 1 and 2 - identify cases and exclude prevalent cases - are on the landing page.)
You need one row per person with matching variables and the dates eligibility depends on. Build it exactly like the pool in Comparison cohort, Steps 3-4 (demographics + earliest_bef + death date + emigration date + lookback). For case-control, though:
- Exposure is not an eligibility criterion here (unlike the comparison cohort, where a control must be exposure-free at index). In case-control, exposure is exactly what you study: you measure it for both cases and controls afterwards and do not use it to decide who can be a control. A control just needs to be outcome-free (and meet your other criteria) at the case’s index.
- You need to know who is a case (developed the outcome) and when (outcome date = index).
Show the code: build the dataset for matching
#=====================================================
# Step 3: build the dataset for matching
#=====================================================
library(dplyr) # left_join, transmute, if_else, %>%
# 'population' = one row per person from Comparison cohort (pnr, sex, birth_year,
# death_date, emigration_date, earliest_bef_year). See Comparison cohort, Steps 3-4.
STUDY_END <- as.Date("2022-12-31") # we choose: last day of the study
outcome <- readRDS("path/to/outcome_dates.rds") # pnr, first_outcome_date (FIRST outcome; NA = never)
all_persons <- population %>%
  left_join(outcome, by = "pnr") %>% # add outcome date (NA = never outcome)
  transmute( # keep ONLY the columns we name
    pnr,
    # 1 = case (had outcome), 0 = potential control.
    # Assumes the outcome date falls within follow-up (on or before STUDY_END).
    case = if_else(!is.na(first_outcome_date), 1L, 0L),
    sex = as.factor(sex), # matching variable MUST be factor/character
    birth_year = as.factor(birth_year), # matching variable
    case_index = first_outcome_date, # the case index = outcome date (NA if never)
    # end_fu = when the person leaves the risk set: the EARLIEST of death, emigration,
    # own outcome (they become a case then) or study end. pmin = earliest.
    end_fu = pmin(death_date, emigration_date, first_outcome_date, STUDY_END, na.rm = TRUE)
  )
Step 4 - Sample controls
Two routes. Both do incidence-density sampling; pick one.
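Before sampling with either route, a few checks on the Step 3 table can catch problems early. The following is a minimal base-R sketch on a small made-up table; all_persons_toy and check_matching_input() are hypothetical names and not part of heaven or Epi.

```r
# Toy stand-in for 'all_persons' from Step 3 (made-up values)
all_persons_toy <- data.frame(
  pnr        = c(1, 2, 3, 4),
  case       = c(1L, 0L, 1L, 0L),
  case_index = as.Date(c("2015-06-01", NA, "2018-03-01", NA)),
  end_fu     = as.Date(c("2015-06-01", "2014-01-01", "2018-03-01", "2022-12-31"))
)

# Hypothetical helper: returns a named logical vector, TRUE = check passed
check_matching_input <- function(d) {
  cases <- d[d$case == 1L, ]
  c(
    one_row_per_person    = anyDuplicated(d$pnr) == 0,
    end_fu_complete       = !anyNA(d$end_fu),
    cases_have_index      = !anyNA(cases$case_index),
    controls_have_none    = all(is.na(d$case_index[d$case == 0L])),
    # a case whose outcome date lies after end_fu would never be in a risk set
    index_within_followup = isTRUE(all(cases$case_index <= cases$end_fu))
  )
}

checks <- check_matching_input(all_persons_toy)
print(checks)
```

Any FALSE points back to Step 3, for example an outcome date that lies after death, emigration or study end.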
Route A - heaven::incidenceMatch() (sibling of exposureMatch() from Comparison cohort, but the “event” is the outcome).
Show the code (Route A): sample with incidenceMatch()
#=====================================================
# Step 4 (Route A): sample with incidenceMatch()
#=====================================================
library(heaven) # incidenceMatch
# Argument (left of =) <- value we set (usually a column name in quotes)
matched <- incidenceMatch(
  ptid = "pnr", # the person-id column
  event = "case", # the column that is 1 for case (outcome), 0 for potential control
  terms = c("sex", "birth_year"), # matching variables (exact match; factor/character)
  data = all_persons, # the dataset from Step 3
  n.controls = 5, # controls per case
  case.index = "case_index", # the case's index date (the outcome date)
  end.followup = "end_fu", # when a person stops being eligible as a control
  seed = 20260620 # reproducibility
)
# Output: data.table with 'case.id' identifying each matched set (1 case + its controls)
Route B - Epi::ccwc() (risk-set sampling from entry and exit times; the name stands for case-control within cohort). Confirm arguments with ?Epi::ccwc.
Show the code (Route B): sample with Epi::ccwc()
#=====================================================
# Step 4 (Route B): sample with Epi::ccwc()
#=====================================================
library(Epi) # ccwc - case-control within cohort
# (dplyr from Step 3 is needed for mutate())
cohort <- all_persons %>%
  mutate(
    entry = as.Date("2010-01-01"), # study start (adapt; e.g. 18th birthday or first eligible date)
    exit = end_fu, # exit time = end of risk time (from Step 3)
    fail = case # 1 if the person is a case at exit
  )
set.seed(20260620) # reproducibility: ccwc() has no seed argument, so set it beforehand
ncc <- ccwc(
  entry = entry, exit = exit, fail = fail, # entry time, exit time, failure at exit
  controls = 5, # controls per case
  match = list(sex, birth_year), # matching variables
  include = list(pnr), # keep pnr in the output
  data = cohort
)
# Output: the matched-set id is called 'Set' here (not 'case.id')
Same rules as in Comparison cohort: apply inclusion/exclusion criteria BEFORE sampling and to everyone (cases + pool); set a seed; and decide whether to sample with or without replacement (a control can typically serve several cases), which must then be handled in the analysis. incidenceMatch() handles the risk-set time dimension via case.index + end.followup; ccwc() does so via entry + exit.
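After sampling it is worth inspecting the matched sets: how many controls each case actually received, and how often the same person was reused. The sketch below uses a made-up table in the Route A layout; the columns pnr, case.id and case are assumptions about the output, so check names(matched) on your own data (for Route B the set id is Set and the case flag is Fail).

```r
# Made-up matched data: 2 sets, 1 case + 2 controls each
matched_toy <- data.frame(
  case.id = c(1, 1, 1, 2, 2, 2),
  pnr     = c(101, 201, 202, 202, 201, 203),
  case    = c(1L, 0L, 0L, 1L, 0L, 0L)
)

# Cases and controls per matched set (fewer controls than requested = small risk set)
n_cases    <- tapply(matched_toy$case == 1L, matched_toy$case.id, sum)
n_controls <- tapply(matched_toy$case == 0L, matched_toy$case.id, sum)

# Persons used as a control more than once
times_control <- table(matched_toy$pnr[matched_toy$case == 0L])
reused <- names(times_control)[times_control > 1]

# Persons who are a control in one set and a case in another
control_then_case <- intersect(
  matched_toy$pnr[matched_toy$case == 0L],
  matched_toy$pnr[matched_toy$case == 1L]
)
```

In the toy data person 201 is a control twice and person 202 is first a control and later a case; both are expected under incidence-density sampling.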
Analysis (brief)
The analysis itself belongs to Regression (Phase 13) - here only briefly, so you know where this leads. Matched case-control data can be analyzed with conditional logistic regression (survival::clogit() with a strata() term per matched set), so the matching is respected; the odds ratio from an incidence-density design estimates the rate/hazard ratio. This is the standard approach for matched case-control studies. Before the analysis you attach exposure and covariates to the matched persons (from Phase 11/Phase 12). The matched-set id is called case.id (Route A) or Set (Route B).
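As a rough illustration of where this leads, the sketch below fits a conditional logistic regression on simulated matched sets. The data, the variable exposed and the chosen probabilities are invented for the example; only survival::clogit() and strata() are taken from the text above.

```r
library(survival) # clogit, strata

# Simulated matched data: 200 sets, each with 1 case and 5 controls
set.seed(42)
n_sets <- 200
sim <- data.frame(
  case.id = rep(seq_len(n_sets), each = 6),
  case    = rep(c(1L, 0L, 0L, 0L, 0L, 0L), times = n_sets)
)
# In this simulation exposure is more common among cases (50%) than controls (30%)
sim$exposed <- rbinom(nrow(sim), size = 1, prob = ifelse(sim$case == 1L, 0.5, 0.3))

# One stratum per matched set, so cases are compared only with their own controls
fit <- clogit(case ~ exposed + strata(case.id), data = sim)

or <- exp(coef(fit))    # odds ratio for exposed vs. unexposed
ci <- exp(confint(fit)) # 95% confidence interval
```

With Route B output the stratum variable would be Set instead of case.id.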
What now?
You have a matched dataset (cases + controls) with a matched-set id (case.id or Set). Use pnr to extract exposure and covariates for everyone in the dataset:
cc_pnrs <- unique(matched$pnr) # all pnr's (cases + controls) for the other extractions (Route B: ncc$pnr)
| Extraction | Phase |
|---|---|
| Exposure + covariates | Phase 11 |
| Comorbidity (NMI) | NMI |
| Assemble into one analysis dataset | Phase 12 |
| Conditional logistic regression | Phase 13 |
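When exposure is attached, it is typically assessed relative to the index date of the matched set, so that controls are judged at the same point in time as their case. The base-R sketch below shows one way to do that; the tables, the column first_exp_date and the rule "exposed if first exposure is before the set index" are assumptions for illustration, and Phase 11 has the actual extraction.

```r
# Made-up matched set: 1 case + 2 controls
matched_toy <- data.frame(
  case.id    = c(1, 1, 1),
  pnr        = c(101, 201, 202),
  case       = c(1L, 0L, 0L),
  case_index = as.Date(c("2015-06-01", NA, NA))
)
# Hypothetical exposure extract: first exposure date (persons not listed = never exposed)
exposure <- data.frame(
  pnr            = c(101, 202),
  first_exp_date = as.Date(c("2012-01-15", "2016-09-01"))
)

# 1) Give every member of a set the index date of its case
set_index <- matched_toy[matched_toy$case == 1L, c("case.id", "case_index")]
names(set_index)[2] <- "set_index"
cc <- merge(matched_toy, set_index, by = "case.id")

# 2) Attach exposure; count it only if it started before the set's index date
cc <- merge(cc, exposure, by = "pnr", all.x = TRUE)
cc$exposed <- as.integer(!is.na(cc$first_exp_date) & cc$first_exp_date < cc$set_index)
cc <- cc[order(cc$case.id, cc$pnr), ]
```

Here person 202 is first exposed in 2016, after the set's index date in 2015, and therefore counts as unexposed in this set.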
See also
- Build your study population: identify your cases (Step 1) and exclude prevalent cases (Step 2)
- Comparison cohort: same risk-set mechanic for a cohort design; build the pool and understand replacement there
- Phase 1 - Study preparation: cohort vs. case-control
- Heide-Jørgensen et al. (2018), Clinical Epidemiology 10:1325-1337: sampling strategies and bias in risk-set sampling