flowchart TD
E["Epidemiological studies"]:::neutral
O["Observational<br>- register research lives here"]:::active
X["Experimental"]:::ref
D["Descriptive"]:::active
A["Analytical"]:::active
CC["Case-control"]:::active
CO["Cohort"]:::active
R["RCT<br>- requires intervention,<br>not possible with register data"]:::ref
E --> O
E --> X
O --> D
O --> A
A --> CC
A --> CO
X --> R
classDef neutral fill:#eef0f2,stroke:#8a94a6,color:#1f2733;
classDef active fill:#eaf2fb,stroke:#4a78b5,color:#173a5e;
classDef ref fill:#f6f6f6,stroke:#cccccc,color:#999999;
Plan your study
Before opening R - define your question, cohort, and data model
Register-based research does not start in R. It starts with pen and paper. This page guides you through the things you should have in place before writing a single line of code.
In short: Settle four things on paper before you code - a precise research question, your data model (which registers cover exposure, outcome and covariates), your covariates chosen with a DAG, and your comparison cohort.
What type of study are you doing?
Almost all register-based research is observational and analytical - you observe what has already happened, without intervening. The two classic analytical designs are case-control and cohort; Phase 10 shows how to build a matched cohort study. Randomised controlled trials (RCTs) require an intervention, so they cannot be done with register data alone and are included here only for the overview.
Case-control or cohort - what is the difference?
The two analytical designs differ in which end you start from:
| | Cohort | Case-control |
|---|---|---|
| Starting point | Exposure | Outcome |
| Direction | Follows forward: exposed → outcome | Looks back: case → prior exposure |
| Best when | Exposure is rare; multiple outcomes | Outcome is rare; single outcome |
| Effect measure | Incidence, relative risk (RR), hazard ratio | Odds ratio (OR) |
| In registers | Define exposed group + comparator cohort, follow forward | Find all cases, select controls, look back at exposure |
A cohort study follows people forward in time from the index date and counts how many develop the outcome - which is why you can compute incidence and risk. It is well suited when you have multiple outcomes (cf. the alle_dx approach in Extract from LPR).
A case-control study starts from those who already have the outcome and matches them with controls without it - efficient for rare outcomes, but the sampled data alone cannot give you absolute risk.
With register data you can do both, because the entire population’s history is available. Phase 10 shows a matched cohort study step by step.
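To make the difference in effect measures concrete, here is a minimal Python sketch with made-up counts (all numbers and function names are hypothetical, chosen only for illustration). It computes the relative risk and odds ratio in a full cohort, then draws a case-control sample from the same people (all cases, 10% of the non-cases) and shows that the odds ratio survives the sampling while a "risk" computed from the sample does not.

```python
def risk_ratio(a, b, c, d):
    """Relative risk from a 2x2 table.

    a = exposed with outcome,   b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    Only meaningful when each row is a complete group followed forward (cohort).
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds ratio from the same 2x2 table; usable under cohort and case-control sampling."""
    return (a * d) / (b * c)


# Hypothetical full cohort: 1,000 exposed and 10,000 unexposed people.
cohort = dict(a=20, b=980, c=100, d=9900)

# Hypothetical case-control sample from that cohort: all 120 cases, 10% of the non-cases.
case_control = dict(a=20, b=98, c=100, d=990)

rr_cohort = risk_ratio(**cohort)
or_cohort = odds_ratio(**cohort)
or_case_control = odds_ratio(**case_control)
naive_rr = risk_ratio(**case_control)  # not a valid risk ratio: the denominators are sampled

print(f"Cohort RR:                 {rr_cohort:.2f}")
print(f"Cohort OR:                 {or_cohort:.2f}")
print(f"Case-control OR:           {or_case_control:.2f}")
print(f"'RR' from case-control:    {naive_rr:.2f}")
```

With these counts the cohort RR is 2.0 and both odds ratios are about 2.02: because the outcome is rare, the OR approximates the RR, and sampling controls leaves it unchanged. The "RR" computed from the case-control sample differs, which is the practical meaning of "cannot compute absolute risk".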
Key concepts
Before planning a study it is worth knowing these terms - they are used throughout the guide.
Cohort: a group of people followed over time because they share a particular characteristic at a particular point in time. Example: all patients who underwent bariatric surgery in the period 2010–2020.
Index date: the start date of follow-up - the point from which you begin counting. For operated patients this is typically the date of surgery. Matched comparators are assigned the same date as the operated patient they are matched to.
Exposure: the factor whose effect you are investigating - e.g. a surgery, a medication, or a diagnosis.
Outcome: what you are measuring - e.g. onset of a disease, a hospitalisation, or death.
Covariates: variables you include to account for confounding - factors that affect both exposure and outcome. Examples: age, sex, comorbidity, socioeconomic status.
1. What do I want to investigate?
Formulate your research question precisely before looking at any data. A vague question produces a messy dataset. A precise question produces a clear plan.
Ask yourself:
| Question | Example |
|---|---|
| Who is my population? | All adults with type 2 diabetes (T2D) in Denmark, 2010–2020 |
| What is my exposure? | Bariatric surgery |
| What is my outcome? | Dementia |
| When does follow-up start? | Date of surgery (index date) |
| When does it end? | Diagnosis, death, emigration, or end of study period |
| Which confounders should be adjusted for? | Age, sex, comorbidity, socioeconomic status (SES) |
If your question is causal, you can sketch a target trial
When you ask whether an exposure changes an outcome, it often helps to picture the randomised trial you would run if you could (the target trial) and to treat your register study as an attempt to emulate it. The rows above are roughly that trial’s protocol. Thinking this way early helps you settle time zero: the point where eligibility, exposure assignment and start of follow-up should coincide. When they drift apart, you can get immortal time bias. (The framing fits effect questions, not purely descriptive or prediction studies.)
The seven things a target trial pins down (and where they live in this guide)
Spell each one out the way a trial protocol would:
- Eligibility: who can enter, judged only on information available at time zero (nothing from the future).
- Treatment strategies: the exposure and its comparator, each defined clearly enough that a person could in principle be assigned either (Comparison cohort).
- Assignment: a real trial randomises; a register study instead adjusts for the confounders that decided who got exposed (your DAG).
- Outcome: defined the same way in both groups.
- Time zero: where eligibility, assignment and follow-up start together (see the immortal-time warning above).
- Causal contrast (estimand): which effect you want, e.g. intention-to-treat (effect of being assigned a strategy) vs per-protocol (effect of actually following it).
- Analysis plan: model and adjustment set, fixed before you see results (§6).
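The time-zero item can be illustrated with a small Python sketch. It assumes one common way of handling the problem, namely splitting each person's follow-up at the exposure date so that time before exposure counts as unexposed; the function name and dates are hypothetical. If the whole period were instead attributed to the exposed group, the days before surgery would be "immortal": the person had to survive them event-free to become exposed at all.

```python
from datetime import date


def split_person_time(eligible_from, exposure_date, end_of_follow_up):
    """Return (unexposed_days, exposed_days) for one person.

    Time between becoming eligible and the exposure date is counted as
    unexposed, never as exposed. exposure_date may be None (never exposed).
    """
    if exposure_date is None or exposure_date >= end_of_follow_up:
        return (end_of_follow_up - eligible_from).days, 0
    start_exposed = max(exposure_date, eligible_from)
    unexposed_days = (start_exposed - eligible_from).days
    exposed_days = (end_of_follow_up - start_exposed).days
    return unexposed_days, exposed_days


# Hypothetical person: eligible from 2010, operated in 2012, followed to 2015.
unexposed, exposed = split_person_time(
    date(2010, 1, 1), date(2012, 1, 1), date(2015, 1, 1)
)
print(f"Unexposed days: {unexposed}, exposed days: {exposed}")
```

Here 730 days belong to the unexposed period and 1096 to the exposed period. Counting all 1826 days as exposed would add two outcome-free years to the exposed group by construction.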
2. Which registers cover what?
Before mapping your data model it is useful to know which registers exist.
| What do you need to find? | Register |
|---|---|
| Demographics (age, sex, residence) | BEF - Population Register |
| Hospital diagnoses and contacts | LPR - National Patient Register (LPR2 + LPR3) |
| Dispensed prescriptions | LMDB - Prescription Register |
| Date of death (for censoring) | DODSAARS - Death Register |
| Emigration (for censoring) | VNDS - Migration Register |
| Education | UDDA - Education Register |
| Income | FAIK - Family Income Register |
| Employment | AKM - Labour Classification Module |
A complete description of all registers with column names and join keys is in Overview of registers →
3. Choose your covariates using a DAG
Which variables should you adjust for? The answer is not “as many as possible”. Adjusting for the wrong variables can introduce bias rather than remove it.
A DAG (directed acyclic graph - a causal diagram) is a drawing of your assumptions about how exposure, outcome and other variables relate to each other. It makes your assumptions explicit and helps you choose the right set of covariates.
Rules of thumb:
- Adjust for confounders: variables that affect both exposure and outcome (e.g. age, comorbidity).
- Do NOT adjust for mediators: variables that lie on the causal pathway between exposure and outcome (this removes part of the effect you want to measure).
- Do NOT adjust for colliders: common effects of two variables (this opens a spurious association).
- You also “condition” by selection, not just by adjusting: restricting your population by a variable, or losing people through it, counts too. Building a cohort of only hospitalised patients, or analysing only those with complete follow-up, can open the same spurious path a collider would. That is selection bias; the censoring and dropout version is handled by inverse probability of censoring weighting (IPCW).
What dagitty.net identifies for you are the backdoor paths: the non-causal routes from exposure to outcome that start with an arrow pointing into the exposure (they run “against the arrows”, typically through a common cause). Confounding is an open backdoor path, and a minimal adjustment set is a group of variables that blocks all of them without opening a new path through a collider, and from which no variable can be dropped. One case the simple rules do not cover: a confounder that is itself affected by past exposure (treatment-confounder feedback) cannot be handled by ordinary adjustment at all; see Time-varying variables.
Example: surgery and dementia - a DAG with confounder, mediator and collider
A concrete example: does surgery affect the risk of dementia?
- Age is a confounder - it affects both the probability of surgery and of dementia. Adjust for it.
- Delirium (post-operative delirium) is a mediator - it lies on the path surgery → delirium → dementia. Do not adjust - that removes part of the effect you want to measure.
- Hospitalisation is a collider - both surgery and dementia lead to hospitalisation. Do not adjust - it opens a spurious association.
You can paste the model straight into dagitty.net and have the minimal adjustment set computed:
dag {
Age [pos="0,-1"]
Surgery [exposure, pos="-1.5,0"]
Delirium [pos="0,0"]
Dementia [outcome, pos="1.5,0"]
Hospitalisation [pos="0,1"]
Age -> Surgery
Age -> Dementia
Surgery -> Delirium
Delirium -> Dementia
Surgery -> Hospitalisation
Dementia -> Hospitalisation
}
For this DAG the minimal adjustment set is {Age} - you only need to adjust for age.
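To see why {Age} is enough, the path rules can be applied by hand or with a few lines of code. The Python sketch below is a simplified illustration, not a replacement for dagitty.net: it lists every path between exposure and outcome in the example DAG and reports which non-causal paths are open for a given adjustment set. All function names are hypothetical. It uses the standard blocking rule: a path is blocked by an adjusted non-collider, or by a collider that is neither adjusted for nor has an adjusted descendant.

```python
# Edges of the example DAG above, written as (parent, child).
EDGES = [
    ("Age", "Surgery"), ("Age", "Dementia"),
    ("Surgery", "Delirium"), ("Delirium", "Dementia"),
    ("Surgery", "Hospitalisation"), ("Dementia", "Hospitalisation"),
]


def descendants(node, edges):
    """All nodes reachable from node by following the arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, goal, edges):
    """All paths from start to goal, ignoring arrow direction, no node repeated."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            paths.append(path)
            continue
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def is_open(path, adjusted, edges):
    """Apply the blocking rule to every intermediate node on the path."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if collider:
            # A collider blocks unless it, or one of its descendants, is adjusted for.
            if not (({mid} | descendants(mid, edges)) & adjusted):
                return False
        elif mid in adjusted:
            # An adjusted non-collider (confounder or mediator) blocks the path.
            return False
    return True


def open_noncausal_paths(exposure, outcome, adjusted, edges=EDGES):
    """Open paths that are not fully directed from exposure to outcome."""
    edge_set = set(edges)
    result = []
    for path in simple_paths(exposure, outcome, edges):
        causal = all((a, b) in edge_set for a, b in zip(path, path[1:]))
        if not causal and is_open(path, adjusted, edges):
            result.append(path)
    return result


for adjustment_set in (set(), {"Age"}, {"Age", "Hospitalisation"}):
    print(sorted(adjustment_set), open_noncausal_paths("Surgery", "Dementia", adjustment_set))
```

With nothing adjusted, the path through Age is open; adjusting for Age closes it; adding Hospitalisation opens the collider path. The sketch only reports open non-causal paths, so it will not warn you that adjusting for the mediator Delirium removes part of the effect.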
Tools
- dagitty.net: draw your diagram in the browser; it automatically calculates the minimal set of covariates to adjust for.
- Causal Diagrams: Draw Your Assumptions Before Your Conclusions: free HarvardX course by Miguel Hernán on exactly this.
- Background: Hernán & Robins, Causal Inference: What If (free PDF) - also in Learning resources.
A DAG tells you what to adjust for - three conditions decide whether adjustment can recover a causal effect at all
Choosing the right covariates is necessary but not sufficient. For an adjusted estimate to carry a causal meaning, three conditions must hold (Hernán & Robins, What If, ch. 3). They underlie every method in this guide (regression, Cox, inverse probability (IP) weighting), not just the advanced ones:
- Exchangeability (no unmeasured confounding): once you adjust for the covariates in your DAG, the assumption is that the exposed and unexposed are comparable. This is an assumption you can never fully verify.
- Positivity (overlap): within every combination of those covariates, both exposure and non-exposure actually occur. If some people could never be exposed (or never unexposed), there is no one to compare them with. Very large IP weights are the practical warning sign.
- Consistency (a well-defined exposure): the exposure corresponds to a clear enough intervention that a “what if everyone were / were not exposed” world is meaningful. A concrete exposure (“this surgery on this date”) satisfies it far better than a vague one (“obesity”).
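Of the three conditions, positivity is the one you can partly inspect in the data. A minimal Python sketch, assuming categorical covariates and a hypothetical list of records (the column names and the function are made up for illustration): it cross-tabulates covariate strata against exposure and returns the strata in which only exposed or only unexposed people occur.

```python
from collections import Counter


def positivity_violations(rows, covariates, exposure="exposed"):
    """Return covariate strata that contain only exposed or only unexposed people."""
    counts = Counter()
    for row in rows:
        stratum = tuple(row[name] for name in covariates)
        counts[(stratum, bool(row[exposure]))] += 1
    strata = {stratum for stratum, _ in counts}
    # Counter returns 0 for combinations that never occurred.
    return sorted(
        s for s in strata if counts[(s, True)] == 0 or counts[(s, False)] == 0
    )


# Hypothetical records: nobody aged 80+ was operated.
rows = [
    {"age_group": "18-39", "sex": "F", "exposed": True},
    {"age_group": "18-39", "sex": "F", "exposed": False},
    {"age_group": "40-79", "sex": "M", "exposed": True},
    {"age_group": "40-79", "sex": "M", "exposed": False},
    {"age_group": "80+", "sex": "M", "exposed": False},
    {"age_group": "80+", "sex": "M", "exposed": False},
]

violations = positivity_violations(rows, ["age_group", "sex"])
print(violations)
```

An empty stratum in a finite sample is not proof that exposure is impossible there, but it does mean the data contain no comparison for those people. Restricting the population or coarsening the covariate are typical responses to consider.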
4. The comparator cohort
Many studies compare an exposed group with a comparator cohort. How you build it is a design decision to be made on paper - before writing code.
Things to consider:
- Who is an appropriate comparator? Either an active comparator (unexposed people with the same indication, e.g. the same underlying disease but a different or no treatment - reduces confounding by indication) or a matched background population (maximal contrast). The choice depends on the question; expanded in Comparison cohort.
- Index date for the comparator cohort. Your exposed cohort has an index date determined by the exposure (e.g. the surgery date). The comparator cohort does not - it must be assigned a date, typically the same date as the matched exposed person, so both groups are followed from a comparable point in time.
- Eligibility at index. The comparator cohort must meet the inclusion criteria on their assigned index date - otherwise you risk immortal time bias (a distortion that arises when follow-up includes time during which a person, by design, could not have experienced the outcome, for example time they had to survive event-free in order to be included or classified as exposed).
- Matching variables and ratio. E.g. age, sex and calendar year; decide the ratio (e.g. 1:5).
- Can anyone in the comparator cohort become exposed later? E.g. can a person who started as a control later undergo surgery? Decide what happens in that case - whether they remain a control or transfer to the exposed group.
- The same exclusions are applied to both groups.
→ The complete pattern for cohort construction and matching is in Phase 10 - Build your study population.
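The index-date and eligibility decisions above can be written down as a rule before any real data are involved. The Python sketch below is an illustration under stated assumptions, not the Phase 10 pattern: the data structures, field names and functions are hypothetical. Each comparator inherits the index date of its matched exposed person and is kept only if, on that date, they are alive, resident, outcome-free and not yet exposed.

```python
from datetime import date


def eligible_at(index_date, person):
    """True if the person is alive, resident, outcome-free and unexposed on index_date."""
    for key in ("death", "emigration", "first_outcome", "exposure_date"):
        event_date = person.get(key)
        if event_date is not None and event_date <= index_date:
            return False
    return True


def assign_index_dates(matched_sets, people):
    """Give each comparator the index date of its matched exposed person.

    matched_sets: list of (exposed_index_date, [comparator_id, ...])
    people:       comparator_id -> dict of event dates (missing key = never happened)
    Returns (kept, dropped), where kept maps comparator_id -> assigned index date.
    """
    kept, dropped = {}, []
    for index_date, comparator_ids in matched_sets:
        for pid in comparator_ids:
            if eligible_at(index_date, people[pid]):
                kept[pid] = index_date
            else:
                dropped.append(pid)
    return kept, dropped


# Hypothetical comparators for one exposed person operated on 1 June 2015.
people = {
    "c1": {},                                    # no events: eligible
    "c2": {"death": date(2014, 11, 3)},          # died before index: not eligible
    "c3": {"exposure_date": date(2018, 2, 14)},  # operated later: eligible at index
}
matched_sets = [(date(2015, 6, 1), ["c1", "c2", "c3"])]

kept, dropped = assign_index_dates(matched_sets, people)
print("kept:", kept)
print("dropped:", dropped)
```

Comparator c3 is eligible at index but is operated in 2018, which is exactly the case where you must decide in advance whether they stay in the comparator group or are censored or moved at that date. In a real study it is usually cleaner to apply the eligibility check when sampling comparators rather than dropping them afterwards.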
5. Get an overview - pen and paper
Before opening R, answer these questions in writing:
- Which variables do I need? (patient information - age, sex, diagnoses etc. - and for which years)
- Which registers contain this information? (LPR, BEF, LMDB, …)
- In what order should data be assembled? (define population → extract outcome → extract covariates)
A solid overview on paper saves many hours of debugging in code.
Example: overview for a dementia study
Population:  Adults who have undergone bariatric surgery (identified via the Danish Obesity
             Treatment Database - DBSO), 2010–2024
             Matched comparators from the Population Register (BEF)
Outcome:     First dementia diagnosis (LPR - ICD-10: F00–F03, G30–G31)
             Date: first contact with a dementia code after the surgery date
Covariates:  Age and sex (BEF)
             Comorbidity (LPR - 5-year lookback, i.e. diagnoses in the 5 years before index date)
             Education (UDDA)
             Income (FAIK via BEF familie_id)
             Employment (AKM)
Censoring:   Death (DODSAARS)
             Emigration (VNDS)
             End of study period (31 Dec 2024)
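The outcome and censoring rows of such an overview translate into one rule: follow-up ends at whichever comes first of outcome, death, emigration and end of study, and only the outcome counts as an event. A minimal Python sketch of that rule, with hypothetical function and field names and the study end date taken from the example above:

```python
from datetime import date

STUDY_END = date(2024, 12, 31)  # end of study period from the overview above


def follow_up(index_date, outcome=None, death=None, emigration=None, study_end=STUDY_END):
    """Return end date, event indicator, reason and length of follow-up for one person.

    Follow-up stops at the earliest of the supplied dates. If the outcome and a
    censoring date fall on the same day, the outcome wins because it is listed first.
    """
    candidates = [
        ("outcome", outcome),
        ("death", death),
        ("emigration", emigration),
        ("study_end", study_end),
    ]
    reason, end = min(
        ((name, d) for name, d in candidates if d is not None),
        key=lambda pair: pair[1],
    )
    if end <= index_date:
        raise ValueError("follow-up would end on or before index; check eligibility at index")
    return {
        "end": end,
        "event": reason == "outcome",
        "reason": reason,
        "days": (end - index_date).days,
    }


# Hypothetical person 1: dementia diagnosis before death -> event.
p1 = follow_up(date(2015, 3, 1), outcome=date(2021, 5, 10), death=date(2023, 1, 1))
# Hypothetical person 2: emigrates without a diagnosis -> censored.
p2 = follow_up(date(2015, 3, 1), emigration=date(2018, 3, 1))
print(p1)
print(p2)
```

The `ValueError` is a deliberate design choice in this sketch: a person whose follow-up would end on or before the index date should have been excluded when eligibility was checked, so the function refuses to return a silent zero or negative follow-up.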
6. Write an analysis plan
An analysis plan is a document you write before looking at your data. It forces you to commit to design, statistics and variables before results can colour your decisions.
Use the STROBE checklist as a skeleton: STROBE Statement - checklists →
For register-based studies, RECORD extends STROBE with items on routinely collected data, and RECORD-PE covers pharmacoepidemiology specifically: RECORD-PE (EQUATOR Network) →
Pre-register your analysis plan, for example on OSF - this is good scientific practice and is asked for by some journals: Open Science Framework - registration templates
Power and sample size. Even large registers have limited power (the ability to detect an effect that is really there) for rare outcomes or small subgroups. At the planning stage, work out the smallest effect you could realistically detect with your expected number of events. The pwr package does simple power/sample-size calculations; for survival and rate designs it is often the number of events (not the number of people) that drives your power.
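To illustrate why events rather than people drive power in survival designs, the Python sketch below uses Schoenfeld's approximation for the number of events needed to detect a given hazard ratio under proportional hazards. This is a technique swapped in for illustration, not a function of the pwr package, and the function name and defaults are assumptions. The required number of events is (z for alpha + z for power) squared, divided by p(1-p)(ln HR) squared, where p is the fraction of the cohort that is exposed.

```python
from math import ceil, log
from statistics import NormalDist


def events_needed(hazard_ratio, alpha=0.05, power=0.80, exposed_fraction=0.5):
    """Approximate number of outcome events needed (Schoenfeld's formula).

    Two-sided test at level alpha; exposed_fraction is the share of the cohort
    that is exposed (0.5 for 1:1 matching, 1/6 for 1:5 matching).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p = exposed_fraction
    return ceil((z_alpha + z_power) ** 2 / (p * (1 - p) * log(hazard_ratio) ** 2))


events_1_to_1 = events_needed(1.5)                        # equal group sizes
events_1_to_5 = events_needed(1.5, exposed_fraction=1/6)  # five comparators per exposed
print(f"HR 1.5, 1:1 groups: {events_1_to_1} events")
print(f"HR 1.5, 1:5 groups: {events_1_to_5} events")
```

For a hazard ratio of 1.5 with 80% power, this gives 191 events with equal groups. Note that the formula contains no cohort size at all: a register with millions of people but only a few dozen outcome events is still underpowered for that effect.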
7. Next steps
Once you have your overview in place:
- New to R? → Phase 2 - R: the bare essentials
- Ready for the DST server? → Phase 3 - Log in to DST
- Working on DARTER / project 708421? → Read this first: DARTER - overview and pipeline