Plan your study

Before opening R - define your question, cohort, and data model

Published

July 2, 2026

Register-based research does not start in R. It starts with pen and paper. This page guides you through the things you should have in place before writing a single line of code.

Tip

In short: Settle four things on paper before you code - a precise research question, your data model (which registers cover exposure, outcome and covariates), your covariates chosen with a DAG, and your comparison cohort.


What type of study are you doing?

Almost all register-based research is observational and analytical - you observe what has already happened, without intervening. The two classic analytical designs are case-control and cohort; Phase 10 shows how to build a matched cohort study. Randomised trials (RCT) cannot be done with register data and are included here only for the overview.

flowchart TD
    E["Epidemiological studies"]:::neutral
    O["Observational<br>- register research lives here"]:::active
    X["Experimental"]:::ref
    D["Descriptive"]:::active
    A["Analytical"]:::active
    CC["Case-control"]:::active
    CO["Cohort"]:::active
    R["RCT<br>- requires intervention,<br>not possible with register data"]:::ref

    E --> O
    E --> X
    O --> D
    O --> A
    A --> CC
    A --> CO
    X --> R

    classDef neutral fill:#eef0f2,stroke:#8a94a6,color:#1f2733;
    classDef active fill:#eaf2fb,stroke:#4a78b5,color:#173a5e;
    classDef ref fill:#f6f6f6,stroke:#cccccc,color:#999999;

Case-control or cohort - what is the difference?

The two analytical designs differ in which end you start from:

Cohort Case-control
Starting point Exposure Outcome
Direction Follows forward: exposed → outcome Looks back: case → prior exposure
Best when Exposure is rare; multiple outcomes Outcome is rare; single outcome
Effect measure Incidence, relative risk (RR), hazard ratio Odds ratio (OR)
In registers Define exposed group + comparator cohort, follow forward Find all cases, select controls, look back at exposure

Cohort follows persons forward in time from the index date and measures how many develop the outcome - which is why you can compute incidence and risk. Well suited when you have multiple outcomes (cf. the alle_dx approach in Extract from LPR).

Case-control starts from those who already have the outcome and matches them with controls without it - efficient for rare outcomes, but cannot compute absolute risk.

With register data you can do both, because the entire population’s history is available. Phase 10 shows a matched cohort study step by step.


Key concepts

Before planning a study it is worth knowing these terms - they are used throughout the guide.

Cohort A group of people followed over time because they share a particular characteristic at a particular point in time. Example: all patients who underwent bariatric surgery in the period 2010–2020.

Index date The start date of follow-up - the point from which you begin counting. For operated patients this is typically the date of surgery. For matched comparators the same date as the matched operated patient is assigned.

Exposure The factor whose effect you are investigating - e.g. a surgery, a medication, or a diagnosis.

Outcome What you are measuring - e.g. onset of a disease, a hospitalisation, or death.

Covariates Variables you include to account for confounding - factors that affect both exposure and outcome. Examples: age, sex, comorbidity, socioeconomic status.


1. What do I want to investigate?

Formulate your research question precisely before looking at any data. A vague question produces a messy dataset. A precise question produces a clear plan.

Ask yourself:

Question Example
Who is my population? All adults with T2D in Denmark, 2010–2020
What is my exposure? Bariatric surgery
What is my outcome? Dementia
When does follow-up start? Date of surgery (index date)
When does it end? Diagnosis, death, emigration, or end of study period
Which confounders should be adjusted for? Age, sex, comorbidity, SES
Tip

If your question is causal, you can sketch a target trial. When you ask whether an exposure changes an outcome, it often helps to picture the randomised trial you would run if you could (the target trial) and to treat your register study as an attempt to emulate it. The rows above are roughly that trial’s protocol. Thinking this way early helps you settle time zero: the point where eligibility, exposure assignment and start of follow-up should coincide. When they drift apart, you can get immortal time bias. (The framing fits effect questions, not purely descriptive or prediction studies.)

The seven things a target trial pins down (and where they live in this guide)

Spell each one out the way a trial protocol would:

  • Eligibility: who can enter, judged only on information available at time zero (nothing from the future).
  • Treatment strategies: the exposure and its comparator, each defined clearly enough that a person could in principle be assigned either (Comparison cohort).
  • Assignment: a real trial randomises; a register study instead adjusts for the confounders that decided who got exposed (your DAG).
  • Outcome: defined the same way in both groups.
  • Time zero: where eligibility, assignment and follow-up start together (see the immortal-time warning above).
  • Causal contrast (estimand): which effect you want, e.g. intention-to-treat (effect of being assigned a strategy) vs per-protocol (effect of actually following it).
  • Analysis plan: model and adjustment set, fixed before you see results (§6).
Background: Hernán & Robins, Causal Inference: What If (free PDF), §3.6 and ch. 22; and Hernán MA, Robins JM (2016), “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available”, Am J Epidemiol 183(8):758-764.

2. Which registers cover what?

Before mapping your data model it is useful to know which registers exist.

What do you need to find? Register
Demographics (age, sex, residence) BEF - Population Register
Hospital diagnoses and contacts LPR - National Patient Register (LPR2 + LPR3)
Dispensed prescriptions LMDB - Prescription Register
Date of death (for censoring) DODSAARS - Death Register
Emigration (for censoring) VNDS - Migration Register
Education UDDA - Education Register
Income FAIK - Family Income Register
Employment AKM - Labour Classification Module

A complete description of all registers with column names and join keys is in Overview of registers →


3. Choose your covariates using a DAG

Which variables should you adjust for? The answer is not “as many as possible”. Adjusting for the wrong variables can introduce bias rather than remove it.

A DAG (directed acyclic graph - a causal diagram) is a drawing of your assumptions about how exposure, outcome and other variables relate to each other. It makes your assumptions explicit and helps you choose the right set of covariates.

Rules of thumb:

  • Adjust for confounders: variables that affect both exposure and outcome (e.g. age, comorbidity).
  • Do NOT adjust for mediators: variables that lie on the causal pathway between exposure and outcome (this removes part of the effect you want to measure).
  • Do NOT adjust for colliders: common effects of two variables (this opens a spurious association).
  • You also “condition” by selection, not just by adjusting: restricting your population by a variable, or losing people through it, counts too. Building a cohort of only hospitalised patients, or analysing only those with complete follow-up, can open the same spurious path a collider would. That is selection bias; the censoring and dropout version is handled by IPCW.

What dagitty.net computes for you is the backdoor paths: the indirect routes from exposure to outcome that run “against the arrows” through a common cause. Confounding is an open backdoor path, and the minimal adjustment set is the smallest group of variables that blocks all of them without opening a new one through a collider. One case the simple rules do not cover: a confounder that is itself affected by past exposure (treatment-confounder feedback) cannot be handled by ordinary adjustment at all, see Time-varying variables.

Example: surgery and dementia - a DAG with confounder, mediator and collider

A concrete example: does surgery affect the risk of dementia?

Causal diagram with five variables: surgery (exposure), dementia (outcome), age (confounder), delirium (mediator) and hospitalisation (collider).
Figure 1: DAG of surgery → dementia with a confounder (age), a mediator (delirium) and a collider (hospitalisation).
  • Age is a confounder - it affects both the probability of surgery and of dementia. Adjust for it.
  • Delirium (post-operative delirium) is a mediator - it lies on the path surgery → delirium → dementia. Do not adjust - that removes part of the effect you want to measure.
  • Hospitalisation is a collider - both surgery and dementia lead to hospitalisation. Do not adjust - it opens a spurious association.

You can paste the model straight into dagitty.net and have the minimal adjustment set computed:

dag {
  Age            [pos="0,-1"]
  Surgery        [exposure, pos="-1.5,0"]
  Delirium       [pos="0,0"]
  Dementia       [outcome,  pos="1.5,0"]
  Hospitalisation [pos="0,1"]
  Age      -> Surgery
  Age      -> Dementia
  Surgery  -> Delirium
  Delirium -> Dementia
  Surgery  -> Hospitalisation
  Dementia -> Hospitalisation
}

For this DAG the minimal adjustment set is {Age} - you only need to adjust for age.

Tip

Tools

A DAG tells you what to adjust for - three conditions decide whether adjustment can recover a causal effect at all

Choosing the right covariates is necessary but not sufficient. For an adjusted estimate to carry a causal meaning, three conditions must hold (Hernán & Robins, What If, ch. 3). They underlie every method in this guide (regression, Cox, IP weighting), not just the advanced ones:

  • Exchangeability (no unmeasured confounding): once you adjust for the covariates in your DAG, the assumption is that the exposed and unexposed are comparable. This is an assumption you can never fully verify.
  • Positivity (overlap): within every combination of those covariates, both exposure and non-exposure actually occur. If some people could never be exposed (or never unexposed), there is no one to compare them with. Very large IP weights are the practical warning sign.
  • Consistency (a well-defined exposure): the exposure corresponds to a clear enough intervention that a “what if everyone were / were not exposed” world is meaningful. A concrete exposure (“this surgery on this date”) satisfies it far better than a vague one (“obesity”).
These are the same three conditions restated for weighting on the IP weighting page.

4. The comparator cohort

Many studies compare an exposed group with a comparator cohort. How you build it is a design decision to be made on paper - before writing code.

Things to consider:

  • Who is an appropriate comparator? Either an active comparator (unexposed people with the same indication, e.g. the same underlying disease but a different or no treatment - reduces confounding by indication) or a matched background population (maximal contrast). The choice depends on the question; expanded in Comparison cohort.
  • Index date for the comparator cohort. Your exposed cohort has an index date determined by the exposure (e.g. the surgery date). The comparator cohort does not - it must be assigned a date, typically the same date as the matched exposed person, so both groups are followed from a comparable point in time.
  • Eligibility at index. The comparator cohort must meet the inclusion criteria on their assigned index date - otherwise you risk immortal time bias (a distortion that arises when a person is assigned exposure time during which they by definition could not yet have experienced the outcome).
  • Matching variables and ratio. E.g. age, sex and calendar year; decide the ratio (e.g. 1:5).
  • Can anyone in the comparator cohort become exposed later? E.g. can a person who started as a control later undergo surgery? Decide what happens in that case - whether they remain a control or transfer to the exposed group.
  • The same exclusions are applied to both groups.

→ The complete pattern for cohort construction and matching is in Phase 10 - Build your study population.


5. Get an overview - pen and paper

Before opening R, answer these questions in writing:

  1. Which variables do I need? (patient information - age, sex, diagnoses etc. - and for which years)
  2. Which registers contain this information? (LPR, BEF, LMDB, …)
  3. In what order should data be assembled? (define population → extract outcome → extract covariates)

A solid overview on paper saves many hours of debugging in code.

Example: overview for a dementia study
Population:   Adults who have undergone bariatric surgery (identified via the Danish Obesity
              Treatment Database - DBSO), 2010–2024
              Matched comparators from the Population Register (BEF)

Outcome:      First dementia diagnosis (LPR - ICD-10: F00–F03, G30–G31)
              Date: first contact with a dementia code after the surgery date

Covariates:   Age and sex (BEF)
              Comorbidity (LPR - 5-year lookback, i.e. diagnoses in the 5 years before index date)
              Education (UDDA)
              Income (FAIK via BEF familie_id)
              Employment (AKM)

Censoring:    Death (DODSAARS)
              Emigration (VNDS)
              End of study period (31 Dec 2024)

6. Write an analysis plan

An analysis plan is a document you write before looking at your data. It forces you to commit to design, statistics and variables before results can colour your decisions.

Use the STROBE checklist as a skeleton: STROBE Statement - checklists →

For register-based studies, RECORD extends STROBE with items on routinely collected data, and RECORD-PE covers pharmacoepidemiology specifically: RECORD-PE (EQUATOR Network) →

Pre-register your analysis plan on e.g. OSF - this is good scientific practice and required by many journals: Open Science Framework - registration templates

Note

Power and sample size. Even large registers have limited power (the ability to detect an effect that is really there) for rare outcomes or small subgroups. Already in the plan, consider the smallest effect you could meaningfully detect with your expected number of events. The pwr package does simple power/sample-size calculations; for survival and rate designs it is often the number of events (not the number of people) that drives your power.


7. Next steps

Once you have your overview in place:

Back to top