Packages, algorithms and help
Which tool for which task - and what is available on DST
This is a reference hub: which tool solves which task, whether it is available on DST, and where to find more help. The guide takes you to an analysis-ready dataset and points you towards the right tools for the actual analysis - but does not teach the statistical methods. For that you need a statistician, a course, or the packages’ own documentation.
1. Validated algorithms
Published, peer-reviewed methods you can use directly to construct covariates. Code for NMI and the SEPLINE algorithm is under development.
| Algorithm | What it does | Use it to | Article |
|---|---|---|---|
| Open Source Diabetes Classifier (OSDC) | Classifies everyone in Denmark as T1D, T2D or no diabetes | Define a diabetes population, type and onset date | doi:10.2147/CLEP.S407019 |
| Nordic Multimorbidity Index (NMI) | Weighted comorbidity score (50 predictors) | Adjust for comorbidity in a model or describe comorbidity burden | doi:10.2147/CLEP.S353398 |
| SEPLINE - Socioeconomic Position in Epidemiological Research | National guideline for socioeconomic variables | Education, income, employment as covariates | doi:10.2147/CLEP.S520772 |
Charlson as an alternative to NMI: both heaven::charlsonIndex() and EpiForsk (see below) calculate Charlson comorbidity score from LPR diagnoses - both are on DST.
2. Existing functions for data management and extraction
heaven - Danish register data toolkit
(pre-installed on DST; not on CRAN - github.com/tagteam/heaven)
Functions built specifically for Danish register data by Thomas Gerds (University of Copenhagen, Biostatistics): Lexis-splitting for time-varying covariates, drug exposure windows from LMDB, risk-set matching and Charlson score.
What heaven can do - and when to use it
Lexis-splitting - split follow-up time across multiple time axes when a covariate changes over time (age, calendar period, time since exposure).
library(heaven)
df_split <- lexisTwo(
indat = kohort,
invars = c("pnr", "inn", "out", "event"), # id, start date, end date, outcome
splitvars = list(age = c(40, 50, 60, 70), # age boundaries
cal = c(2010, 2015, 2020)) # calendar years
)
# Returns: one data frame with MORE rows per person - one per time interval.
# New columns are added for the current bands (age group, calendar period etc.).
# Ready for a Cox model with time-varying covariates.Drug exposure from LMDB - calculate exposure periods from prescription data (e.g. “was the person on metformin at index date?”).
exp_table <- medicinMacro(
indat = lmdb_data, id = "pnr", atc = "atc", eksd = "eksd",
ddds = "ddds", maxdepot = 3
)
# Returns: one row per exposure period per person with start and end date
# for the period. Use subsequent joins to determine exposure at index date.Charlson comorbidity score from LPR diagnoses:
charlson_data <- charlsonIndex(indat = lpr_diag_data, id = "pnr",
code = "c_diag", code_type = "icd10")
# Returns: one row per person with total Charlson score and binary indicators
# per Charlson disease group. The function matches ICD-10 codes to categories itself.Risk-set matching - match cases to controls at index date:
matched <- riskSetMatch(ptid = "pnr", event = "event",
terms = c("koen", "birth_year"), dat = kohort, ratio = 5)
# Returns: matched dataset with matchID linking cases and controls.heaven is confirmed available on DST - use library(heaven) directly. It is not on CRAN, so outside DST it must be fetched from GitHub.
EpiForsk - register data helper functions from SSI
(CRAN; on DST - github.com/Laksafoss/EpiForsk)
Code-sharing package from the Statens Serum Institut. Particularly useful: flatten_date_intervals() which merges overlapping hospital contacts into continuous periods - necessary before comorbidity lookback. Danish Charlson score included.
Most important functions and when to use them
flatten_date_intervals() - overlapping contacts
A person can have two overlapping admissions in LPR (e.g. a transfer). Before comorbidity lookback, overlapping periods must be merged:
library(EpiForsk)
lpr_flat <- flatten_date_intervals(
data = lpr_contacts, id = "pnr",
start_date = "d_inddto", end_date = "d_uddto"
)
# Returns: a data frame with merged start/end per person -
# overlapping periods are reduced to the total covered period.charlson_score() - Danish-calibrated Charlson comorbidity score from ICD-10 (incl. D-prefix). See the callout below for what it requires.
charlson <- charlson_score(data = lpr_diag_data, id = "pnr", icd_codes = "c_diag")
# Returns: one row per person with Charlson score.What does charlson_score() require? You must extract a lookback period from LPR yourself (typically 5 years before index date) and give the function all diagnoses in that window - A, B and G type. You do not need to filter on specific ICD codes in advance; the function matches the codes to Charlson categories itself. The D-prefix (DG30 etc.) is handled by EpiForsk - it does not need to be stripped first.
# Example: extract lookback and pass all diagnoses
# pnr and d_inddto are in lpr_adm - not lpr_diag
# Step 1: broadest possible time window for the entire cohort
min_lookback <- min(kohort$index_date) - 365.25 * 5 # earliest lookback start
# Step 2: fetch contacts and join to diagnoses
lpr2_lookback <- lpr_adm %>%
semi_join(tibble(pnr = kohort$pnr), by = "pnr") %>% # only the cohort's pnr's
filter(d_inddto >= !!min_lookback) %>% # broad pre-filter
select(pnr, recnum, d_inddto) %>%
inner_join(
lpr_diag %>%
filter(c_diagtype %in% c("A", "B", "G")) %>% # A/B/G for comorbidity
select(recnum, c_diag),
by = "recnum"
) %>%
collect() %>%
left_join(kohort %>% select(pnr, index_date), by = "pnr") %>%
filter(d_inddto >= index_date - 365.25 * 5, # precise per-person lookback
d_inddto < index_date)
charlson <- charlson_score(data = lpr2_lookback, id = "pnr", icd_codes = "c_diag")dataReporter - automatic data quality report
(CRAN; confirmed on DST - github.com/ekstroem/dataReporter)
Automatically generates an HTML/PDF report of all columns in a dataset: distribution, missing values, outliers and type checks. Useful for validating a new register extraction before starting analysis. Published in Journal of Statistical Software.
library(dataReporter)
makeDataReport(my_extract) # opens report in browserSee also Phase 7 - Inspect your data for manual inspection commands.
fakeregs - synthetic Danish register data
(GitHub only - local use only - github.com/steno-aarhus/fakeregs)
Fictitious but structurally realistic data for nine DST registers. For practising and testing code locally without DST access. Not used on the server. See Phase 6 - First extraction.
fastreg - SAS → parquet (and reading registers back)
(CRAN; install.packages("fastreg") - dp-next.github.io/fastreg)
The recommended tool for converting SAS registers to parquet. It converts (convert(), partitioned by year, with a use_template() targets pipeline for the whole workspace) and reads back (read_register("bef") returns a lazy DuckDB connection, reading all yearly subfolders as one dataset). Set options(fastreg.project_rawdata_dir=, fastreg.project_workdata_dir=) once, then refer to registers by name. See Phase 4 - SAS → parquet.
3. Analysis tools
All packages in this section are confirmed on DST (June 2026). The guide does not teach the statistical methods - see package documentation for methodological choices.
survival - the foundation for survival analysis
(CRAN; always available - cran.r-project.org/package=survival)
Provides Surv(time, status), coxph() and survfit() - the fundamental building blocks on which all survival packages below are built. You rarely call {survival} directly for plots and tables, but library(survival) belongs at the top of your analysis scripts.
library(survival)
cox <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)
summary(cox) # HR, 95% CI, p-values - pass to Publish or gtsummary for formatted outputgtsummary - Table 1 and manuscript tables
(CRAN - cran.r-project.org/package=gtsummary)
Standard for manuscript tables: tbl_summary() for Table 1 stratified by exposure, tbl_regression() for Cox and logistic regression, tbl_survfit() for survival summaries. Published in R Journal 2021.
Example: Table 1
library(gtsummary)
table1 <- cohort %>%
select(group, age, sex, nmi_score, occupation_cat) %>%
tbl_summary(
by = group, # stratify by exposure
statistic = list(
all_continuous() ~ "{mean} ({sd})", # continuous: mean (SD)
all_categorical() ~ "{n} ({p}%)" # categorical: n (%)
),
missing = "no"
) %>%
add_overall() %>%
add_p()Output control: Counts of 0–4 may not be repatriated. Manual rounding: round(x / 5) * 5. Z.gtsummary.addons::round_5_gtsummary() automates this - not on DST, import via data manager.
carpenter - pipe-based table structure
(CRAN - github.com/lwjohnst86/carpenter)
Older alternative to gtsummary with a simpler pipe-based API: start with outline_table(), add rows with add_rows(), finish with build_table(). By Luke Johnston.
library(carpenter)
outline_table(cohort, "group") %>%
add_rows("age", stat_meanSD) %>% # continuous: mean (SD)
add_rows("sex", stat_nPct) %>% # categorical: n (%)
add_rows("nmi_score", stat_medianIQR) %>% # continuous: median (IQR)
build_table()prodlim - Kaplan–Meier and Aalen–Johansen cumulative incidence
(CRAN - github.com/tagteam/prodlim)
Estimates and plots survival curves and cumulative incidence. Hist() handles competing risks with a single status variable: 0 = censored, 1 = primary event, 2 = competing event (e.g. death). Use prodlim for descriptive curves - riskRegression for regression.
Example: KM curve and Aalen–Johansen
library(prodlim)
# Kaplan–Meier:
fit_km <- prodlim(Hist(time, event) ~ group, data = cohort)
plot(fit_km)
# Aalen–Johansen cumulative incidence (competing risks):
# status: 0 = censored, 1 = primary event, 2 = competing event (e.g. death)
fit_aj <- prodlim(Hist(time, status) ~ group, data = cohort)
plot(fit_aj, cause = 1) # cumulative incidence for primary eventKM alone overestimates cumulative incidence in the presence of competing risks. Use Aalen–Johansen (cause = 1) instead.
riskRegression - Cause-Specific Cox, Fine-Gray and absolute risk
(CRAN - github.com/tagteam/riskRegression)
For regression with competing risks. CSC() fits Cause-Specific Cox (one model per cause), FGR() fits Fine-Gray subdistribution hazard. predictRisk() computes absolute risk at specific time points. Score() validates and compares models (Brier score, time-dependent AUC).
Example: CSC model and absolute risk
library(riskRegression)
library(survival)
# Cause-Specific Cox:
csc_model <- CSC(Hist(time, status) ~ age + sex + nmi_score, data = cohort)
summary(csc_model)
# Absolute risk at 1, 3 and 5 years for primary event:
ar <- predictRisk(csc_model, newdata = cohort,
times = c(1, 3, 5), cause = 1)
# ar: matrix - one column per time point, one row per personPublish - regression results ready for manuscript
(CRAN - github.com/tagteam/Publish)
regressionTable() takes a fitted model (Cox, logistic, linear) and returns a table with estimates, 95% CI and p-values. plotConfidence() produces forest plots from the same output.
library(Publish)
library(survival)
cox_model <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)
regressionTable(cox_model) # HR, 95% CI, p-values
plotConfidence(regressionTable(cox_model)) # forest plotMatchIt - propensity score and exact matching
(CRAN - cran.r-project.org/package=MatchIt)
Matches exposed to unexposed on observed covariates. Supports nearest-neighbour, exact and optimal matching. summary() gives balance statistics (SMD), match.data() returns the matched dataset. Alternative to heaven::riskSetMatch() with better balance diagnostics.
Example: 1:1 nearest-neighbour matching
library(MatchIt)
m <- matchit(exposed ~ age + sex + nmi_score,
data = cohort,
method = "nearest", # nearest-neighbour matching
ratio = 1) # 1:1
# Balance check - SMD < 0.1 is a common rule of thumb:
summary(m, standardize = TRUE)
# Matched dataset:
matched <- match.data(m)pec - prediction error curves
(CRAN - cran.r-project.org/package=pec)
Computes and plots Brier score over time for comparing and validating prediction models. Niche - only relevant for prediction models; riskRegression::Score() usually covers the need.
BuyseTest - Generalised Pairwise Comparisons
(CRAN, confirmed on DST v3.3.3 - github.com/bozenne/BuyseTest)
Implements Generalised Pairwise Comparisons (GPC): compares two groups via win ratio, net benefit and win probability on prioritised endpoints. Niche - relevant when you have a hierarchy of endpoints and want to avoid classical event-time assumptions.
library(BuyseTest)
# One time-to-event endpoint:
result <- BuyseTest(group ~ tte(time, event, threshold = 0),
data = cohort,
method.inference = "u-statistic")
summary(result) # win ratio, net benefit, win probabilitypowerBuyseTest() calculates the required sample size given desired power and expected effect - useful before data collection or grant applications.
kinship2 - pedigrees and kinship matrices
(CRAN - cran.r-project.org/web/packages/kinship2)
Creates pedigree objects from parent–child relationships and computes kinship coefficients (probability of shared allele: 0.5 = parent/child, 0.25 = siblings). Useful for multi-generational family studies. By Terry Therneau (Mayo Clinic).
Example: create pedigree and compute kinship
library(kinship2)
# Create pedigree:
# DST's koen variable (1 = male, 2 = female) matches kinship2's expectation directly
ped <- pedigree(
id = family$pnr,
dadid = family$pnr_father, # 0 or NA if father unknown
momid = family$pnr_mother, # 0 or NA if mother unknown
sex = family$koen # 1 = male, 2 = female
)
# Kinship coefficients:
km <- kinship(ped)
# Visualise family tree:
plot(ped)tame - cluster medication exposure from LMDB
(CRAN - github.com/Laksafoss/tame)
Agglomerative hierarchical clustering of medication data using a custom distance measure combining ATC classification, timing and dose. medic() is the primary function. Relevant for characterising polypharmacy patterns in cohorts. By Anders Laksafoss (SSI).
Example: cluster medication patterns
library(tame)
# Cluster on ATC code alone:
clust <- medic(lmdb_data, id = pnr, atc = atc, k = 5)
# k = desired number of clusters - choose with scree plot or domain knowledge
# With timing information (column range):
clust <- medic(lmdb_data, id = pnr, atc = atc,
timing = trimester_1:trimester_3, # columns defining time windows
k = 5)
summary(clust) # summarise clustersid, atc and timing use bare column names (NSE - no quotes). lmdb_data must have one row per prescription dispensing.
Possible inspiration - not confirmed on DST
famnet (GitHub, experimental) - identifies family relationships from person IDs: given a dataset with PersonID, FatherID and MotherID, fmn_siblings() finds all sibling pairs with relationship type. Relevant for family linkage across generations in Danish register data. By Luke Johnston.
plotsurv (GitHub, not on DST) - publication-ready Kaplan–Meier and Aalen–Johansen competing-risks plots with at-risk table, CI bands and censoring ticks. plotsurv(survfit_object).
Z.gtsummary.addons (GitHub, not on DST) - extends gtsummary with add_SMD() (standardised mean differences for Table 1) and round_5_gtsummary() (rounds counts to the nearest 5 - relevant for DST output control).
Packages mentioned in the analysis section (Phase 13)
Tools from the analysis section beyond survival and gtsummary above. Most are on CRAN; confirm availability on DST before use.
| Package | For | Page |
|---|---|---|
MASS (glm.nb) |
negative binomial regression (overdispersion in rates) | Rates (Poisson) |
Epi, popEpi |
person-time, Lexis splitting, rates, SIR/SMR | Rates (Poisson) |
splines (ns) |
non-linear relationships (smooth curves) | Regression |
interactionR |
additive interaction (RERI) | Regression |
lme4, coxme |
mixed models / random effects | Regression |
sandwich, lmtest |
cluster-robust standard errors | Regression |
rms |
flexible regression modelling with splines | Regression |
survRM2 |
RMST (when proportional hazards is violated) | Time-to-event |
SCCS |
self-controlled case series | Special designs |
CMAverse |
causal mediation analysis | Regression |
pwr |
power and sample size | Plan your study |
4. Where to get more help?
Community and courses - see the prioritised help list and course links in Phase 2 - R: the bare essentials (Stack Overflow, Zheers R Coding Café, DDEA, R for Data Science, Epi R Handbook).
Institutional reference material:
| Source | What you find |
|---|---|
| ctpteam/RDST | “R on DST” manual: project setup, data access, good coding practice on the DST server |
| ctpteam/DST | Guide to LPR3, DST manuals, slides and a heaven PDF |
| steno-aarhus/registers-project-database | SDCA’s own database: variable lists, register descriptions and application templates for the Danish Health Data Authority |
| github.com/tagteam | riskRegression, prodlim, Publish, pec, heaven - Thomas Gerds and Brice Ozenne, KU Biostatistics |
Next steps
→ Phase 14 - Export and repatriation
Further depth in The Epidemiologist R Handbook: