Design • osdc

Principles

These are the guiding principles for this package:

Functionality is as agnostic to data format as possible (e.g. can be used with SQL or Arrow connections, in a data.table format, or as a data.frame).
Functions have consistent inputs and outputs (e.g. inputs and outputs are the same, regardless of specific conditions).
Functions have predictable outputs based on inputs (e.g. if an input is a data frame, the output is a data frame).
Functions have consistent naming based on their action.
Functions have limited additional arguments.
Casing of input variables (upper or lower case) is agnostic, all internal variables are lower case, and output variables are lower case.

Use cases

We make these assumptions on how this package will be used, based on our experiences and expectations for use cases:

Entirely used within the Denmark Statistics or the Danish Health Authority’s servers, since that is where their data are kept.
Used by researchers within or affiliated with Danish research institutions.
Used specifically within a Danish register-based context.

Below is a set of “narratives” or “personas” with associated needs that this package aims to fulfill:

“As a researcher, …”
- “… I want to determine which registers and variables to request from Denmark Statistics and Danish Health Data Authority, so that I am certain I will be able to classify diabetes status of individuals in the registers.”
- “… I want to easily and simply create a dataset that contains data on diabetes status in my population, so that I can begin conducting my research that involves persons with diabetes without having to tinker with coding the correct algorithm to classify them.”
- “… I want to be informed early and in a clear way whether my data fits with the required data type and values, so that I can fix and correct these issues without having to do extensive debugging of the code and/or data.”

Core functionality

This is the list of functionality we aim to have in the osdc package

Classify individuals type 1 and type 2 diabetes status and create a data frame with that information and the date of onset of diabetes.
Provide helper functions to check and process individual registers for the variables required to enter into the classifier.
Provide a list of required variables and registers in order to calculate diabetes status.
Provide validation helper functions to check that variables match what is expected of the algorithm.
Provide a common and easily accessible standard for determining diabetes status within the context of research using Danish registers.

Function conventions

Effectively developing both the main user-exposed and internal functions requires following some conventions and design patterns for building these functions. There are a few conventions we describe here: naming patterns for functions and arguments, their argument input requirements, and their output data structure.

The below conventions are ideals only, to be used as a guidelines to help with development and understanding of the code. They are not hard rules.

Naming

First word is an action verb, later words are objects or conditions.
Exclusion criteria are prefixed with exclude_.
Inclusion criteria are prefixed with include_.
Helpers that get or extract a condition (e.g., “pregnancy” or “date of visit”) are prefixed with get_.
Helpers that drop or keep a specific condition are prefixed with drop_ or keep_ (e.g., “first visit date to maternal care for pregnancy after 40 weeks”). These types of helpers likely are contained in the get_ functions.
Helpers that join registers or output of other functions are prefixed with join_.

Input

As few arguments as is possible, with as few core required arguments as possible (ideally one or two).
include_ functions take a register as the first argument.
- One input register database at a time.
exclude_ functions can take a register as the first argument or take the output from an include_ function.
All functions take a data.frame type object as input. This input object doesn’t need to be strictly a data.frame as long as it acts like a data.frame. For instance, it could be a data.table, a tibble, or a duckdb object.
The first argument will always take a data frame type object.
The second argument could be an output data frame object from another function (usually include_).

Output

All functions output the same type of object as the input object (a data.frame type object). For instance, if the input is a data.table object, the output will also be a data.table.

Interface

The osdc package contains one main function that classifies individuals into those with either type 1 or type 2 diabetes using the Danish registers: classify_diabetes(). This function classifies those with diabetes (type 1 or 2) based on the Danish registers described in the vignette("design") and vignette("data-sources"). All data sources are used as input for this function. The specific inclusion and exclusion details are described in the vignette("algorithm").

Input

There is one argument in classify_diabetes() for each required register, so the argument is:

variable_description |>
  distinct(register_name, register_abbrev) |>
  glue::glue_data("- `{register_abbrev}`: The register that is called '{register_name}' in Danish.")

bef: The register that is called ‘CPR-registerets befolkningstabel’ in Danish.
lmdb: The register that is called ‘Laegemiddelstatistikregisteret’ in Danish.
lpr_adm: The register that is called ‘Landspatientregisterets administrationstabel (LPR2)’ in Danish.
lpr_diag: The register that is called ‘Landspatientregisterets diagnosetabel (LPR2)’ in Danish.
kontakter: The register that is called ‘Landspatientregisterets kontakttabel (LPR3)’ in Danish.
diagnoser: The register that is called ‘Landspatientregisterets diagnosetabel (LPR3)’ in Danish.
sysi: The register that is called ‘Sygesikringsregisteret’ in Danish.
sssy: The register that is called ‘Sygesikringsregisteret’ in Danish.
lab_forsker: The register that is called ‘Laboratoriedatabasens forskertabel’ in Danish.

Output

The output is a data.frame type object which includes four columns:

pnr: The pseudonymised social security number of individuals in the diabetes population (one row per individual).
stable_inclusion_date: The stable inclusion date (i.e., the raw date mutated so only individuals included in the time-period where data coverage is sufficient to make incident cases reliable)¹
raw_inclusion_date: The raw inclusion date (i.e., the date of the second inclusion event as described in the Extracting the diabetes population section above)
diabetes_type The classified diabetes type

For an example, see below.

Example rows of the `data.frame` output of the osdc package.
pnr	stable_inclusion_date	raw_inclusion_date	diabetes_type
0000000001	2020-01-01	2020-01-01	T1D
0000000004	NULL	1995-04-19	T2D

The individuals 0000000001 and 0000000004 have been classified as having diabetes (T1D and T2D, respectively). 0000000001 is classified as having type 1 diabetes (T1D) with an inclusion date of 2020-01-01. Since this date is within a time-period of sufficient data coverage, the column stable_inclusion_date is populated with the same date as raw_inclusion_date.

The individual in the second row, 0000000004 is classified as having type 2 diabetes T2D with an inclusion date of 1995-19-04. Since 1995 is within a time-period of insufficient data coverage, stable_inclusion_date is NULL. However, raw_inclusion_date still contains the inclusion date of this individual.