Function conventions
To develop the user-facing and internal functions effectively, we follow a few conventions and design patterns. This section describes three of them: naming patterns for functions and arguments, input requirements for arguments, and the data structure of the output.
The conventions below are ideals, meant as guidelines to help with developing and understanding the code; they are not hard rules.
Naming
- First word is an action verb, later words are objects or conditions.
- Functions that filter by dropping rows based on specific criteria are prefixed with drop_.
- Functions that filter by keeping rows based on specific criteria are prefixed with keep_.
- Helpers that add columns needed for classification are prefixed with add_.
- Helpers that join the output of other functions are prefixed with join_.
- Functions that prepare and process register data are prefixed with prepare_.
- We assume the register data is not taken directly from the original SAS files, but has undergone prior pre-processing and cleaning. These assumptions are checked, and the user is informed if they are not met.
- We also assume that the original source files have been loaded and joined into a single dataset object per register. Although Statistics Denmark stores data by year, all years for a register must be merged into one dataset.
- As few arguments as possible, with as few core required arguments as possible (ideally one or two).
- keep_ functions take a register as the first argument.
- One input register database at a time.
- drop_ functions can take a register as the first argument or take the output from a keep_ function.
- Function arguments take a single DuckDB type object as register input (e.g. duckplyr_df), consistent with the assumption that each register is provided as a single, unified data frame.
- The first argument will always take a data frame type object.
- The second argument could be an output data frame object from another function.
Output
- All functions output the same type of object as the input object (a duckplyr_df type object).
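As a minimal sketch of these conventions, the hypothetical helpers below follow the keep_ and drop_ naming pattern, take a data frame as the first argument, and return the same type of object they receive. A base R data.frame stands in for a duckplyr_df, and the function names keep_diabetes_rows() and drop_pregnancy_rows() are illustrative assumptions, not functions from the package.

```r
# Hypothetical helpers illustrating the conventions; not osdc's actual code.
# A plain data.frame stands in for a duckplyr_df.

# keep_: takes a register as the first argument and keeps matching rows.
keep_diabetes_rows <- function(register) {
  stopifnot(is.data.frame(register))
  register[register$is_diabetes_code, , drop = FALSE]
}

# drop_: takes a register, or the output of a keep_ function, and drops rows.
drop_pregnancy_rows <- function(data) {
  stopifnot(is.data.frame(data))
  data[!data$is_pregnancy_code, , drop = FALSE]
}

# Toy register with made-up values.
register <- data.frame(
  pnr = c("1", "2", "3"),
  is_diabetes_code = c(TRUE, TRUE, FALSE),
  is_pregnancy_code = c(FALSE, TRUE, FALSE)
)

kept <- keep_diabetes_rows(register)
result <- drop_pregnancy_rows(kept)
```

Because each helper returns the same kind of object it was given, the output of a keep_ function can be passed straight into a drop_ function.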
Interface
The osdc package contains one main function that uses the Danish registers to classify individuals as having either type 1 or type 2 diabetes, along with a few helper functions for pre-processing.
prepare_lpr*()
To classify diabetes status, we need the patient registers with diagnosis information (known collectively as Landspatientregisteret, LPR). There isn’t just one LPR but several different LPRs that have evolved over time. Statistics Denmark (DST) recently created a new LPR (LPR3A) that resolves some issues with the previous LPR registers. Each version of LPR contains different tables and variables, though osdc only needs specific variables from two tables.
We originally required each original LPR register as separate arguments in classify_diabetes(), but this became an issue after the new LPR3A was created. So, we re-designed classify_diabetes() to take only one lpr argument and instead require the different LPRs be pre-processed and joined before entering classify_diabetes(). This way, we can add new pre-processing functions for any future changes to LPR without changing the interface of classify_diabetes().
To help with this pre-processing, we designed several helper functions that follow the pattern prepare_lpr*(), e.g. prepare_lpr2() for LPR2. This way, if DST updates the LPR again, we can add another prepare_lpr*() function to prepare the new LPR format for classification.
Unfortunately, the data covered by different revisions of the same registers are not cleanly separated. E.g. data from the year 2005 overlaps between sysi (years 1990 through 2005) and sssy (2005 onward), and data from 2017 and 2018 are contained in both lpr2 (1977 through 2018) and lpr3a (2017 onward). This means that the user must be careful to pre-process these data to avoid duplicated rows!
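One way to guard against such duplicates is sketched below. It assumes, purely for illustration, that both revisions have already been prepared into the same columns and that overlapping records are exact duplicates; the toy data and the choice of duplicated() are assumptions, not the package's method.

```r
# Toy stand-ins for two overlapping revisions of the same register.
# All values are made up for illustration.
lpr2 <- data.frame(
  pnr = c("1", "2", "3"),
  date = as.Date(c("2016-05-01", "2017-03-10", "2018-11-20"))
)
lpr3a <- data.frame(
  pnr = c("2", "3", "4"),
  date = as.Date(c("2017-03-10", "2018-11-20", "2019-02-02"))
)

# Stack the two revisions, then drop rows that are exact duplicates
# across all columns (here: the records from 2017 and 2018).
combined <- rbind(lpr2, lpr3a)
deduplicated <- combined[!duplicated(combined), , drop = FALSE]
```

If overlapping records differ in any column, exact matching is not enough, and the overlapping years would instead need to be taken from only one of the two revisions.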
Each prepare_lpr*() outputs a DuckDB object with the following variables: pnr, date, is_primary_diagnosis, is_diabetes_code, is_t1d_code, is_t2d_code, is_endocrinology_dept, is_medical_dept, and is_pregnancy_code. A final join_registers() helper function then combines the outputs of each prepare_lpr*() into a single data object. See the help docs for prepare_lpr() for more details on these variables. See the diagram below for the general flow of data sources and the different functions that prepare them for the classify_diabetes() function.
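The shared output format can be illustrated with a small sketch. The helpers has_expected_columns() and make_prepared() are hypothetical, a base R data.frame stands in for the DuckDB object, and combining with rbind() is only an assumption about what joining the prepared outputs might look like.

```r
# Variables that each prepare_lpr*() output is described as containing.
expected_columns <- c(
  "pnr", "date", "is_primary_diagnosis", "is_diabetes_code", "is_t1d_code",
  "is_t2d_code", "is_endocrinology_dept", "is_medical_dept",
  "is_pregnancy_code"
)

# Hypothetical check; not part of osdc.
has_expected_columns <- function(data) {
  is.data.frame(data) && setequal(names(data), expected_columns)
}

# Hypothetical builder of a one-row prepared output with made-up values.
make_prepared <- function(pnr, date) {
  data.frame(
    pnr = pnr,
    date = as.Date(date),
    is_primary_diagnosis = TRUE,
    is_diabetes_code = TRUE,
    is_t1d_code = FALSE,
    is_t2d_code = TRUE,
    is_endocrinology_dept = FALSE,
    is_medical_dept = TRUE,
    is_pregnancy_code = FALSE
  )
}

prepared_lpr2 <- make_prepared("1", "2010-01-01")
prepared_lpr3a <- make_prepared("2", "2019-01-01")

# Only combine outputs that share the expected set of variables.
stopifnot(
  has_expected_columns(prepared_lpr2),
  has_expected_columns(prepared_lpr3a)
)
combined <- rbind(prepared_lpr2, prepared_lpr3a)
```

Since every prepared output has the same variables, a new LPR revision only needs its own prepare function to fit into the same flow.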
classify_diabetes()
This function classifies those with diabetes (type 1 or 2) based on the Danish registers described in this vignette and in vignette("data-sources"). All data sources needed by osdc are used as input for this function. The specific details of the classification algorithm are described in vignette("algorithm").
There is one argument in classify_diabetes() for each required data source. The names and descriptions of these arguments are as follows:
- bef: The register or set of registers called ‘CPR-registerets befolkningstabel’ in Danish.
- lmdb: The register or set of registers called ‘Laegemiddelstatistikregisteret’ in Danish.
- lpr_adm: The register or set of registers called ‘Landspatientregisterets administrationstabel (LPR2)’ in Danish.
- lpr_diag: The register or set of registers called ‘Landspatientregisterets diagnosetabel (LPR2)’ in Danish.
- lpr3a_kontakt: The register or set of registers called ‘Landspatientregisterets kontakttabel (LPR3A)’ in Danish.
- lpr3a_diagnose: The register or set of registers called ‘Landspatientregisterets diagnosetabel (LPR3A)’ in Danish.
- lpr3f_kontakter: The register or set of registers called ‘Landspatientregisterets kontakttabel (LPR3F)’ in Danish.
- lpr3f_diagnoser: The register or set of registers called ‘Landspatientregisterets diagnosetabel (LPR3F)’ in Danish.
- sysi: The register or set of registers called ‘Sygesikringsregisteret’ in Danish.
- sssy: The register or set of registers called ‘Sygesikringsregisteret’ in Danish.
- lab_forsker: The register or set of registers called ‘Laboratoriedatabasens forskertabel’ in Danish.
The output is a DuckDB object with five columns:
- pnr: The pseudonymised social security number of individuals in the diabetes population (one row per individual).
- stable_inclusion_date: The stable inclusion date (i.e., the raw inclusion date, kept only for individuals included in the time period where data coverage is sufficient to make incident cases reliable, and missing otherwise).
- raw_inclusion_date: The raw inclusion date (i.e., the date of the second inclusion event as described in vignette("algorithm")).
- has_t1d: A logical column indicating whether the individual has type 1 diabetes.
- has_t2d: A logical column indicating whether the individual has type 2 diabetes.
For an example, see below.
Example rows of the data.frame output of the osdc package.
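| pnr | stable_inclusion_date | raw_inclusion_date | has_t1d | has_t2d |
|-----|-----------------------|--------------------|---------|---------|
| 1 | 2020-01-01 | 2020-01-01 | TRUE | FALSE |
| 4 | NA | 1995-04-19 | FALSE | TRUE |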
Individuals 1 and 4 have both been classified as having diabetes, as indicated by has_t1d and has_t2d, respectively. Individual 1 is classified as having type 1 diabetes (T1D) with an inclusion date of 2020-01-01. Since this date falls within a time period of sufficient data coverage, the column stable_inclusion_date is populated with the same date as raw_inclusion_date.
The individual in the second row, 4, is classified as having type 2 diabetes (T2D) with an inclusion date of 1995-04-19. Since 1995 falls within a time period of insufficient data coverage, the validity of this inclusion date is uncertain and stable_inclusion_date is NA. However, raw_inclusion_date still contains the inclusion date of this individual.
In the context of generating a diabetes population with valid inclusion dates (e.g. true incident cases), three aspects of the register records were considered when determining which periods of time had sufficient data available:
- Sufficient data on inclusion events: While HbA1c test results are the diagnostic standard, these records are the newest addition to the register data ecosystem and have limited historical coverage nationwide. According to supplementary analyses by Isaksen et al.[@Isaksen2023sup], this data has complete nationwide coverage from Q4 2015 onward (direct link to supplementary file S9). However, as the vast majority of diabetes patients are treated with glucose-lowering drugs at some point, we made the pragmatic assessment that prescription drug purchase data are sufficient to identify incident cases. These are available from 1995 onward.
- Sufficient data on exclusion events: In order to correctly identify pregnancies and discard inclusion events that may occur due to gestational diabetes rather than T1D or T2D, register information on pregnancy occurrences is necessary. In the patient register, this information is available from 1994 onward, but coverage is insufficient until 1997, according to supplementary analyses by Isaksen[@isaksen2023thesis] (direct link to analysis).
- Sufficient wash-out period: In order to “wash out” prevalent cases from true incident cases, a period of time with valid data is necessary to capture prevalent cases, before new inclusions can be considered true incident cases and the incidence stabilizes. We considered a full year to be enough.
Given the above requirements of complete nationwide data on inclusion and exclusion events, as well as a sufficient wash-out period to establish valid incident cases, the algorithm was designed to restrict valid inclusion dates to periods where all criteria are met. Consequently, only inclusion dates occurring from 1998 onward are considered true incident cases and assigned a stable_inclusion_date value.
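The rule for stable_inclusion_date can be sketched as follows. The helper add_stable_inclusion_date() is hypothetical, a base R data.frame stands in for the DuckDB object, and the cut-off of 1998-01-01 is our reading of "from 1998 onward"; this is not the package's implementation.

```r
# Assumed cut-off, read from the text: inclusions from 1998 onward are stable.
stable_from <- as.Date("1998-01-01")

# Hypothetical helper; not osdc's implementation.
# Copies raw_inclusion_date and sets dates before the cut-off to NA.
add_stable_inclusion_date <- function(data) {
  stopifnot(
    is.data.frame(data),
    inherits(data$raw_inclusion_date, "Date")
  )
  stable <- data$raw_inclusion_date
  stable[!is.na(stable) & stable < stable_from] <- NA
  data$stable_inclusion_date <- stable
  data
}

# The two example individuals from the table above.
classified <- data.frame(
  pnr = c("1", "4"),
  raw_inclusion_date = as.Date(c("2020-01-01", "1995-04-19"))
)
classified <- add_stable_inclusion_date(classified)
```

With this rule, individual 1 keeps 2020-01-01 as the stable inclusion date, while individual 4, included in 1995, gets a missing value.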