Internal function flow • osdc

The function flow describes the functions within the package, both internal and user-facing, which data sources they rely on, and how they are connected to each other. First, the functions for classifying diabetes status are presented, followed by the functions for classifying the diabetes type.

Function flow

This results in the functionality flow for classifying diabetes status seen below. This flow can be divided into two sections: extracting the diabetes population and classifying diabetes type which we will detail in the following sections.

Flow of functions, as well as their required input registers, for classifying diabetes status using the osdc package. Light blue and orange boxes represent filtering functions (inclusion and exclusion events, respectively). Uncoloured boxes are helper functions that get or extract a condition or joins data or function outputs.

Population extraction

In the following sections, we describe the functions used to extract the diabetes population from the Danish registers. The functions are divided into inclusion and exclusion events, and the final diagnosis date is calculated based on these events.

Flow of functions, as well as their required input registers, for extracting the population with diabetes using the osdc package. Light blue and orange boxes represent filtering functions (inclusion and exclusion events, respectively). Uncoloured boxes are helper functions that get or extract a condition or joins data or function outputs.

Inclusion events

prepare_lpr2(): See ?prepare_lpr2 for more information.
prepare_lpr3(): See ?prepare_lpr3 for more information.

`include_diabetes_diagnosis()`

#' Include diabetes diagnoses from LPR2 and LPR3.
#'
#' Uses the hospital contacts from LPR2 and LPR3 to include all dates of diabetes
#' diagnoses to use for inclusion, as well as additional information needed to classify diabetes
#' type. Diabetes diagnoses from both ICD-8 and ICD-10 are included.
#'
#' The output is used as inputs to `join_inclusions()`.
#' This output is passed to the `join_inclusions()` function, where the
#' `dates` variable is used for the final step of the inclusion process.
#' The variables of counts of diabetes type-specific primary diagnoses (the
#' four columns prefixed `n_` above) are carried over for the subsequent
#' classification of diabetes type, initially as inputs to the
#' `get_t1d_primary_diagnosis()` and `get_majority_of_t1d_diagnoses()`
#' functions.
#'
#' @param lpr2 The output from `prepare_lpr2()`.
#' @param lpr3 The output from `prepare_lpr3()`.
#'
#' @return The same type as the input data, default as a [tibble::tibble()],
#'  with the following columns and up to two rows per individual:
#'
#'  -   `pnr`: The personal identification variable.
#'  -   `dates`: The dates of the first and second hospital diabetes diagnosis.
#'  -   `n_t1d_endocrinology`: The number of type 1 diabetes-specific primary
#'      diagnosis codes from endocrinology departments.
#'  -   `n_t2d_endocrinology`: The number of type 2 diabetes-specific primary
#'      diagnosis codes from endocrinology departments.
#'  -   `n_t1d_medical`: The number of type 1 diabetes-specific primary
#'      diagnosis codes from medical departments.
#'  -  `n_t2d_medical`: The number of type 2 diabetes-specific primary
#'      diagnosis codes from medical departments.
#'  -  `has_lpr_diabetes_diagnosis`: A logical variable that acts as a helper
#'      indicator for use in later functions.
#'
#' @keywords internal
#' @inherit algorithm seealso
#'
#' @examples
#' register_data <- simulate_registers(c("lpr_diag", "lpr_adm", "diagnoser", "kontakter"))
#' include_diabetes_diagnosis(
#'   lpr2 = prepare_lpr2(register_data$lpr_diag, register_data$lpr_adm),
#'   lpr3 = prepare_lpr3(register_data$diagnoser, register_data$kontakter)
#' )
include_diabetes_diagnosis <- function(lpr2, lpr3) {
  # Combine and process the two inputs
  lpr2 |>
    dplyr::full_join(lpr3, by = dplyr::join_by(.data$pnr)) |>
    dplyr::select(
      "pnr",
      "dates" = "date"
      # n_t1d_endocrinology =
      # n_t2d_endocrinology =
      # n_t1d_medical =
      # n_t2d_medical =
    ) |>
    dplyr::mutate(has_lpr_diabetes_diagnosis = TRUE)
}

`include_podiatrist_services()`

See ?include_podiatrist_services for more information.

`include_hba1c()`

See ?include_hba1c for more information.

`include_gld_purchases()`

See ?include_gld_purchases for more information.

Exclusion events

`exclude_potential_pcos()`

See ?exclude_potential_pcos for more information.

`exclude_pregnancy()`

#' Exclude any pregnancy events that could be gestational diabetes.
#'
#'
#' The function `exclude_pregnancy()` takes the combined outputs from
#' `prepare_lpr2()`, `prepare_lpr3()`, `include_hba1c()`, and
#' `exclude_potential_pcos()` and uses diagnoses from LPR2 or LPR3 to
#' exclude both elevated HbA1c tests and GLD purchases during pregnancy, as
#' these may be due to gestational diabetes, rather than type 1 or type 2
#' diabetes. The aim is to identify pregnancies based on diagnosis codes
#' specific to pregnancy-ending events (e.g. live births or miscarriages),
#' and then use the dates of these events to remove inclusion events in the
#' preceding months that may be related to gestational diabetes (e.g.
#' elevated HbA1c tests or purchases of glucose-lowering drugs during
#' pregnancy).
#'
#' After these exclusion functions have been applied, the output serves as
#' inputs to two sets of functions:
#'
#' 1.  The censored HbA1c and GLD data are passed to the
#'     `join_inclusions()` function for the final step of the inclusion
#'     process.
#' 2.  the censored GLD data is passed to the
#'     `get_only_insulin_purchases()`,
#'     `get_insulin_purchases_within_180_days()`, and
#'     `get_insulin_is_two_thirds_of_gld_doses()` helper functions for the
#'     classification of diabetes type.
#'
#' @param excluded_pcos Output from `exclude_potential_pcos()`.
#' @param pregnancy_dates Output from `get_pregnancy_dates()`.
#' @param included_hba1c Output from `include_hba1c()`.
#'
#' @returns The same type as the input data, default as a [tibble::tibble()].
#'    Has the same output data as the input `excluded_potential_pcos()`, except
#'    for a helper logical variable `no_pregnancy` that is used in later functions.
#' @keywords internal
#' @inherit algorithm seealso
#'
#' @examples
#' register_data <- simulate_registers(c(
#'   "lmdb", "bef", "lpr_diag", "lpr_adm",
#'   "diagnoser", "kontakter", "lab_forsker"
#' ))
#' lpr2 <- prepare_lpr2(register_data$lpr_diag, register_data$lpr_adm)
#' lpr3 <- join_lpr3(register_data$diagnoser, register_data$kontakter)
#' lmdb |>
#'   include_gld_purchases() |>
#'   exclude_potential_pcos(register_data$bef) |>
#'   exclude_pregnancy(
#'     get_pregnancy_dates(lpr2, lpr3),
#'     include_hba1c(register_data$lab_forsker)
#'   )
exclude_pregnancy <- function(excluded_pcos, pregnancy_dates, included_hba1c) {
  # Filter using the algorithm for pregnancy
  excluded_pcos |>
    # Exclude those who are not pregnant.
    dplyr::full_join(pregnancy_dates, by = dplyr::join_by(.data$pnr)) |>
    dplyr::full_join(included_hba1c, by = dplyr::join_by(.data$pnr)) |>
    # Filtering here...
    dplyr::mutate(no_pregnancy = TRUE)
}

#' Simple function to get only the pregnancy event dates.
#'
#' @param lpr2 Output from `prepare_lpr2()`.
#' @param lpr3 Output from `prepare_lpr3()`.
#'
#' @returns The same type as the input data, default as a [tibble::tibble()].
#' @keywords internal
#' @inherit algorithm seealso
#'
#' @examples
#' register_data <- simulate_registers(c("lpr_diag", "lpr_adm", "diagnoser", "kontakter"), 100)
#' lpr2 <- prepare_lpr2(register_data$lpr_diag, register_data$lpr_adm)
#' lpr3 <- prepare_lpr3(register_data$diagnoser, register_data$kontakter)
#' get_pregnancy_dates(lpr2, lpr3)
get_pregnancy_dates <- function(lpr2, lpr3) {
  # Filter using the algorithm for pregnancy
  lpr2 |>
    dplyr::full_join(lpr3, by = dplyr::join_by(pnr)) |>
    dplyr::filter(has_pregnancy_events) |>
    dplyr::select(
      pnr,
      pregnancy_event_date = date
    )
}

Joining inclusions and exclusions

`join_inclusions()`

#' Join included events.
#'
#' This function joins the outputs from all the inclusion and exclusion
#' functions, by `pnr` and `dates`. Input datasets:
#'
#' - `included_diabetes_diagnoses`: Dates are the first and second hospital diabetes diagnosis.
#' - `included_podiatrist_services`: Dates are the first and second diabetes-specific podiatrist record.
#' - `hba1c_censored_pregnancy`: Dates are the first and second elevated HbA1c test results (after censoring potential gestational diabetes).
#' - `gld_censored_pcos_pregnancy`: Dates are the first and second purchase of a glucose-lowering drug (after censoring potential polycystic ovary syndrome and gestational diabetes).
#'
#' @param included_diabetes_diagnoses Output from [include_diabetes_diagnoses()].
#' @param included_podiatrist_services Output from [include_podiatrist_services()].
#' @param hba1c_censored_pregnancy Output from [exclude_pregnancy()] when given `hba1c` data.
#' @param gld_censored_pcos_pregnancy Output from [exclude_pregnancy()] when given `gld_censored_pcos` data.
#'
#' @returns The same type as the input data, default as a [tibble::tibble()],
#'   with the joined columns from the output of [include_diabetes_diagnoses()],
#'   [include_podiatrist_services()] and [exclude_pregnancy()]. There will be 
#'   1-8 rows per `pnr`.
#' @keyword internal
#' @inherit algorithm seealso
join_inclusions <- function(
    included_diabetes_diagnoses,
    included_podiatrist_services,
    hba1c_censored_pregnancy,
    gld_censored_pcos_pregnancy
    ) {
  # Combine the outputs from the inclusion and exclusion events
  purrr::reduce(
    list(
      included_diabetes_diagnoses,
      included_podiatrist_services,
      excluded_pregnancy
    ),
    # This joins *only* by pnr and dates. If datasets have the same column
    # names, they will be renamed to differentiate them.
    # TODO: We may need to ensure that no two datasets have the same columns.
    \(x, y) dplyr::full_join(x, y, by = dplyr::join_by(.data$pnr, .data$dates))
  )
}

Create inclusion dates

#' Create the final diagnosis date based on all the inclusion event types.
#'
#' The function `create_inclusion_dates()` takes the output from `join_inclusions()`
#' and defines the final diagnosis date based on all the inclusion event types.
#' Keeps only those with 2 or more recorded inclusion events, regardless of the
#' type of these events (e.g. two elevated HbA1c tests will lead to inclusion as
#' well as one elevated HbA1c test followed by a purchase of glucose-lowering
#' drugs).
#'
#' @param inclusions Output from [join_inclusions()].
#' @param stable_inclusion_start_date The date from when the inclusion events
#'    from all sources are considered more 'stable' (e.g. time after the change
#'    in how medication drugs are labeled and how doctors actually regularly
#'    input the new change into the database).
#'
#' @returns The same type as the input data, default as a [tibble::tibble()],
#'   along with the `purchase_date`, `atc`, `contained_doses` columns from
#'   `exclude_pregnancy()`, and the `n_t1d_endocrinology`,
#'   `n_t2d_endocrinology`, `n_t1d_medical`, and `n_t2d_medical` columns from
#'   `include_diabetes_diagnoses()`. It also creates two new columns:
#'
#'   - `raw_inclusion_date`: Date of inclusion, which is the second
#'      earliest recorded event.
#'   - `stable_inclusion_date`: Date of inclusion of individuals included
#'      at least one year after the incorporation of inclusions based on
#'      glucose-lowering drug data (1998 onwards when using National Patient
#'      Register data for censoring of gestational diabetes). Limits the
#'      included cohort to only individuals with a valid date of inclusion
#'      (and thereby valid age at inclusion & duration of diabetes).
#'
#' @keywords internal
#' @inherit algorithm seealso
create_inclusion_dates <- function(inclusions, stable_inclusion_start_date = "1998-01-01") {
  inclusions |>
    # TODO: May need to consider more efficient ways than using group by.
    dplyr::group_by(.data$pnr) |>
    # Drop earliest date so only those with two or more events are included.
    dplyr::filter(.data$dates != min(.data$dates, na.rm = TRUE)) |>
    dplyr::mutate(
      # Earliest date in the rows for each individual.
      raw_inclusion_date = min(.data$dates, na.rm = TRUE),
      stable_inclusion_date = dplyr::if_else(
        .data$raw_inclusion_date < lubridate::as_date(stable_inclusion_start_date),
        NA,
        .data$raw_inclusion_date
      )
    ) |>
    dplyr::ungroup() |>
    dplyr::select(
      "pnr",
      "raw_inclusion_date",
      "stable_inclusion_date",

      # From `exclude_pregnancy()` via the GLD purchases
      # TODO: this might need to be renamed in a previous step, rather than here.
      "purchase_date" = "date",
      "atc",
      "contained_doses",

      # From `include_diabetes_diagnoses()`
      "n_t1d_endocrinology",
      "n_t2d_endocrinology",
      "n_t1d_medical",
      "n_t2d_medical"
    )
}

Classifying the diabetes type

The next step of the OSDC algorithm classifies individuals from the extracted diabetes population as having either T1D or T2D. As described in the vignette("design"), individuals not classified as T1D cases are classified as T2D cases.

As the diabetes type classification incorporates an evaluation of the time from diagnosis/inclusion to first subsequent purchase of insulin, the get_diabetes_type() function has to take the date of diagnosis and all purchases of GLD drugs (after censoring) as inputs. In addition, information on diabetes type-specific primary diagnoses from hospitals is also a requirement.

Thus, the function takes the following inputs from get_inclusion_date(), exclude_pregnancy(), and include_diabetes_diagnoses():

From get_inclusion_date(): Information on date of diagnosis of diabetes
- pnr
- raw_inclusion_date
- stable_inclusion_date
From exclude_pregnancy(): Information on historic GLD purchases:
- pnr: identifier variable
- date: dates of all purchases of GLD.
- atc: type of drug
- contained_doses: defined daily doses of drug contained in purchase
From include_diabetes_diagnoses(): Information on diabetes type-specific primary diagnoses from hospitals:
- pnr: identifier variable
- n_t1d_endocrinology: number of type 1 diabetes-specific primary diagnosis codes from endocrinological departments
- n_t2d_endocrinology: number of type 2 diabetes-specific primary diagnosis codes from endocrinological departments
- n_t1d_medical: number of type 1 diabetes-specific primary diagnosis codes from medical departments
- n_t2d_medical: number of type 2 diabetes-specific primary diagnosis codes from medical departments

For each pnr number, several helper functions are applied to these inputs to extract additional information from the censored GLD data and diagnoses to use for classification of diabetes type. All of these return a single value (TRUE, otherwise FALSE) for each individual:

get_only_insulin_purchases():
- Inputs passed from exclude_pregnancy():
  - atc
- Outputs:
  - only_insulin_purchases = TRUE if no purchases with atc starting with “A10A” are present
get_insulin_purchases_within_180_days()
- Inputs passed from exclude_pregnancy():
  - date & atc
- Inputs passed from get_inclusion_date():
  - raw_inclusion_date
- Outputs: TRUE If any purchases with atc starting with “A10A” have a date between 0 and 180 days higher than raw_inclusion_date
get_insulin_is_two_thirds_of_gld_doses()
- Inputs passed from exclude_pregnancy():
  - contained_doses & atc
- Outputs: TRUE If the sum of contained_doses of rows of atc starting with “A10A” (except “A10AE5”) is at least twice the sum of contained_doses of rows of atc starting with “A10B” or “A10AE5”
get_any_t1d_primary_diagnoses():
- Inputs passed from include_diabetes_diagnoses():
  - n_t1d_endocrinology & n_t1d_medical
- Outputs: TRUE if the combined sum of the inputs is 1 or above.
get_type_diagnoses_from_endocrinology():
- Inputs passed from include_diabetes_diagnoses():
  - n_t1d_endocrinology, n_t2d_endocrinology
- Outputs: type_diagnoses_from_endocrinology = TRUE if the combined sum of the inputs is 1 or above
get_type_diagnosis_majority():
- Inputs passed from include_diabetes_diagnoses():
  - n_t1d_endocrinology, n_t2d_endocrinology, n_t1d_medical & n_t2d_medical
- Inputs passed from get_type_diagnoses_from_endocrinology():
  - type_diagnoses_from_endocrinology
- Outputs: TRUE if type_diagnoses_from_endocrinology == TRUE and n_t1d_endocrinology is above n_t2d_endocrinology. Also TRUE if type_diagnoses_from_endocrinology = FALSE and n_t1d_medical is above n_t2d_medical

get_diabetes_type() evaluates all the outputs from the helper functions to define diabetes type for each individual. Diabetes type is classified as “T1D” if:

only_insulin_purchases == TRUE & any_t1d_primary_diagnoses == TRUE
Or only_insulin_purchases == FALSE & any_t1d_primary_diagnoses == TRUE & type_diagnosis_majority == TRUE & insulin_is_two_thirds_of_gld_doses == TRUE & insulin_purchases_within_180_days == TRUE

get_diabetes_type() returns a data.frame with one row per pnr number and four columns: pnr, stable_inclusion_date, raw_inclusion_date & diabetes_type. This is the final product of the OSDC algorithm. See the vignette("design") for an more detail on the two inclusion dates and their intended use-cases.

Flow of functions for classifying diabetes status using the osdc package.

Type 1 classification

The details for the classification of type 1 diabetes is described in vignette("design"). To classify whether an individual has T1D, the OSDC algorithm includes the following criteria:

get_t1d_primary_diagnosis(), which relies on the hospital diagnoses extracted from lpr_diag (LPR2) and diagnoser (LPR3) in the previous steps.
get_only_insulin_purchases() which relies on the GLD purchases from Lægemiddeldatabasen to get patients where all GLD purchases are insulin only.
get_majority_of_t1d_diagnoses() (as compared to T2D diagnoses) which again relies on primary hospital diagnoses from LPR.
get_insulin_purchase_within_180_days() which relies on both diagnosis from LPR and GLD purchases from Lægemiddeldatabasen.
get_insulin_is_two_thirds_of_gld_doses which relies on the GLD purchases from Lægemiddeldatabasen.

Note the following hierarchy in first function above: First, the function checks whether the individual has primary diagnoses from endocrinological specialty. If that’s the case for a given person, the check of whether they have a majority of T1D primary diagnoses are based on data from endocrinological specialty. If that’s not the case, the check will be based on primary diagnoses from medical specialties.

Type 2 classification

As described in the vignette("design"), individuals not classified as type 1 cases are classified as type 2 cases.