Functions: overview

What each function does - explained for those who have never coded before

Published

July 21, 2026

Know the function name? Use Ctrl+F (Windows) or Cmd+F (Mac) and search directly on the page.

This is the data-handling toolkit (extract → clean → reshape → prepare). Statistical models - glm, coxph, lm … - are explained in Phase 13 where they’re used.

Jump directly to

Topic	Section
The pipe `%>%`	The pipe
Load register / open parquet file	Data loading
Filter rows	Rows: filter and deduplicate
Select or add columns	Columns: select, add, transform
Rename columns	Column names
Group, count and aggregate	Groups and aggregation
Join two tables	Join - assemble tables
Pivot (wide ↔︎ long format)	Reshaping
`collect()` and lazy evaluation	Lazy query
Dates and text	Text operations · Numbers and dates
Debugging	Diagnostics and debugging

The most important functions - start here

If you understand the 6 functions below, you can read most lines of register data code.

Function	What it does
`read_register("name")`	Opens a lazy connection to a register (without fastreg: `open_dataset("path")`)
`filter(condition)`	Keeps only the rows that satisfy the condition
`select(col1, col2)`	Keeps only the specified columns
`mutate(new_col = ...)`	Adds or changes a column
`left_join(other_df, by = "key")`	Joins two tables on a shared key
`collect()`	Fetches data from parquet into R’s memory

The order is intentional: read_register/open_dataset → filter → select → collect is the pattern that appears in almost every script.

What is a function?

Think of a function as a machine on an assembly line. You send something in one end, it does something with it, and you get something new out the other end. filter(), for example, is a sieve: you send a large table in, specify which rows you want to keep, and get a smaller table out.

All functions in R are written with parentheses: functionname(what_is_sent_in).

What you write inside the parentheses is called the arguments. Many functions have named arguments of the form name = value: on the left of = is the argument’s name (fixed by the function, cannot be changed), and on the right is the value you supply, e.g. ratio = 5 or na.rm = TRUE. You do not need to memorize them: ?functionname shows which arguments a function takes and what they do.

Want to see what a function does? Place your cursor inside the function name and press F1 - the help page opens with description, arguments and examples directly in RStudio’s Help panel. You can also type ?functionname or help(functionname) in the console, or args(functionname) for a quick list of arguments.

For packages on CRAN there is an online page with documentation and vignettes (e.g. https://cran.r-project.org/package=MatchIt). Some DST packages (e.g. heaven) live on GitHub instead - use ?functionname for the arguments and confirm that the package is available in your own project.

The pipe - `%>%`

The most important symbol in all the code is %>%, “the pipe”.

lpr_adm %>%
  semi_join(tibble(pnr = my_pnrs), by = "pnr") %>%
  select(pnr, recnum) %>%
  collect()

What it does: the pipe passes the result from the line on the left forward as the first argument to the function on the right.

The two ways of writing do the same thing:

# Without the pipe - from inside out, like Russian dolls:
collect(select(
  semi_join(lpr_adm, tibble(pnr = my_pnrs), by = "pnr"),
  pnr,
  recnum
))

# With the pipe - from top to bottom, like a recipe:
lpr_adm %>%
  semi_join(tibble(pnr = my_pnrs), by = "pnr") %>%
  select(pnr, recnum) %>%
  collect()

Both versions give exactly the same result. The pipe version is easier to read because you can follow the steps in order - and easier to debug because you can add or remove one step at a time.

A line break after %>% is not required - it is only for readability. You can write the whole chain on one line (bef %>% filter(...) %>% collect()) or split it with one step per line. The only rule: if you break the line, %>% must sit at the end of the line, not the start of the next one. R reads line by line, so a trailing %>% signals “more is coming”:

# Works - %>% at the end of the line:
bef <- bef %>%
  filter(year == 2015) %>%
  collect()

# Fails - R thinks the expression ended after "bef":
bef <- bef
  %>% filter(year == 2015)

The same applies to |> and + in ggplot2.

Analogy: Imagine cooking a meal. You chop the onions - and pass them on to the pot - which passes its contents on to the plate. The pipe does exactly that: it chains steps together so you can read the code from top to bottom like a recipe.

%>% and |> do the same job - they are two different ways of writing the pipe.

%>% comes from the magrittr package and is available via library(dplyr). |> is a built-in version introduced in R 4.1 - it requires no package.

The two work identically in almost all situations. You will see both in R code online. The project uses %>%, but if you write |> that is perfectly fine.
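To make the equivalence concrete, here is a minimal sketch on a small made-up table (bef_demo and its values are hypothetical, not register data). It runs the same two steps nested, with %>% and with |>, and checks that the three results match.

```r
library(dplyr)

# Hypothetical toy table standing in for a register
bef_demo <- tibble(
  pnr  = c("a", "b", "c"),
  year = c(2014, 2015, 2015)
)

# Nested: read from the inside out
nested <- select(filter(bef_demo, year == 2015), pnr)

# magrittr pipe: read from top to bottom
with_magrittr <- bef_demo %>%
  filter(year == 2015) %>%
  select(pnr)

# Native pipe (R 4.1 or newer): same steps, the pipe itself needs no package
with_native <- bef_demo |>
  filter(year == 2015) |>
  select(pnr)

identical(nested, with_magrittr) # TRUE
identical(with_magrittr, with_native) # TRUE
```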

Data loading

`open_dataset("path")` - `arrow`

What it does: opens a lazy connection to a parquet file or folder.

Analogy: Imagine calling the library and asking them to find all books about cardiac surgery from 1990 to 2020. The librarian says “yes, I’ll find those” - but they have not arrived yet. That is exactly what open_dataset() does: it tells the computer what you want, but the data has not been fetched into memory yet. The rest of your commands (filter, select) add further instructions, before you finally say “send them now” - that is collect().

With fastreg - the register is read by name:

library(fastreg)
bef <- read_register("bef") %>% rename_with(tolower) # by name - fastreg knows the path

read_register() finds the path from your project config (set once) - see Phase 4. It returns a DuckDB connection.

Without fastreg - you give the full path to the parquet folder:

bef <- open_dataset(
  "E:/workdata/[projectnumber]/cleaned-data/parquet-registers/bef/"
) %>%
  rename_with(tolower) # standardise column names to lowercase

The confirmed paths for your project are in Overview of registers.

With locally saved synthetic data:

bef <- open_dataset("synth_data/bef/") %>% rename_with(tolower) # path to locally saved synthetic register

Generate and save synthetic data locally before use - see Phase 6 - First extraction.

Used for all registers stored as parquet files: bef, lpr_adm, lpr_diag, lmdb, dodsaars, vnds, udda, faik, akm, t_psyk_adm, t_psyk_diag, lpr_a_kontakt, lpr_a_diagnose and more.

Arrow or DuckDB? open_dataset() returns an Arrow connection. Arrow is fast, but it does not support every dplyr function: if a step fails with an “unsupported function” error, switching to DuckDB usually fixes it, because DuckDB supports almost all dplyr verbs. Pipe through to_duckdb():

library(arrow) # open_dataset, to_duckdb
library(dplyr) # %>%, rename_with
bef <- open_dataset("path/to/bef/") %>%
  to_duckdb() %>% # hand the data over to DuckDB
  rename_with(tolower)

You do not need this step with read_register() (fastreg) - it hands you a DuckDB connection already, so the to_duckdb() conversion is built in. For when to use which, see Phase 5 - Arrow vs. DuckDB.

`collect()`

What it does: executes the lazy query and pulls data into R’s memory.

This is the point where the librarian actually brings the books to you. Call it late - after all filter() and select() steps - so only the necessary data is moved.

result <- large_register %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") %>% # only the cohort's pnr's
  select(pnr, d_inddto) %>% # only the two columns we use
  collect() # data is pulled into R's memory

The concept of lazy evaluation - why data is not in memory before collect() - is explained in detail on the page Extracting data step by step.

`readRDS("path/file.rds")`

What it does: reads a saved R file from disk into memory.

Analogy: It is like opening a Word file you saved last time. The .rds format is R’s own save format - faster and more compact than CSV. It is used to pass data from one script to the next in the pipeline.

full_cohort <- readRDS("sti/til/full_cohort.rds") # fetch saved dataset from disk

`saveRDS(object, "path/file.rds")`

What it does: saves an R object to disk.

The opposite of readRDS(). Each pipeline script saves its result with saveRDS(), so the next script can fetch it.

saveRDS(full_cohort, "sti/til/full_cohort.rds") # save dataset to disk

`haven::read_sas("file.sas7bdat")`

What it does: reads a SAS data file into R.

Used to load SAS files - e.g. DST’s format tables or registers not yet converted to parquet. See File types and Format tables for practical examples.

`arrow::write_parquet(df, "path/file.parquet")`

What it does: saves a data frame as a parquet file.

Parquet is a particularly efficient file format for large datasets - it is far faster to read than CSV. Used when you want to save an R dataset as a parquet file, e.g. for use with open_dataset() or read_register() in another script.

`file.path(folder, "filename.rds")`

What it does: correctly assembles a folder path and filename into a full path.

Analogy: Think of it as writing an address: file.path("C:/data", "results.rds") gives "C:/data/results.rds". It is safer than just pasting strings together with paste0(), because it handles slashes correctly on all operating systems.

`file.exists("path")`

What it does: returns TRUE or FALSE - does the file exist?

Used to give an understandable error message before the code tries to open a file that might not be there.

`dir.create(path, showWarnings = FALSE, recursive = TRUE)`

What it does: creates a folder if it does not already exist.

recursive = TRUE also creates any parent folders. showWarnings = FALSE suppresses the harmless warning you would otherwise get if the folder already exists.
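The file helpers above are usually combined when one pipeline script hands a result to the next. A minimal sketch, assuming a throwaway folder under tempdir() (the folder names and the demo_cohort table are hypothetical):

```r
# Build the output folder and file path (hypothetical names, under a temp folder)
out_dir  <- file.path(tempdir(), "pipeline_demo", "step1")
out_file <- file.path(out_dir, "full_cohort.rds")

# Create the folder, including the parent "pipeline_demo"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Script 1: save the result
demo_cohort <- data.frame(pnr = c("a", "b"), surgery_year = c(2015L, 2016L))
saveRDS(demo_cohort, out_file)

# Script 2: check that the file is there before reading it
if (!file.exists(out_file)) {
  stop("Cannot find ", out_file, " - run the previous script first.")
}
reloaded <- readRDS(out_file)

identical(demo_cohort, reloaded) # TRUE
```

Building the path once with file.path() and reusing it for both saveRDS() and readRDS() avoids the two scripts drifting apart.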

Column names

`rename_with(tolower)`

What it does: converts ALL column names to lower case at once.

Analogy: Imagine receiving a patient list from five different departments. One writes “CPR”, another “cpr”, a third “Cpr”. It is the same thing - but the computer treats them as three entirely different things. rename_with(tolower) fixes it in one second: all names become consistent.

bef <- open_dataset("path/to/bef") %>% rename_with(tolower) # all columns become lowercase: pnr, koen, foed_dag, ...

Important: always call it on the same line as open_dataset() (or read_register() in DARTER). If you forget it, semi_join(..., by = "pnr") will fail because the column might be called PNR.

`rename(new_name = old_name)`

What it does: renames one or more specific columns.

rename(surgery_date = index_date) # index_date is renamed to surgery_date

`names(df)`

What it does: shows all column names in a data frame.

Useful when you are not sure what a table contains.

names(full_cohort) # prints all column names
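The three column-name functions can be tried together on a small in-memory table. This is only a sketch: the table raw and the new name birth_date are made up for illustration.

```r
library(dplyr)

# Hypothetical table with inconsistent capitalisation
raw <- tibble(
  PNR      = c("a", "b"),
  KOEN     = c(1L, 2L),
  Foed_Dag = as.Date(c("1970-01-01", "1980-05-05"))
)
names(raw) # "PNR" "KOEN" "Foed_Dag"

clean <- raw %>%
  rename_with(tolower) %>%       # all names to lowercase
  rename(birth_date = foed_dag)  # new_name = old_name

names(clean) # "pnr" "koen" "birth_date"
```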

Columns: select, add, transform

`select(col1, col2, new = old, -remove)`

What it does: keeps only the columns you name.

Analogy: You have a large Excel file with 50 columns. You only need 3. select() is like saving a copy with only the three columns you need. It reduces the amount of data pulled from the server, and is one of the reasons the code is fast.

select(pnr, recnum, date = d_inddto) # keeps three columns; d_inddto is renamed to date
select(-aar) # removes the column aar; keeps all others

When renaming inside select(), the direction is new_name = old_name - the new name on the left, the existing column name on the right.

`mutate(new_column = expression)`

What it does: adds a new column (or modifies an existing one), computed from the columns you already have. The number of rows is the same - mutate() adds information, does not remove rows. (df is just the name of your data frame - call it whatever you like.)

library(dplyr)

df <- df %>%
  mutate(
    bmi = weight / (height^2), # new column computed from two existing ones
    obesity = bmi >= 30 # TRUE/FALSE - a newly created column is usable right away
  )

You can create several variables in one call, and a variable you just made is immediately available further down (here obesity_class uses the just-computed bmi):

df <- df %>%
  mutate(
    bmi = weight / (height^2),
    obesity_class = case_when(
      # case_when: several conditions; the FIRST true one wins
      is.na(bmi) ~ NA_character_, # missing BMI stays missing instead of falling through to "Obesity"
      bmi < 25 ~ "Normal",
      bmi < 30 ~ "Overweight",
      TRUE ~ "Obesity" # TRUE = "everything else"
    )
  )

Check the units first. BMI = weight (kg) / height (m)². If height is in cm, weight / (height^2) is wrong. Convert height to metres in a mutate() step before you compute BMI:

df <- df %>%
  mutate(height_m = height_cm / 100) %>% # cm -> m FIRST
  mutate(bmi = weight / (height_m^2)) # ... then BMI is correct

case_when() has its own entry below. More examples of mutate():

mutate(icd3 = substr(c_diag, 2, 4)) # new column: ICD code without D-prefix, 3 characters
mutate(birth_year = year(foed_dag)) # birth year from date of birth

`transmute(new_column = expression)`

What it does: like mutate(), but keeps only the columns you name - the rest are dropped. Handy when you want a clean table with exactly the columns a function needs.

df %>% mutate(event = 1L) # keeps ALL columns + the new 'event'
df %>% transmute(pnr, event = 1L) # keeps ONLY 'pnr' and 'event'

`case_when(condition1 ~ value1, condition2 ~ value2, TRUE ~ default)`

What it does: an advanced if-else with many conditions.

Analogy: Think of a traffic light: is it red → stop, is it yellow → be careful, is it green → go. case_when() works the same way: conditions are evaluated in order, and the first condition that is true determines the result. TRUE ~ default is the “in all other cases” arm.

mutate(
  education_cat = case_when(
    edu_level == 1 ~ "Short",
    edu_level == 2 ~ "Medium",
    edu_level == 3 ~ "Long",
    TRUE ~ "Unknown"
  )
)

`if_else(condition, true_value, false_value)`

What it does: a simple two-way condition.

if_else(is.na(death_date), 0L, 1L) # 1 if the person has died, 0 if not

Stricter than base R’s ifelse() - both values must have compatible types.

0L and 1L are integers in R. The L suffix is R’s way of specifying that a number is an integer rather than a decimal (0 and 1 without L are double by default). if_else() refuses to combine incompatible types such as a number and a text string. Older dplyr versions (before 1.1.0) also reject a mix of integer and double, so use either 0L/1L (integer) or 0/1 (double) consistently rather than a mix.

`coalesce(x, replacement)`

What it does: replaces NA values with another value.

Analogy: After a left join, many persons will have NA in flag columns because they did not have that condition. coalesce(mi_flag, 0L) says: “if mi_flag is NA, set it to 0 instead”. Used systematically after each left join that produces flag columns.

coalesce(mi_flag, 0L) # NA (= not found) -> 0 (= absent)

`across(columns, function)`

What it does: applies the same function to many columns at once inside mutate().

mutate(across(all_of(nmi_variables), ~ coalesce(.x, 0L))) # replace NA with 0 in all nmi flag columns

~ coalesce(.x, 0L) is an anonymous function: ~ means “function of”, and .x is the current column. It is equivalent to function(x) coalesce(x, 0L) - but shorter. across() calls this function once per column in nmi_variables.

`rowSums(matrix, na.rm = TRUE)`

What it does: sums the values in each row across columns.

Used in two places:

NMI score (Nordic Multimorbidity Index): sums the product of 0/1 flag columns and their individual weights → one weighted comorbidity score per person. A patient with cardiovascular disease and cancer scores higher than a patient with two milder conditions.
Multimorbidity count: sums all 0/1 flags → simple count of the number of conditions per person.
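The two uses of rowSums() can be sketched on a toy flag table, combined with the across() + coalesce() step from above. Everything here is illustrative: the table flags, the column list flag_cols and in particular demo_weights are made-up values, not the real NMI weights.

```r
library(dplyr)

# Hypothetical flag table: NA means "no record found"
flags <- tibble(
  pnr      = c("a", "b", "c"),
  mi       = c(1L, NA, 1L),
  stroke   = c(NA, NA, 1L),
  diabetes = c(1L, 0L, NA)
)
flag_cols    <- c("mi", "stroke", "diabetes")
demo_weights <- c(mi = 2, stroke = 3, diabetes = 1) # made-up weights, NOT the NMI weights

# Step 1: replace NA with 0 in all flag columns
filled <- flags %>%
  mutate(across(all_of(flag_cols), ~ coalesce(.x, 0L)))

# Step 2: turn the flag columns into a matrix and sum per row
flag_matrix <- as.matrix(filled[flag_cols])

# Simple count: number of conditions per person
filled$n_conditions <- rowSums(flag_matrix)

# Weighted score: multiply each column by its weight, then sum per row
filled$weighted_score <- rowSums(sweep(flag_matrix, 2, demo_weights[flag_cols], "*"))

filled
```

Indexing the weights with flag_cols keeps the weights in the same order as the matrix columns, which matters if the two were defined in different orders.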

Rows: filter and deduplicate

`filter(condition1, condition2, ...)`

What it does: keeps only the rows that satisfy the conditions.

Analogy: You have a patient list and only want to see women over 50. filter(koen == 2, alder > 50) is your sieve - everything else is removed.

Signs for comparing a column with a value:

Sign	Means	Example
`==`	equal to (two equals signs - a single `=` is assignment)	`filter(year == 2015)`
`!=`	not equal to	`filter(koen != 2)`
`>` · `>=`	greater than · greater than or equal to	`filter(alder >= 18)`
`<` · `<=`	less than · less than or equal to	`filter(alder < 65)`
`%in%`	is in the list (see below)	`filter(icd3 %in% c("F00","F03"))`

Signs for combining several conditions:

Sign	Means	Example
`,` or `&`	AND - both must be true	`filter(koen == 2, alder > 50)` → women and over 50
`|`	OR - at least one true	`filter(icd3 == "F00" | icd3 == "F03")` → F00 or F03
`!`	NOT - invert the condition	`filter(!is.na(alder))` → keep rows where alder is not missing

Pitfall - never mix AND and OR without parentheses. & binds before |, so filter(alder > 50 | alder < 18 & koen == 2) reads as alder > 50 | (alder < 18 & koen == 2). If you mean “(over 50 or under 18) and woman”, add parentheses: filter((alder > 50 | alder < 18) & koen == 2).

filter(c_diagtype %in% c("A", "B")) # only action and secondary diagnoses
filter(date_contact >= surgery_date) # only post-operative contacts
filter(icd3 %in% c("G30", "F00", "F03")) # only dementia codes
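The AND/OR pitfall is easiest to see on a handful of rows. A small sketch with a made-up table (people and its values are hypothetical) that runs the same condition with and without parentheses:

```r
library(dplyr)

# Hypothetical persons: koen 2 = woman, as in the examples above
people <- tibble(
  pnr   = c("a", "b", "c", "d"),
  koen  = c(1, 2, 2, 1),
  alder = c(60, 60, 10, 10)
)

# Without parentheses: read as alder > 50 | (alder < 18 & koen == 2)
no_parens <- people %>%
  filter(alder > 50 | alder < 18 & koen == 2)

# With parentheses: (over 50 or under 18) and woman
with_parens <- people %>%
  filter((alder > 50 | alder < 18) & koen == 2)

no_parens$pnr   # "a" "b" "c" - the 60-year-old man slips through
with_parens$pnr # "b" "c"     - women only
```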

`%in%` - “is in the list”

What it does: checks whether each element on the left side appears in the vector on the right side. Returns TRUE or FALSE for each element.

%in% is just one of filter()’s comparison signs (alongside ==, >, <, etc.). You need it only when you want to match against a list of values rather than a single value - e.g. a code list. To match pnr against the cohort, use semi_join() (see Join - assemble tables).

Analogy: Imagine a guest list for a party. icd3 %in% c("G30", "F00", "F03") is like standing at the entrance and checking: “is this diagnosis code on the list?”

icd3 %in% c("G30", "F00", "F03") # TRUE for these three codes, FALSE for everything else
atc %in% !!my_atc # TRUE for all ATC codes on your local list (!! - see below)

You will see %in% in almost every filter() call in the project.

To filter on pnr against the whole cohort, use semi_join(tibble(pnr = cohort_pnrs), by = "pnr") instead of filter(pnr %in% ...) - it pushes down into the database more efficiently and needs no !!.

`distinct(col1, col2)`

What it does: removes duplicates - keeps only unique combinations.

Analogy: A person may have received the diagnosis F00 ten times. You only need to know whether they ever had it. distinct(pnr, icd3) reduces it to one row per person per code. distinct(pnr) gives you just the list of unique person IDs.

`slice(n)`

What it does: keeps only the nth row within each group (used after group_by()).

group_by(pnr) %>% # group per person
  arrange(desc(aar)) %>% # newest year first
  slice(1) # keeps the newest record per person

Groups and aggregation

`group_by(col1, col2)`

What it does: divides data into groups so subsequent operations happen separately within each group.

Analogy: Imagine you have a stack of patient records and sort them into piles by ID number. group_by(pnr) does exactly that - but only in memory. All subsequent steps (arrange, slice, summarise) now happen one pile at a time.

group_by(pnr) %>% # group per person
  arrange(date_contact) %>% # oldest date first
  slice(1) %>% # earliest contact per person
  ungroup() # remove the grouping afterwards

`ungroup()`

What it does: removes the grouping.

Important: always call ungroup() after you are done with group_by(). If you forget it, data remains grouped, and later operations can behave unexpectedly.

`arrange(column)` / `arrange(desc(column))`

What it does: sorts rows ascending (default) or descending (desc()).

Typically used with group_by() %>% slice(1) to find the first or newest record per person.

`summarise(new_col = function(col), .groups = "drop")`

What it does: reduces each group to one summary row.

group_by(pnr) %>% # group per person
  summarise(first_e66 = min(date_contact, na.rm = TRUE), .groups = "drop") # earliest E66 date per person, ungrouped

.groups = "drop" removes the grouping automatically afterwards.
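There are two common ways to get "the earliest contact per person": sort and slice within groups, or summarise with min(). A minimal sketch on made-up in-memory data (contacts and its dates are hypothetical) showing that both give the same dates:

```r
library(dplyr)

# Hypothetical contacts: two for person a, three for person b
contacts <- tibble(
  pnr = c("a", "a", "b", "b", "b"),
  date_contact = as.Date(c(
    "2020-05-01", "2019-03-10",
    "2021-01-15", "2021-01-02", "2022-07-30"
  ))
)

# Way 1: sort within each person, keep the first row (keeps all columns)
first_by_slice <- contacts %>%
  group_by(pnr) %>%
  arrange(date_contact, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

# Way 2: one summary row per person (keeps only pnr and the new column)
first_by_summarise <- contacts %>%
  group_by(pnr) %>%
  summarise(first_date = min(date_contact, na.rm = TRUE), .groups = "drop")

all(first_by_slice$date_contact == first_by_summarise$first_date) # TRUE
```

Way 1 is useful when you need the other columns from the earliest record; way 2 is shorter when the date itself is all you need.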

`ntile(x, n)`

What it does: divides rows into n equal groups (quantiles).

Not used for income quintiles in register-based studies following SEPLINE guidelines. SEPLINE recommends comparing against population-specific cut-points (Q20/Q40/Q60/Q80) stratified by sex × 5-year age group × reference year - not ntile() on your cohort alone. See SEPLINE.

Join - assemble tables

This is one of the things that takes the longest to understand, but is essential for all register work. A join puts two tables together based on a shared key - typically pnr.

`inner_join(y, by = "key")`

What it does: keeps only rows that exist in BOTH tables.

Analogy: It is like a VIP list at the entrance. You must be on BOTH lists to get in. Used when a match is meaningful - e.g. inner_join(bs_cohort) keeps only hospital contacts for persons who are actually in the study.

inner_join(bs_cohort %>% select(pnr, surgery_date), by = "pnr") # only contacts from BS cohort members

`semi_join(y, by = "key")`

What it does: keeps the rows from the left table that have a match in y - but adds no columns from y. So it is a filter where the list of allowed values lives in another table.

Analogy: You have all of LPR and a list of your cohort. semi_join keeps only the LPR rows whose pnr is on the cohort list - the same result as filter(pnr %in% ...), but faster and more reliable on lazy (Arrow/DuckDB) tables.

lpr_adm %>%
  semi_join(tibble(pnr = cohort_pnrs), by = "pnr") # keep only the cohort's rows

Where is the local list? The list itself is the tibble you pass as the second argument (y); by = "pnr" only says which column to match on - not where the list is. Because y is an ordinary table, you do not need !!. That is the difference from filter(pnr %in% !!cohort_pnrs), where the vector sits directly in the condition and !! is what injects it. Use semi_join to filter a register down to the cohort.

`left_join(y, by = "key")`

What it does: keeps all rows from the left table. Rows without a match in the right table get NA for the columns that came from the right.

Analogy: It is like checking whether your patients have a particular finding, without discarding any of them. All patients are still there - those with the finding have a date, those without have NA. Used everywhere when adding flags and covariates to the cohort.

full_cohort %>% # start with all cohort members
  left_join(dementia, by = "pnr") # all retained; only those with dementia get a date
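The difference between the three joins shows up clearly when they are run on the same two small tables. A sketch with made-up data (the pnr values, dates and the column dementia_flag are hypothetical):

```r
library(dplyr)

# Hypothetical cohort of three persons
full_cohort <- tibble(pnr = c("a", "b", "c"))

# Hypothetical dementia table: b is in the cohort, z is not
dementia <- tibble(
  pnr           = c("b", "z"),
  date_dementia = as.Date(c("2018-06-01", "2019-01-01"))
)

# inner_join: only persons found in BOTH tables, columns from both
inner <- full_cohort %>% inner_join(dementia, by = "pnr")

# semi_join: same rows as inner_join here, but no columns added from dementia
semi <- full_cohort %>% semi_join(dementia, by = "pnr")

# left_join: everyone in the cohort is kept; a and c get NA as date
left <- full_cohort %>%
  left_join(dementia, by = "pnr") %>%
  mutate(dementia_flag = if_else(is.na(date_dementia), 0L, 1L))

nrow(inner)        # 1
names(semi)        # "pnr"
left$dementia_flag # 0 1 0
```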

`bind_rows(df1, df2, ...)`

What it does: stacks data frames on top of each other (same columns, more rows).

Analogy: Like taking three piles of paper and putting them in one pile. Used e.g. to combine LPR2 + psychiatric LPR2 + LPR3 into one combined diagnosis table.

bind_rows(lpr2_results, lpr2_psyk_results, lpr3_results) # combine all three source tables

Reshaping

`pivot_wider(names_from = col, values_from = col)`

What it does: transforms a long format (one row per visit) to a wide format (one row per person with one column per time point).

Analogy: Imagine a patient with five weigh-in visits - all in the same column with five rows. pivot_wider() transforms it into one single row with five columns: weight_3mo, weight_6mo, weight_12mo, etc. Used in extraction of weight and insulin outcomes.
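A minimal sketch of the weigh-in example, using a made-up long table (weights_long, the visit labels and the weights are hypothetical). The names_prefix argument is used here to get column names of the form weight_3mo.

```r
library(dplyr)
library(tidyr)

# Hypothetical long format: one row per person per visit
weights_long <- tibble(
  pnr    = c("a", "a", "a", "b", "b"),
  visit  = c("3mo", "6mo", "12mo", "3mo", "12mo"),
  weight = c(110, 102, 95, 120, 108)
)

# Wide format: one row per person, one column per visit
weights_wide <- weights_long %>%
  pivot_wider(
    names_from   = visit,     # the values of 'visit' become column names
    values_from  = weight,    # the cells are filled from 'weight'
    names_prefix = "weight_"
  )

names(weights_wide) # "pnr" "weight_3mo" "weight_6mo" "weight_12mo"
```

Person b has no 6-month visit, so weight_6mo is NA for that row - a missing visit becomes a missing cell.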

Lazy query

`!!` (bang-bang, two exclamation marks)

What it does: injects a local R variable into a DuckDB/dplyr query.

Analogy: Imagine asking an assistant to find all rows with a code from a list. If you say “find all with a code from the list my_atc”, the assistant will look for a column in the database with that name - and it does not exist. You must say: “find all with a code from this list” and hold the list up. !! is the equivalent of holding the list up.

filter(atc %in% !!my_atc) # !! says: "my_atc is an R vector, not a column name"

You will see !! in front of local R variables (typically code and year lists) inside filter() calls.

For pnr filtering against the cohort, use semi_join(tibble(pnr = cohort_pnrs), by = "pnr"), which takes the local table directly and needs no !!.
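Why the injection matters is clearest when a local variable happens to share its name with a column. The sketch below uses an in-memory table (rx and its ATC codes are made up), where !! behaves the same way as in a lazy query:

```r
library(dplyr)

# Hypothetical prescription table
rx <- tibble(
  pnr = c("a", "b", "c"),
  atc = c("A10BA02", "C10AA01", "N02BE01")
)

# Local vector that happens to have the same name as the column
atc <- c("A10BA02")

# Without !!: 'atc' on both sides means the COLUMN, so every row matches itself
without_injection <- rx %>% filter(atc %in% atc)

# With !!: the right-hand side is the local R vector
with_injection <- rx %>% filter(atc %in% !!atc)

nrow(without_injection) # 3
nrow(with_injection)    # 1
```

Giving local lists distinct names such as my_atc avoids the clash in the first place; !! then makes the intent explicit.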

`!!column_name := value` (inside `mutate`)

What it does: creates a column whose NAME is determined by an R variable - not written directly in the code.

Normally you write a fixed column name to the left of = in mutate():

mutate(mi = 1L) # always creates a column called "mi"

But in the NMI calculation (Nordic Multimorbidity Index) we loop over a list of chronic conditions ("mi", "stroke", "diabetes", …) and want to create one column per condition. The column name is therefore stored in a variable:

condition_name <- "mi" # the variable contains the name as a string

mutate(!!condition_name := 1L) # !! injects the variable's content: equivalent to mutate(mi = 1L)
# Next iteration: condition_name <- "stroke" → mutate(stroke = 1L)

Two things differ from normal:

!!: as in filter(), it injects the R variable’s contents instead of interpreting it as a column name.
:=: used instead of =, because R requires it when the left side of an assignment is dynamic. It is not possible to write mutate(!!name = 1L) - only mutate(!!name := 1L) works.

1L is an integer (see as.integer() / 1L) - flag columns are stored as integers to save memory.
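The loop described above can be sketched as follows. This is a simplified illustration, not the project's NMI code: the cohort, the list hits (which persons have which condition) and the condition names are all made up.

```r
library(dplyr)

# Hypothetical cohort and a hypothetical lookup: condition -> pnr's that have it
cohort <- tibble(pnr = c("a", "b", "c"))
hits <- list(
  mi     = c("a"),
  stroke = c("b", "c")
)

# One new 0/1 column per condition; the column name comes from the loop variable
for (condition_name in names(hits)) {
  cohort <- cohort %>%
    mutate(!!condition_name := as.integer(pnr %in% hits[[condition_name]]))
}

names(cohort) # "pnr" "mi" "stroke"
```

With plain = the column would be called condition_name in every iteration and would be overwritten each time.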

Text operations

`substr(string, start, end)`

What it does: extracts part of a text string based on character positions.

Analogy: You have the ICD code "DG30". DST has prepended a “D” - it does not belong in standard ICD-10. substr("DG30", 2, 4) says: “give me characters from position 2 to 4” and returns "G30".

substr(c_diag, 2, 4) # 3-digit code: "DG30" -> "G30"
substr(c_diag, 2, 5) # 4-digit code: "DI110" -> "I110"

`paste0(x, y)`

What it does: concatenates text strings without spaces.

paste0("C", 10:43) # creates "C10", "C11", "C12", ..., "C43"

Used for compact construction of ICD code lists.

`paste(x, y, sep = "_")`

What it does: concatenates text strings with a chosen separator.

paste(koen, birth_year, sep = "_") # creates e.g. "1_1975" as a matching key

`toupper(x)` / `tolower(x)`

What it does: converts text to upper or lower case respectively.

toupper(c_opr) # ensures procedure codes match regardless of capitalisation

`grepl(pattern, x)`

What it does: returns TRUE/FALSE for each element in x that matches a search pattern (regular expression).

Analogy: It is like “Ctrl+F” on a text document, but applied to entire columns at once. grepl("^C34", icd4) finds all 4-digit codes starting with C34.

grepl("^C34", icd4) # TRUE for "C340", "C341", "C342", etc.

Used e.g. to match ICD codes against diagnosis patterns in comorbidity measures such as NMI (Nordic Multimorbidity Index).
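The text functions are often used together when building and matching code lists. A minimal base R sketch on a few made-up diagnosis codes (the vector c_diag is hypothetical):

```r
# Hypothetical DST-style diagnosis codes with the leading "D"
c_diag <- c("DG30", "DI110", "DC349", "DF03")

icd3 <- substr(c_diag, 2, 4) # "G30" "I11" "C34" "F03"
icd4 <- substr(c_diag, 2, 5) # "G30" "I110" "C349" "F03"

# Build a code list compactly and match against it
cancer_codes <- paste0("C", 10:43) # "C10", "C11", ..., "C43"
is_cancer    <- icd3 %in% cancer_codes

# Pattern match: codes that START with C34
is_c34 <- grepl("^C34", icd4)

is_cancer # FALSE FALSE  TRUE FALSE
is_c34    # FALSE FALSE  TRUE FALSE
```

Note that substr() simply returns fewer characters when the code is shorter than the requested range, so "DG30" gives "G30" in both columns.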

Numbers and dates

`as.Date(x)`

What it does: converts text or datetime to a simple date object.

DST stores some date-times as "2021-03-15 14:32:00". as.Date() removes the time parts and gives a clean calendar date.

as.Date(kont_starttidspunkt) # "2021-03-15 14:32:00" -> 2021-03-15

`ymd(x)` / `dmy(x)` (from `lubridate`)

What it does: reads dates that are not in ISO form ("2021-03-15"). as.Date() expects year-month-day order: on e.g. "15/03/2021" or "15-03-2021" it does not give the right date unless you supply a format, because it tries to read the leading "15" as the year. The lubridate parsers take the order from the function name (ymd = year-month-day, dmy = day-month-year, mdy = month-day-year).

library(lubridate)
ymd("2021-03-15") # year-month-day -> 2021-03-15
dmy("15/03/2021") # day-month-year -> 2021-03-15  (as.Date would misread it without a format)

`year(date)` (from `lubridate`)

What it does: extracts the year from a date.

year(surgery_date) # 2021-03-15 -> 2021

`difftime(date1, date2, units = "days")`

What it does: calculates the difference between two dates.

as.numeric(difftime(surgery_date, foed_dag, units = "days")) / 365.25 # age at surgery in years

`min(x, na.rm = TRUE)` / `max(x, na.rm = TRUE)`

What it does: finds the smallest/largest element in a vector and ignores NA.

summarise(first_date = min(date_contact, na.rm = TRUE)) # earliest contact date per person

`pmin(x, y)` / `pmax(x, y)`

What it does: compares two vectors position by position and returns the smallest/largest for each element.

Analogy: Imagine two lists of dates - date of death and study end date. pmin(death_date, study_end) selects for each person what came first.

pmin(death_date, as.Date("2024-12-31"), na.rm = TRUE) # censoring date: death or study end, whichever comes first; na.rm = TRUE gives persons without a death date the study end instead of NA
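The date functions above are typically combined into an end date, an event flag and a follow-up time. A simplified base R sketch for data in R's memory, with made-up dates (index_date, death_date and study_end are hypothetical values):

```r
# Hypothetical data: person 1 dies during follow-up, person 2 has no death date
index_date <- as.Date(c("2015-01-01", "2016-06-15"))
death_date <- as.Date(c("2018-03-01", NA))
study_end  <- as.Date("2024-12-31")

# End of follow-up: death or study end, whichever comes first.
# na.rm = TRUE: a missing death date means "use the study end", not NA.
end_date <- pmin(death_date, study_end, na.rm = TRUE)

# Event flag: 1 if the person died on or before the study end, otherwise 0
event <- as.integer(!is.na(death_date) & death_date <= study_end)

# Follow-up time in years
followup_years <- as.numeric(difftime(end_date, index_date, units = "days")) / 365.25

data.frame(index_date, end_date, event, followup_years)
```

Without na.rm = TRUE, pmin() in R's memory returns NA for everyone who is still alive, which silently drops them from later time calculations.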

Dates across the guide

The functions above are the building blocks. The concrete date tasks are shown where they belong in the workflow:

Age at index (/365.25, leap years): Comparison cohort
Baseline year for annual registers (year(index_date) - 1): Socioeconomic variables
SAS integer dates (origin = "1960-01-01"): Pitfalls on DST
Follow-up time and event variable (pmin()): Joins
Start/stop format (time-varying covariate): Time-varying variables

`as.integer(x)` / `1L`

What it does: converts to integer.

The L suffix (e.g. 1L, 0L) specifies that it is an integer rather than a decimal. Flag columns are stored as integers (1L/0L) to save memory.

`is.na(x)`

What it does: returns TRUE for NA values (missing values).

filter(!is.na(pnr)) # remove rows without person ID
filter(!is.na(date_dementia)) # remove rows without dementia date

`set.seed(n)`

What it does: fixes the starting point for random number generation.

Analogy: Imagine shuffling a deck of cards. Without set.seed() you will shuffle differently every time. With set.seed(42) you always shuffle in the same way - and can thus reproduce your results exactly. Always call it before matching loops to ensure reproducibility.

set.seed(42) # fix random seed for reproducibility
sample(pool_pnrs, size = 5) # always selects the same 5 random pnr's

`sample(x, size)`

What it does: draws random elements from a vector.

Used in matching logic to select random control persons from a pool.
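A short sketch of what the seed buys you, using a made-up pool of ids (pool_pnrs is hypothetical): the same seed followed by the same sample() call gives the same draw.

```r
# Hypothetical pool of 100 candidate ids
pool_pnrs <- sprintf("p%03d", 1:100)

set.seed(42)
draw1 <- sample(pool_pnrs, size = 5)

set.seed(42) # same seed again ...
draw2 <- sample(pool_pnrs, size = 5)

draw3 <- sample(pool_pnrs, size = 5) # no new seed: continues the random sequence

identical(draw1, draw2) # TRUE - same seed, same draw
```

The seed fixes the whole sequence of random numbers, so the order of the calls after set.seed() matters too: inserting an extra sample() call earlier in the script changes every later draw.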

Lists and loops

`split(df, group_vector)`

What it does: splits a data frame into a list of sub-data-frames, one per unique group.

Analogy: Imagine sorting patient records into piles by year and sex. split(pool, paste(koen, birth_year, sep = "_")) gives you one pile per combination, so the matching code can work quickly within one pile at a time.

`vector("list", n)`

What it does: creates an empty list with space for n elements.

Pre-allocation is faster than letting R expand the list one element at a time in a loop.

`seq_len(n)`

What it does: generates the sequence 1, 2, …, n.

Safer than 1:n in loops, because it handles the case n = 0 correctly.

`unlist(list, use.names = FALSE)`

What it does: flattens a list of vectors into one long vector.
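The four helpers fit together in a matching loop. The sketch below is deliberately simplified and is not the project's matching code: the pool, the cases and the choice of 2 controls per case are made up, and it does not prevent the same control from being drawn for two cases.

```r
set.seed(42) # reproducible draws

# Hypothetical pool of 12 potential controls
pool <- data.frame(
  pnr        = sprintf("c%02d", 1:12),
  koen       = rep(c(1, 2), each = 6),
  birth_year = rep(c(1970, 1975, 1980), times = 4)
)

# One pile per sex x birth-year combination, e.g. "1_1970"
pool_by_stratum <- split(pool, paste(pool$koen, pool$birth_year, sep = "_"))

# Hypothetical cases to find controls for
cases <- data.frame(pnr = c("x1", "x2"), koen = c(1, 2), birth_year = c(1970, 1980))

# Pre-allocate one list slot per case, then fill it in the loop
matched <- vector("list", nrow(cases))
for (i in seq_len(nrow(cases))) {
  key        <- paste(cases$koen[i], cases$birth_year[i], sep = "_")
  candidates <- pool_by_stratum[[key]]$pnr
  matched[[i]] <- sample(candidates, size = 2) # 2 controls per case
}

# Flatten the list into one vector of control ids
control_pnrs <- unlist(matched, use.names = FALSE)
length(control_pnrs) # 4
```

seq_len(nrow(cases)) gives an empty sequence when there are no cases, so the loop body is simply skipped; 1:nrow(cases) would instead run with i = 1 and i = 0.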

Diagnostics and debugging

`class(x)`

What it does: tells you what type of object x is.

class(my_object)
# "tbl_df" "tbl" "data.frame"                     -> data is in R's memory
# "tbl_duckdb_connection" ...                     -> lazy DuckDB query, not yet fetched
# "FileSystemDataset" ... or "arrow_dplyr_query"  -> lazy Arrow connection, not yet fetched

Always check class() first if you get a strange error.

`nrow(df)`

What it does: returns the number of rows.

Used to print cohort sizes and verify that exclusions have worked.

`cat("text\n")`

What it does: prints text to the console without quotation marks. \n is a newline.

Used for progress messages: cat("Extracting NMI score...\n").

`stop("message")`

What it does: stops the code with an error message.

Used to give an understandable error if a required file is missing.

`stopifnot(condition)`

What it does: a sanity check. If the condition is TRUE, nothing happens and the code continues. If it is FALSE, the code stops with an error. Use it for assumptions that MUST hold, e.g. “one row per person”:

stopifnot(n_distinct(df$pnr) == nrow(df)) # errors if there are duplicates

What do you do if it stops? Then the assumption does not hold - e.g. there is more than one row per person. Find the duplicates and fix the cause:

df %>% count(pnr) %>% filter(n > 1) # see which pnr's recur
df <- df %>% distinct(pnr, .keep_all = TRUE) # keep one row per person (if that is correct)

Duplicates often come from a join that multiplied rows, or from duplicate person-year rows in the source data.
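The check-diagnose-fix cycle can be tried on a toy table. A minimal sketch with made-up data (df and its values are hypothetical); try() is used only so the example keeps running after the deliberate failure:

```r
library(dplyr)

# Hypothetical table where person b appears twice
df <- tibble(
  pnr = c("a", "b", "b", "c"),
  aar = c(2015, 2015, 2016, 2015)
)

# 1. The check fails because of the duplicate
check <- try(stopifnot(n_distinct(df$pnr) == nrow(df)), silent = TRUE)
inherits(check, "try-error") # TRUE

# 2. Diagnose: which pnr's occur more than once?
duplicated_pnrs <- df %>% count(pnr) %>% filter(n > 1)

# 3. Fix (only if one row per person is really what you want), then check again
df_unique <- df %>% distinct(pnr, .keep_all = TRUE)
stopifnot(n_distinct(df_unique$pnr) == nrow(df_unique)) # passes silently
```

distinct(pnr, .keep_all = TRUE) keeps the first row it meets for each person, so sort the table first if it matters which row survives.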

`gc()`

What it does: releases unused memory back to the operating system.

rm(large_register) # remove the object from R
gc() # release the memory

Use it after you are done with large registers - you share RAM with everyone else on the DST server.

Package overview

See Overview of registers for confirmed column names in all the registers these packages are used with.

Package	What it provides
`fastreg`	`convert()`, `read_register()` - SAS → parquet, then read registers by name (CRAN)
`dplyr`	`%>%`, `filter`, `select`, `mutate`, `left_join`, `group_by`, `arrange`, etc.
`tidyr`	`pivot_wider()` - reshaping from long to wide format
`lubridate`	`year()`, `ymd()`, `dmy()`, date calculations (`as.Date()` itself is base R)
`arrow`	`open_dataset()`, `read_parquet()`, `write_parquet()` - parquet file handling
`haven`	`read_sas()` - reading SAS data files

Further information

For further depth, see The Epidemiologist R Handbook.
