File types and how to open them
What you encounter in your workspace - and what opens it
When you open File Explorer on the DST server, you encounter files with different extensions. This page is your reference tool: what file type is it, which function opens it, and does the data enter memory immediately or not?
The last point - lazy vs. full loading - is the most important distinction on this page. The mechanism behind it is explained in Phase 5 - Extracting data step by step; here you just need to know which file types behave which way.
What a project folder looks like
A typical project on the DST server is organised roughly like this. The registers live under cleaned-data/; your own scripts and results sit under workspaces/[your-name]/:
E:/rawdata/[projectnumber]/
├── Grunddata/                    # raw register data from DST (often SAS) - read-only
│   └── bef/ lpr_adm/ lpr_diag/ …
└── Eksterne data/
    └── project-specific extracts etc/ …
E:/workdata/[projectnumber]/
├── cleaned-data/
│   └── parquet-registers/        # registers converted to parquet
│       └── bef/ lpr_adm/ lpr_a_kontakt/ …
└── workspaces/
    └── [your-name]/              # your own working area
        ├── R/                    # your analysis scripts (01_, 02_, …)
        ├── datasets/             # your own intermediate results (.rds)
        └── output/               # tables, figures and logs
The paths in the code examples follow this structure, but your own folder names may well differ - check with File Explorer. Two things are worth knowing:
- rawdata/ is read-only. You cannot edit or write to the raw register data DST delivers. You work in workspaces/, where each project member typically has their own subfolder ([your-name]/) for scripts, intermediate results and output.
- Almost all the code in this guide assumes the registers are in parquet. If your project only has raw SAS files, converting them to parquet - and setting up a sensible folder structure for the project - is the first job. See Convert SAS files to Parquet below.
How to name the scripts in R/ is covered in Good code practice.
Lazy vs. full loading - what is the difference in practice?
This is the most important distinction on the page, so we take it first. Every file type below falls into one of two loading behaviours.
Think of the difference as two ways to do your grocery shopping. Full loading is driving the whole supermarket home and only then picking out the three items you needed. Lazy loading is sending a shopping list and getting just those three items delivered. When a register runs to millions of rows, that difference is decisive.
Full loading (readRDS, read_sas, read_csv, read_xlsx): the entire file is read into RAM immediately. RAM is the computer's working memory; it is limited and shared with everyone else on the server. For your own intermediate results this is fine - they are relatively small. But it can crash your session if you try it on a whole register.
Lazy loading (parquet via open_dataset()): the file is opened as a connection to the data on disk, not as the data itself. You can filter and select columns (write your “shopping list”), and only when you call collect() are the selected rows and columns fetched into RAM. This is how you work with registers of millions of rows without running out of memory.
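The lazy pattern can be sketched as follows. This is a minimal illustration, assuming the arrow and dplyr packages are installed; the toy table, the temporary folder and the column names (pnr, year, age) are made up, and on the server you would point open_dataset() at a real register folder instead.

```r
library(arrow)
library(dplyr)

# Toy "register" written to a temporary folder (illustration only)
toy_dir <- tempfile("toy_register")
dir.create(toy_dir)
toy <- data.frame(
  pnr  = sprintf("%04d", 1:1000),
  year = rep(2015:2019, each = 200),
  age  = rep(0:99, times = 10)
)
write_parquet(toy, file.path(toy_dir, "toy.parquet"))

# Open lazily: a reference to the files on disk, not the data itself
ds <- open_dataset(toy_dir)

# Write the "shopping list": no rows are read yet
query <- ds %>%
  filter(year == 2018, age >= 65) %>%
  select(pnr, age)

# Only collect() fetches the selected rows and columns into RAM
result <- collect(query)
nrow(result)
```

Until collect() is called, query is only a description of what you want; the result in RAM contains just the filtered rows and the two selected columns.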
Large SAS files also burden everyone on the server. On DST all users share the server’s RAM, so read_sas() on a large SAS file takes memory from everyone at the same time. DST automatically kills processes (RStudio sessions, jobs, etc.) when RAM is close to full - so an oversized extraction can cost you your unsaved work. If you use a SAS file repeatedly, it is worth converting it to parquet once - this saves significant RAM and makes loading much faster. See Convert SAS files to Parquet below for the procedure, and Mind your RAM in the shared environment for the practical habits. DST’s official advice is collected in DST guide: Reducing RAM use in the shared environment (PDF, Danish).
Overview - file type, package, function
| File type | Package | Function | Loading |
|---|---|---|---|
| .parquet / parquet folder | arrow or fastreg | open_dataset("path/to/folder/") or read_register("name") | Lazy - nothing in RAM until collect() |
| .rds | base R | readRDS("path/to/file.rds") | Full - entire file into RAM |
| .sas7bdat | haven | read_sas("path/to/file.sas7bdat") | Full - entire file into RAM |
| .csv | readr | read_csv("path/to/file.csv") | Full - entire file into RAM |
| .xlsx | readxl | read_xlsx("path/to/file.xlsx") | Full - entire file into RAM |
The last column is the lazy/full distinction from above: parquet is the only lazy format, everything else loads fully into RAM.
What do you write in the parentheses?
- With open_dataset() (arrow) you write the path to the parquet folder - e.g. open_dataset("path/to/bef/"). It works on any parquet, on any project.
- With read_register() (fastreg) you write just the register name - e.g. read_register("bef") - because fastreg already knows where your parquet lives (you set that once during conversion). It also hands you a DuckDB connection, so more dplyr functions work without an extra to_duckdb() step. It requires that the registers were set up with fastreg.
The exact path depends on your project and server. Column names for each register are in Overview of registers.
Didn’t convert the data yourself? If a colleague already converted the registers with fastreg, you only point fastreg at that folder once per script - set options(fastreg.project_workdata_dir = "...") to the converted folder (ask whoever set it up for the exact path, typically cleaned-data/parquet-registers/). After that, read_register("bef") works by name, with no conversion and no path in each call. The Getting started vignette shows the option.
A register often arrives as many yearly files (e.g. bef2015.parquet, bef2016.parquet … in the same folder, sometimes one subfolder per year). Both open_dataset("path/to/bef/") and fastreg’s read_register("bef") read the whole folder (including any per-year subfolders) as one combined dataset. Pick years by filtering on the year column in the data, e.g. filter(year == 2015) (some registers use aar).
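A minimal sketch of the yearly-files case, assuming arrow and dplyr are installed. The folder, the file names and the columns (pnr, year) are made up for illustration; a real register folder will have its own column names.

```r
library(arrow)
library(dplyr)

# Hypothetical folder with one parquet file per year (illustration only)
bef_dir <- tempfile("bef")
dir.create(bef_dir)
for (y in 2015:2017) {
  one_year <- data.frame(pnr = sprintf("%04d", 1:100), year = y)
  write_parquet(one_year, file.path(bef_dir, paste0("bef", y, ".parquet")))
}

# The whole folder is opened as one combined dataset
bef <- open_dataset(bef_dir)

# Pick a year by filtering on the year column, then collect
bef_2015 <- bef %>%
  filter(year == 2015) %>%
  collect()

# Row counts per year, computed before anything is collected
per_year <- bef %>%
  count(year) %>%
  collect() %>%
  arrange(year)
```

Note that the year is taken from the column in the data, not from the file name, so the filter only works if that column exists in the register.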
RDS is the format you will write most often yourself. You save intermediate results from one script and reload them in the next:
saveRDS(cohort, "path/to/full_cohort.rds") # save an R object to disk
cohort <- readRDS("path/to/full_cohort.rds") # read it back in the next script
Rarer formats (Stata, SPSS, Feather, RData)
You rarely encounter these in a typical DST cohort study, but here they are for completeness:
| File type | Package | Function |
|---|---|---|
| .dta (Stata) | haven | read_dta() |
| .sav (SPSS) | haven | read_sav() |
| .feather | arrow | read_feather() |
| .rdata / .rda | base R | load() |
.rdata/.rda differs from .rds in that it can save multiple objects at once - but .rds is preferred because you control what the object is called when you read it back in.
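The naming difference can be seen in a small base R sketch. The two data frames and the temporary files are made up for illustration.

```r
cohort   <- data.frame(pnr = c("0001", "0002"), age = c(34, 71))
controls <- data.frame(pnr = c("0003", "0004"), age = c(35, 70))

rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".rda")

saveRDS(cohort, rds_file)                # one object, no name stored
save(cohort, controls, file = rda_file)  # several objects, names stored

rm(cohort, controls)

# readRDS() returns the object, and you choose the name
my_cohort <- readRDS(rds_file)

# load() recreates the objects under their original names
# (overwriting any existing objects with those names) and returns the names
restored <- load(rda_file)
restored
```

With readRDS() the assignment is visible in your script; with load() the objects appear in the environment under names decided when the file was saved.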
When to use each format (Parquet, RDS, SAS, CSV)
The three formats you work with day to day:
| File type | Used for |
|---|---|
| Parquet | The large registers (BEF, LPR, LMDB …). You load them lazily and filter before fetching data. |
| RDS | Your own intermediate results - datasets you save from one script and reload in the next. |
| SAS | Format tables and raw register data not yet converted to parquet. |
RDS is R’s own format. It is fast, compact and preserves all R properties (data types, factor levels, column names) perfectly. If you work with a pipeline of scripts - e.g. one script that builds your cohort and another that extracts diagnoses - you save the result from script 1 as .rds, so script 2 can read it in and continue from there, without re-running everything.
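A minimal sketch of the hand-over between two scripts, using base R only. The cohort, its columns and the temporary path are made up for illustration; in a project the file would typically go in your datasets/ folder.

```r
# --- end of script 1: build the cohort and save it ---
cohort <- data.frame(
  pnr        = c("0001", "0002", "0003"),
  index_date = as.Date(c("2015-03-01", "2016-07-15", "2017-11-30")),
  sex        = factor(c("F", "M", "F"), levels = c("F", "M"))
)
rds_path <- file.path(tempdir(), "full_cohort.rds")
saveRDS(cohort, rds_path)

# --- start of script 2: reload and continue from there ---
cohort2 <- readRDS(rds_path)

# Dates stay dates and factors keep their levels
str(cohort2)
same <- identical(cohort, cohort2)
```

Because the types survive the round trip, script 2 does not need to re-parse dates or rebuild factors before continuing.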
SAS - for format tables and unconverted register data:
library(haven)
df <- read_sas("E:/rawdata/[projectnumber]/lpr_adm2018.sas7bdat")
Loading large SAS files is very slow - which is exactly why data on DST is converted to parquet. Only use SAS for format tables and files without a parquet version.
CSV - for exporting finished tables (e.g. at repatriation):
library(readr)
write_csv(my_table, "output/table1.csv")
Never save raw register data as CSV - only aggregated results. See Phase 14 - Export and repatriation for the rules.
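As a sketch of "aggregate first, then export", assuming dplyr and readr are installed. The person-level data, the column names and the output path are made up for illustration; the actual export rules are in Phase 14.

```r
library(dplyr)
library(readr)

# Made-up person-level data (illustration only)
people <- data.frame(
  pnr   = sprintf("%04d", 1:8),
  sex   = c("F", "M", "F", "F", "M", "M", "F", "M"),
  group = c("case", "case", "control", "control",
            "case", "control", "case", "control")
)

# Aggregate first: one row per group and sex, no person identifiers
table1 <- people %>%
  count(group, sex, name = "n_persons")

# Export only the aggregated table
out_file <- file.path(tempdir(), "table1.csv")
write_csv(table1, out_file)
```

The exported file contains counts per group, not rows per person.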
Convert SAS files to Parquet
Most projects on DST receive registers as SAS files (.sas7bdat). Before you can use them with open_dataset() and lazy evaluation, they must be converted to Parquet once. After that you use them exactly like any other register.
Relevant for most projects outside DARTER. If you are working on a project where the registers have not already been converted to parquet, this step is necessary before you can run extractions. Done once per register - after that the normal extraction pattern applies.
Why Parquet is worth it (SAS vs Parquet)
| | SAS (.sas7bdat) | Parquet |
|---|---|---|
| Read time (1M rows) | ~30–120 sec | ~1–3 sec |
| Disk space | Large | 50–75% smaller |
| Requires package | haven | arrow |
| Lazy evaluation | No - all into RAM | Yes - filter BEFORE collect |
Recommended: convert with fastreg
The recommended tool is the fastreg package (dp-next, on CRAN). It converts SAS registers to Parquet (partitioned by year) and lets you read them back by name.
Use fastreg’s own guide for the conversion code. The exact commands are documented - and kept up to date by the maintainers - in fastreg’s Getting started vignette. We link to the relevant section below instead of reproducing code that could drift out of date. You install it once with install.packages("fastreg") and point it at your raw-data and output folders as shown there.
Convert a single file
If you only need to convert one SAS file, use fastreg’s convert() function - it writes a single register to Parquet. See the Getting started vignette for the exact call.
Convert many files at once
To convert a whole workspace, the fastreg team recommends its targets pipeline: use_template() copies a ready-to-run pipeline that converts all your registers in parallel - reproducibly, and re-runnable when a register is updated. See Converting multiple registers in parallel.
If fastreg is not available on your project: convert manually with haven + arrow
If you cannot install fastreg, you can convert a single register yourself. This is essentially what fastreg does under the hood:
library(haven) # read SAS files
library(arrow) # write and open Parquet
library(dplyr) # %>%, rename_with() and glimpse()
sas_file <- "E:/rawdata/[projectnumber]/rawdata/my_register.sas7bdat"
parq_path <- "E:/workdata/[projectnumber]/cleaned-data/parquet-registers/my_register/"
# 1. Read the SAS file into R (the entire file goes into RAM; "df" is just a name, use any you like)
df <- read_sas(sas_file)
# 2. Standardise column names to lower case
df <- df %>% rename_with(tolower)
# 3. Write as Parquet
dir.create(parq_path, recursive = TRUE, showWarnings = FALSE)
write_parquet(df, file.path(parq_path, "my_register.parquet"))
# 4. Verify - open it lazily like any other register
open_dataset(parq_path) %>% glimpse()
Read only what you need. Even before converting you can save RAM by limiting what is read in:
- read_sas() (haven) takes col_select = c(pnr, alder, civst) (pick columns), n_max = 10000 (first rows only - good for testing) and skip = (number of rows to skip before reading).
- heaven::import_SAS() (pre-installed on DST) is even more efficient for large files and can filter on values - e.g. keep = c("pnr","atc"), where = "..." (filter rows) or obs = 1000.
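A minimal sketch of the read_sas() options, assuming haven is installed. It uses the small iris.sas7bdat example file that ships with haven as a stand-in for a register file, so the column names (Sepal_Length, Species) belong to that example and not to any DST register.

```r
library(haven)

# Small example file that ships with haven (stands in for a register file)
sas_file <- system.file("examples", "iris.sas7bdat", package = "haven")

# Peek at the first rows to learn the column names without reading everything
preview <- read_sas(sas_file, n_max = 5)
names(preview)

# Read only two columns and only the first 10 rows
small <- read_sas(
  sas_file,
  col_select = c(Sepal_Length, Species),
  n_max = 10
)
dim(small)
```

Testing your code on a preview like this before running it on the full file is a cheap way to catch mistakes without loading the whole file into shared RAM.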
You can also convert to Parquet (or .dta) in StatTransfer, available on every server in the shared environment.
Next step
Why lazy loading works, and how collect() works, is the topic of the next phase.
→ Phase 5 - Extracting data step by step
Further depth (in English):
- Import and export in The Epidemiologist R Handbook.
- Arrow in R for Data Science: reading parquet with open_dataset() and using dplyr directly on arrow data - exactly the loading pattern this guide builds on.