File types and how to open them

What you encounter in your workspace - and what opens it

Published

July 2, 2026

When you open File Explorer on the DST server, you encounter files with different extensions. This page is your reference tool: what file type is it, which function opens it, and does the data enter memory immediately or not?

The last point - lazy vs. full loading - is the most important distinction on this page. The mechanism behind it is explained in Phase 5 - Extracting data step by step; here you just need to know which file types behave which way.


What a project folder looks like

A typical project on the DST server is organised roughly like this. The registers live under cleaned-data/; your own scripts and results sit under workspaces/[your-name]/:

E:/rawdata/[projectnumber]/
├── Grunddata/                     # raw register data from DST (often SAS) - read-only
│   └── bef/  lpr_adm/  lpr_diag/  …
└── Eksterne data/
    └── project-specific extracts etc/  …

E:/workdata/[projectnumber]/
├── cleaned-data/
│   └── parquet-registers/       # registers converted to parquet
│       └── bef/  lpr_adm/  lpr_a_kontakt/  …
└── workspaces/
    └── [your-name]/             # your own working area
        ├── R/                   # your analysis scripts (01_, 02_, …)
        ├── datasets/            # your own intermediate results (.rds)
        └── output/              # tables, figures and logs

The paths in the code examples follow this structure, but your own folder names may well differ - check with File Explorer. Two things are worth knowing:

  • rawdata/ is read-only. You cannot edit or write to the raw register data DST delivers. You work in workspaces/, where each project member typically has their own subfolder ([your-name]/) for scripts, intermediate results and output.
  • Almost all the code in this guide assumes the registers are in parquet. If your project only has raw SAS files, converting them to parquet - and setting up a sensible folder structure for the project - is the first job. See Convert SAS files to Parquet below.

How to name the scripts in R/ is covered in Good code practice.


Lazy vs. full loading - what is the difference in practice?

This is the most important distinction on the page, so we cover it before the individual file types. Every file type below has one of two loading behaviours.

Think of the difference as two ways to do your grocery shopping. Full loading is driving the whole supermarket home and only then picking out the three items you needed. Lazy loading is sending a shopping list and getting just those three items delivered. When a register runs to millions of rows, that difference is decisive.

Full loading (readRDS, read_sas, read_csv, read_xlsx): the entire file is read into RAM immediately. RAM is the computer’s working memory; it is limited and shared with everyone else on the server. For your own intermediate results this is fine, because they are relatively small. On a whole register it can exhaust the available memory and crash your session.

Lazy loading (parquet via open_dataset()): the file is opened as a connection to the data on disk, not as the data itself. You can filter rows and select columns (write your “shopping list”), and only when you call collect() are the selected rows and columns fetched into RAM. This is how you work with registers of millions of rows without running out of memory.
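The pattern can be sketched on a toy dataset. This is a minimal illustration, assuming the arrow and dplyr packages are installed; the folder, the data and the column names (pnr, year, age) are made up for the example and do not come from a real register.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a register, written to a temporary parquet folder.
# (Hypothetical data and column names, for illustration only.)
register_dir <- file.path(tempdir(), "toy_register")
dir.create(register_dir, showWarnings = FALSE)
toy <- data.frame(
  pnr  = sprintf("%04d", 1:1000),
  year = rep(2015:2019, each = 200),
  age  = rep(18:67, times = 20)
)
write_parquet(toy, file.path(register_dir, "toy.parquet"))

# Lazy: this only opens a connection; no rows are in RAM yet.
register <- open_dataset(register_dir)

# Still lazy: filter() and select() only build up the "shopping list".
query <- register |>
  filter(year == 2018, age >= 60) |>
  select(pnr, age)

# Only collect() fetches the selected rows and columns into RAM.
result <- collect(query)
```

Printing query before collect() shows a description of the query rather than data, which is a quick way to check that nothing has been loaded yet.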

Important

SAS files are large too, and loading them affects everyone. On DST all users share the server’s RAM, so read_sas() on a large SAS file burdens the server for everyone at the same time. DST automatically kills processes (RStudio sessions, jobs, etc.) when RAM is close to full, so an oversized extraction can cost you your unsaved work. If you use a SAS file repeatedly, it is worth converting it to parquet once: this saves significant RAM and makes loading much faster. See Convert SAS files to Parquet below for the procedure, and Mind your RAM in the shared environment for the practical habits. DST’s official advice is collected in DST guide: Reducing RAM use in the shared environment (PDF, Danish).


Overview - file type, package, function

File type | Package | Function | Loading
.parquet / parquet folder | arrow or fastreg | open_dataset("path/to/folder/") or read_register("name") | Lazy - nothing in RAM until collect()
.rds | base R | readRDS("path/to/file.rds") | Full - entire file into RAM
.sas7bdat | haven | read_sas("path/to/file.sas7bdat") | Full - entire file into RAM
.csv | readr | read_csv("path/to/file.csv") | Full - entire file into RAM
.xlsx | readxl | read_xlsx("path/to/file.xlsx") | Full - entire file into RAM

The last column is the lazy/full distinction from above: parquet is the only lazy format in the table; everything else loads fully into RAM.
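As a memory aid, the table can be written as a small dispatch function. This is only a sketch: open_by_extension() is a hypothetical helper invented for this example, not a function from arrow, haven, readr or readxl, and each package is only needed when a file of that type is actually opened.

```r
# Hypothetical helper: picks the reader from the overview table by file extension.
open_by_extension <- function(path) {
  # A folder is treated as a parquet folder and opened lazily.
  if (dir.exists(path)) {
    return(arrow::open_dataset(path))
  }
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    parquet  = arrow::open_dataset(path),  # lazy: nothing in RAM until collect()
    rds      = readRDS(path),              # full load
    sas7bdat = haven::read_sas(path),      # full load
    csv      = readr::read_csv(path),      # full load
    xlsx     = readxl::read_xlsx(path),    # full load
    stop("No reader listed for extension: .", ext)
  )
}

# Usage with a small .rds file in a temporary folder
demo_file <- file.path(tempdir(), "demo.rds")
saveRDS(data.frame(id = 1:3), demo_file)
demo <- open_by_extension(demo_file)
```

In real scripts it is usually clearer to call the reader directly, so the loading behaviour is visible on the line where it happens.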

What do you write in the parentheses?

  • With open_dataset() (arrow) you write the path to the parquet folder - e.g. open_dataset("path/to/bef/"). It works on any parquet folder, in any project.
  • With read_register() (fastreg) you write just the register name - e.g. read_register("bef") - because fastreg already knows where your parquet lives (you set that once during conversion). It also hands you a DuckDB connection, so more dplyr functions work without an extra to_duckdb() step. It requires that the registers were set up with fastreg.

The exact path depends on your project and server. Column names for each register are in Overview of registers.

Note

Didn’t convert the data yourself? If a colleague already converted the registers with fastreg, you only point fastreg at that folder once per script - set options(fastreg.project_workdata_dir = "...") to the converted folder (ask whoever set it up for the exact path, typically cleaned-data/parquet-registers/). After that, read_register("bef") works by name, with no conversion and no path in each call. The Getting started vignette shows the option.

Note

A register often arrives as many yearly files (e.g. bef2015.parquet, bef2016.parquet … in the same folder, sometimes one subfolder per year). Both open_dataset("path/to/bef/") and fastreg’s read_register("bef") read the whole folder (including any per-year subfolders) as one combined dataset. Pick years by filtering on the year column in the data, e.g. filter(year == 2015) (some registers use aar).
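A toy version of the yearly-files case, using arrow only. The folder name, file names and columns below are invented stand-ins for bef2015.parquet, bef2016.parquet and so on; the fastreg route is not shown here.

```r
library(arrow)
library(dplyr)

# Hypothetical yearly files in one folder, standing in for a real register.
bef_dir <- file.path(tempdir(), "toy_bef")
dir.create(bef_dir, showWarnings = FALSE)
for (yr in 2015:2017) {
  one_year <- data.frame(pnr = sprintf("%03d", 1:100), year = yr)
  write_parquet(one_year, file.path(bef_dir, paste0("bef", yr, ".parquet")))
}

# One call opens every file in the folder as a single lazy dataset.
bef <- open_dataset(bef_dir)

# Pick a year by filtering on the column, not by choosing a file.
bef_2016 <- bef |>
  filter(year == 2016) |>
  collect()
```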

RDS is the format you will write most often yourself. You save intermediate results from one script and reload them in the next:

saveRDS(cohort, "path/to/full_cohort.rds")   # save an R object to disk
cohort <- readRDS("path/to/full_cohort.rds")   # read it back in the next script

Rarer formats (Stata, SPSS, Feather, RData)

You rarely encounter these in a typical DST cohort study, but here they are for completeness:

File type | Package | Function
.dta (Stata) | haven | read_dta()
.sav (SPSS) | haven | read_sav()
.feather | arrow | read_feather()
.rdata / .rda | base R | load()

.rdata/.rda differs from .rds in that it can save multiple objects at once - but .rds is preferred because you control what the object is called when you read it back in.
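The naming difference is easiest to see side by side. A small base R sketch with made-up objects and temporary files:

```r
cohort   <- data.frame(id = 1:3)
controls <- data.frame(id = 4:6)

rdata_file <- file.path(tempdir(), "both.RData")
rds_file   <- file.path(tempdir(), "cohort.rds")

save(cohort, controls, file = rdata_file)  # several objects; their names are stored in the file
saveRDS(cohort, rds_file)                  # one object; no name is stored

rm(cohort, controls)

# load() recreates the objects under their original names
# and returns those names as a character vector.
restored_names <- load(rdata_file)

# readRDS() returns the object, so you choose the name yourself.
my_cohort <- readRDS(rds_file)
```

Because load() writes straight into your environment, it can silently overwrite an existing object that happens to have the same name; readRDS() cannot.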

When to use each format (Parquet, RDS, SAS, CSV)

The three formats you work with day to day:

File type | Used for
Parquet | The large registers (BEF, LPR, LMDB …). You load them lazily and filter before fetching data.
RDS | Your own intermediate results - datasets you save from one script and reload in the next.
SAS | Format tables and raw register data not yet converted to parquet.

RDS is R’s own format. It is fast, compact and preserves all R properties (data types, factor levels, column names) perfectly. If you work with a pipeline of scripts - e.g. one script that builds your cohort and another that extracts diagnoses - you save the result from script 1 as .rds, so script 2 can read it in and continue from there, without re-running everything.
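A quick way to see what "preserves all R properties" means is to round-trip the same small table through RDS and through CSV. The sketch uses made-up data and base R's write.csv()/read.csv() to stay dependency-free, and assumes R 4.0 or later (where read.csv() does not create factors by default); other CSV readers may guess column types differently.

```r
results <- data.frame(
  id    = 1:3,
  sex   = factor(c("F", "M", "F"), levels = c("M", "F")),
  index = as.Date(c("2018-01-15", "2018-03-02", "2018-07-30"))
)

rds_file <- file.path(tempdir(), "results.rds")
csv_file <- file.path(tempdir(), "results.csv")

# RDS round trip: the object comes back exactly as it was saved.
saveRDS(results, rds_file)
from_rds <- readRDS(rds_file)
same_after_rds <- identical(from_rds, results)

# CSV round trip: the values survive, but the factor (and its level order)
# and the Date class have to be rebuilt by hand.
write.csv(results, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)
same_after_csv <- identical(from_csv, results)
```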

SAS - for format tables and unconverted register data:

library(haven)
df <- read_sas("E:/rawdata/[projectnumber]/lpr_adm2018.sas7bdat")

Loading large SAS files is very slow - which is exactly why data on DST is converted to parquet. Only use SAS for format tables and files without a parquet version.

CSV - for exporting finished tables (e.g. at repatriation):

library(readr)
write_csv(my_table, "output/table1.csv")

Never save raw register data as CSV - only aggregated results. See Phase 14 - Export and repatriation for the rules.


Convert SAS files to Parquet

Most projects on DST receive registers as SAS files (.sas7bdat). Before you can use them with open_dataset() and lazy evaluation, they must be converted to Parquet once. After that you use them exactly like any other register.

Important

Relevant for most projects outside DARTER. If you are working on a project where the registers have not already been converted to parquet, this step is necessary before you can run extractions. Done once per register - after that the normal extraction pattern applies.

Why Parquet is worth it (SAS vs Parquet)
Property | SAS (.sas7bdat) | Parquet
Read time (1M rows) | ~30–120 sec | ~1–3 sec
Disk space | Large | 50–75% smaller
Requires package | haven | arrow
Lazy evaluation | No - all into RAM | Yes - filter BEFORE collect
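One possible way to do the conversion is sketched below. It is an illustration under stated assumptions, not an official DST or fastreg procedure: convert_sas_in_chunks() is a hypothetical helper written for this example, it relies on the skip and n_max arguments of haven::read_sas() to read the file a slice at a time, and it writes each slice as one parquet file in a folder that open_dataset() can read. If your project uses fastreg, follow its own conversion documentation instead.

```r
library(arrow)

# Hypothetical helper, not part of haven, arrow or fastreg.
# `reader` defaults to haven::read_sas(); any function that accepts
# `skip` and `n_max` in the same way can be swapped in.
convert_sas_in_chunks <- function(sas_path, out_dir,
                                  chunk_size = 1e6,
                                  reader = haven::read_sas) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  chunk_index <- 0
  repeat {
    # Read at most chunk_size rows, starting after the rows already converted.
    chunk <- reader(sas_path, skip = chunk_index * chunk_size, n_max = chunk_size)
    if (nrow(chunk) == 0) break  # assumes the reader returns 0 rows past the end

    part_file <- file.path(out_dir, sprintf("part-%05d.parquet", chunk_index))
    write_parquet(chunk, part_file)
    chunk_index <- chunk_index + 1

    if (nrow(chunk) < chunk_size) break  # a short chunk was the last one
  }
  invisible(chunk_index)  # number of parquet files written
}

# Usage (placeholder paths; adjust to your own project):
# convert_sas_in_chunks("path/to/file.sas7bdat", "path/to/folder/")
# register <- open_dataset("path/to/folder/")
```

Reading in slices keeps only one chunk in RAM at a time, which matters on the shared server. Check the result before relying on it, for example by comparing row counts with the SAS file and looking at how dates and labelled columns came through.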

Next step

Why lazy loading works, and how collect() works, is the topic of the next phase.

Phase 5 - Extracting data step by step

Tip

Further information

Further depth (in English):

  • Import and export in The Epidemiologist R Handbook.
  • Arrow in R for Data Science: reading parquet with open_dataset() and using dplyr directly on arrow data - exactly the loading pattern this guide builds on.