Inspect and understand your data

The most important commands for seeing what you actually have

Published

July 21, 2026

In Phase 6 you made your first extraction, using synthetic data from fakeregs. Now that you have data, you need to understand what you have before you analyse it. These commands only make sense at this point: they are tools for looking at a dataset, so they are meaningless without data to look at.

This page shows the commands you will use again and again to inspect your extracts.

$ - access one column in a table

data$koen means: “the column koen in the table data”. Replace data with your own dataset name and koen with your own column name. You will see $ everywhere in R code.
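
As a minimal sketch of what $ returns, the toy table below (the name toy and its values are invented for illustration) shows that a column comes out as an ordinary vector that you can index and measure:

```r
# Toy table - name and values are made up for illustration
toy <- data.frame(
  koen  = c(1, 2, 2),
  alder = c(34, 51, 27)
)

toy$koen          # the column koen as a vector: 1 2 2
toy$alder[2]      # second value of the column alder: 51
length(toy$koen)  # 3: one element per row
```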

Want to practise the commands in RStudio? All examples on the page use bef_data - a BEF extract with columns such as koen, alder and foed_dag. You can generate it with two lines, provided you have run the preparation block in Phase 6:

# Continuing from Phase 6 - bef_data is already opened as a lazy Arrow connection
bef_data <- bef_data %>% filter(year == 2015) %>% collect() # filter and fetch into R

Seeing a red error message?

Before looking in the code - do this in order:

1. Read the error message: which line is mentioned? Which object name appears?
2. Run class(object): is it data ("data.frame") or still a connection ("tbl_duckdb_connection")?
3. Run names(object): is the column named exactly what you think? A single letter or difference in capitalisation is enough to fail.
4. Isolate the failing line: run it alone and see what happens.
5. Use ?functionname: type e.g. ?colSums in the console to open the help documentation in the Help panel (bottom right). It shows what the function does, which arguments it takes, and examples.
6. Ask a colleague or search for the error message: see 2 - R: the bare essentials for the prioritised help list.
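
The class() and names() checks in this list can be sketched on a toy object (obj and its columns are invented for illustration). Note that on a plain data frame a misspelled column name after $ returns NULL rather than an error, which makes it easy to overlook:

```r
# Toy object - name and values are made up for illustration
obj <- data.frame(pnr = c("001", "002"), koen = c(1, 2))

class(obj)               # "data.frame": the data is in R
names(obj)               # the exact column names: "pnr" "koen"
"Koen" %in% names(obj)   # FALSE: R is case-sensitive
"koen" %in% names(obj)   # TRUE

# A misspelled column gives NULL with $ on a plain data frame
is.null(obj$Koen)        # TRUE
```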

An overview of common error messages and what they typically mean is in DST pitfalls.

See what you have

dim(bef_data) # number of rows and columns - e.g. 1200 8 means 1200 rows, 8 columns
nrow(bef_data) # number of rows only
ncol(bef_data) # number of columns only
names(bef_data) # column names as a vector

To check the column names on a lazy object before collect(), use colnames(bef) - it works on both Arrow- and DuckDB-based lazy objects. (names() works on a data frame after collect(), but does not necessarily return the columns on a lazy DuckDB/read_register() object - use colnames() to be safe.)

Example with simple data

df <- data.frame(
  pnr        = c("001", "002", "003"),
  sex        = c("M", "F", "M"),
  index_date = as.Date(c("2015-03-01", "2016-07-14", "2014-11-30"))
)

names(df)
# [1] "pnr"        "sex"        "index_date"

Understand the structure

glimpse(bef_data) # column name, type and first values - compact and readable (requires dplyr)
str(bef_data) # same information, but more verbose output
class(bef_data) # object type - is data actually in R, or is it still a connection?
class(bef_data$alder) # type for one column: "numeric", "character", "Date" etc.

class(bef_data) tells you whether you have real data ("data.frame"/"tbl_df") or still just an unsent query. You can see three different return values:

"data.frame" / "tbl_df" - data is in R. You can use all functions.
"tbl_duckdb_connection" / "Table" - lazy Arrow/DuckDB query. Missing collect().
"arrow_dplyr_query" - an Arrow query with one or more piped steps (e.g. filter() or select()), but not yet executed. Missing collect().

class() can help you debug

Does the object look like data but behave strangely, or does your code fail with a mysterious message? Run class(your_object) - if it is not "data.frame", you are probably missing a collect(). The full table of what class() can return - and why - is in Phase 5 - Extracting data step by step.
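
class() often returns several values at once (a tibble reports "tbl_df", "tbl" and "data.frame"), so a check in code is more robust with inherits() than with ==. A minimal sketch, where is_collected() is a hypothetical helper and not part of any package:

```r
# Hypothetical helper: TRUE if x is real data in R (a data frame or tibble)
is_collected <- function(x) inherits(x, "data.frame")

df <- data.frame(pnr = c("001", "002"))

is_collected(df)            # TRUE: a data frame
is_collected(list(a = 1))   # FALSE: the list stands in for a lazy object here
```

The list is only a stand-in: any object that does not inherit from "data.frame", including a lazy connection, gives FALSE.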

See the first and last rows

head(bef_data) # the first 6 rows - do the columns and types look right?
head(bef_data, 10) # the first 10 rows
tail(bef_data) # the last 6 rows - useful for detecting incomplete datasets
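
Example with simple data, as a small sketch (the table toy is invented for illustration):

```r
# Toy table with 20 rows - invented for illustration
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

head(toy, 3)      # rows 1 to 3
tail(toy, 3)      # rows 18 to 20
nrow(head(toy))   # 6: the default number of rows shown
```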

Explore the contents

Two things must be in place before the commands in the rest of this section work:

1. Give your extraction a name. If you just write open_dataset(...) (or read_register(...) in DARTER) without storing it in an object, R only prints a quick preview and throws the result away - there is nothing to inspect afterwards. Assign it a name with <- (see Phase 2 - What is an object?) so you can reuse it.

2. Pull the data into R with collect() first. Every command that uses $ (e.g. table(bef$koen), unique(), summary(), min()/max()/median(), hist(), colSums(is.na())) needs real data in R - $ cannot extract a column from a lazy Arrow/DuckDB connection. Check with class(bef): if it does not say "data.frame"/"tbl_df", you are missing a collect().

You do not rewrite read_register()/open_dataset() to pull the data in - keep working with the object you already made. Reuse the name and store the result, usually under the same name, so bef is overwritten and is now real data. You can pull everything in as-is, or reduce it with filter()/select() first:

bef <- read_register("bef") # lazy connection - just a name so far

bef <- bef %>% collect() # pull EVERYTHING in as-is - overwrites bef

# OR reduce first (recommended on large registers):
bef <- bef %>% filter(year == 2015) %>% select(pnr, koen, alder) %>% collect()

# now bef$koen, table(bef$koen), summary(bef) etc. all work

(If you want to keep the lazy connection, give the fetched data a new name instead: bef_2015 <- bef %>% filter(year == 2015) %>% collect().)

colnames(), glimpse(), head() and dplyr verbs (count(), filter(), select()) do work fine on the lazy connection, though. So peek lazily, filter()/select() it down, and collect() a small slice before you dig into the values.

As noted at the top of the page, $ is R’s way of saying “this column in this table”: data$koen is the column koen in the table data.

unique(bef_data$koen) # which unique values exist in the koen column?
table(bef_data$koen) # frequency table: how many rows have each value?
table(bef_data$koen, bef_data$civst) # cross-table: distribution of sex across marital status
table(bef_data$koen, useNA = "ifany") # count NAs too - otherwise they are hidden (see warning below)

table() hides NA by default. Without useNA = "ifany", table() only counts the “real” values and drops missing values entirely - so the distribution looks complete even when part of the column is NA. Always add useNA = "ifany" (show the NA row only if NAs exist) or useNA = "always" (always show the NA row) when inspecting a column, so you spot missing values instead of overlooking them.

Example with simple data

# A small example with five patients and two variables:
df <- data.frame(
  sex          = c("M", "F", "M", "F", "M"),
  age_group    = c("18-40", "18-40", "41-60", "41-60", "41-60")
)

table(df$sex)
#  F  M
#  2  3       # 2 women, 3 men

table(df$sex, df$age_group)
#    18-40  41-60
# F      1      1    # 1 woman in 18-40, 1 woman in 41-60
# M      1      2    # 1 man in 18-40, 2 men in 41-60
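
The NA warning for table() can be illustrated the same way. In this sketch (the vector sex is invented for illustration) one of five values is missing, and the default table silently counts only four:

```r
sex <- c("M", "F", NA, "F", "M")   # one missing value

table(sex)                    # F = 2, M = 2: the NA is silently dropped
table(sex, useNA = "ifany")   # F = 2, M = 2, <NA> = 1
sum(table(sex))               # 4, although the vector has 5 elements
```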

Summarise data

What is NA?

NA (Not Available) is R’s term for a missing or unknown value. A cell can have NA because the information was not recorded, not reported, or does not exist for that person. Most calculation functions return NA if the data contain even a single NA - unless you write na.rm = TRUE (“remove NAs”). is.na(x) returns TRUE for NA values and FALSE for everything else.
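
A minimal sketch of this behaviour, using an invented vector age:

```r
age <- c(34, 51, NA, 27)

mean(age)                 # NA: one missing value makes the result NA
mean(age, na.rm = TRUE)   # 37.33333: mean of the three known values
is.na(age)                # FALSE FALSE TRUE FALSE
sum(is.na(age))           # 1 missing value
```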

When is NA a problem? It depends on which column is missing:

NA in a key variable (index date, pnr, outcome) is serious - these persons cannot be correctly included in the analysis, and you must decide whether to exclude them.
NA in a covariate (e.g. income) can often be handled - e.g. with a separate “unknown” category or imputation.
NA from a join usually means a person was not found in the right-hand dataset - e.g. no prescription record. Here NA is effectively a “no/none”, not an error.

Always check colSums(is.na(data)) immediately after an extraction or a join, so you detect unexpected gaps before they propagate silently through the analysis.
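
The join case can be sketched with two toy tables (persons, prescriptions and the column n_presc are invented for illustration). The sketch uses base R merge() with all.x = TRUE as a stand-in for a left join, so it runs without extra packages:

```r
# Toy tables - names and values are invented for illustration
persons <- data.frame(pnr = c("001", "002", "003"))
prescriptions <- data.frame(pnr = c("001", "003"), n_presc = c(4, 1))

# Left join in base R: keep every person, matched or not
joined <- merge(persons, prescriptions, by = "pnr", all.x = TRUE)

colSums(is.na(joined))   # pnr 0, n_presc 1: person 002 had no match

# If NA here means "none", recode it explicitly
joined$n_presc[is.na(joined$n_presc)] <- 0
```

Whether an NA after a join should become 0, "unknown" or an exclusion depends on the variable, so the recoding in the last line is one option and not a general rule.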

summary(bef_data) # min, max, median, mean and quartiles for numeric columns (character columns show only length and type)
summary(bef_data$foed_dag) # summary of one column

For continuous variables:

min(bef_data$alder, na.rm = TRUE) # smallest value (na.rm removes NAs - rm = remove)
max(bef_data$alder, na.rm = TRUE) # largest value
mean(bef_data$alder, na.rm = TRUE) # mean
median(bef_data$alder, na.rm = TRUE) # median
sd(bef_data$alder, na.rm = TRUE) # standard deviation
IQR(bef_data$alder, na.rm = TRUE) # interquartile range (Q3 - Q1)
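
Example with simple data, as a small sketch (the vector alder is invented for illustration). It shows how the interquartile range follows from the quartiles:

```r
alder <- c(20, 30, 40, 50, NA)

quantile(alder, na.rm = TRUE)   # 0%: 20, 25%: 27.5, 50%: 35, 75%: 42.5, 100%: 50
IQR(alder, na.rm = TRUE)        # 15 = 42.5 - 27.5
median(alder, na.rm = TRUE)     # 35
```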

Check missing values

sum(is.na(bef_data$koen)) # number of NAs in the koen column - replace with your own column name
colSums(is.na(bef_data)) # number of NAs per column - gives an overview of the entire dataset

colSums(is.na(bef_data)) returns a named vector with one count per column:

#  pnr  koen  alder  foed_dag  year  civst  opr_land  reg
#    0     0      0         3     0      0         0    0

Here foed_dag is missing in 3 rows - everything else is complete. A column with 0 has no NAs at all.

colSums(is.na()) counts only true NA. In register data, “missing” is often stored as an empty string, a blank string (whitespace only) or a sentinel code - not as NA. A column can therefore show 0 NAs and still be full of empty values: e.g. a fixed-width key field filled with blanks (a 13-character field that looks empty but contains 13 spaces, not NA). So also check for blank values - e.g. sum(trimws(bef_data$dw_ek_forloeb) == "", na.rm = TRUE), where na.rm = TRUE keeps true NAs from turning the whole count into NA - and look at nchar()/unique() on a slice. This matters most for join keys: a blank-but-not-NA key causes silent join errors (rows that look filled but don’t match).
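
A minimal sketch of the difference between NA and blank values, using an invented key vector. The last step, which turns blanks into NA so that later NA checks catch them, is one possible clean-up and not a required step:

```r
# Toy key column: a real key, a 13-space field, an empty string and a true NA
key <- c("K01", strrep(" ", 13), "", NA)

sum(is.na(key))                        # 1: only the true NA is counted
sum(trimws(key) == "", na.rm = TRUE)   # 2: the blank and the empty value
nchar(key[1:3])                        # 3 13 0: the "empty" field is 13 characters long

# Optional clean-up: turn blank values into NA
key_clean <- key
key_clean[which(trimws(key_clean) == "")] <- NA
sum(is.na(key_clean))                  # 3
```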

Check dates

Date columns can contain impossible values - dates far outside the study period are a sign of a conversion error or wrong column.

min(bef_data$foed_dag, na.rm = TRUE) # is the earliest birth date plausible?
max(bef_data$foed_dag, na.rm = TRUE) # is the latest birth date plausible?

# Check for dates BEFORE an expected interval:
sum(bef_data$foed_dag < as.Date("1900-01-01"), na.rm = TRUE) # replace the date with your lower bound

# Check for dates AFTER an expected interval:
sum(bef_data$foed_dag > as.Date("2015-12-31"), na.rm = TRUE) # replace the date with your upper bound

See DST pitfalls for the most common date conversion errors and how to fix them.
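
Example with simple data, as a small sketch. The dates are invented, and the bounds lower and upper are example values that you should replace with your own study period:

```r
# Toy birth dates - invented for illustration
foed_dag <- as.Date(c("1985-04-12", "1800-01-01", "2031-06-30", NA))
lower <- as.Date("1900-01-01")   # example lower bound
upper <- as.Date("2015-12-31")   # example upper bound

range(foed_dag, na.rm = TRUE)                # "1800-01-01" "2031-06-30"
sum(foed_dag < lower, na.rm = TRUE)          # 1 date before the interval
sum(foed_dag > upper, na.rm = TRUE)          # 1 date after the interval
which(foed_dag < lower | foed_dag > upper)   # 2 3: the rows to look at
```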

More exploration: count, sort and quick plots

Count and sort (dplyr):

Replace bef_data with your own dataset name and column names with your own.

bef_data %>% count(koen)               # number of rows per category
bef_data %>% count(koen, civst)        # per combination of two variables
bef_data %>% arrange(foed_dag)         # sort ascending by birth date
bef_data %>% arrange(desc(foed_dag))   # sort descending
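
If dplyr is not loaded, roughly the same counting and sorting can be done in base R. This is a sketch on an invented table toy, with table() standing in for count() and order() standing in for arrange():

```r
# Toy table - invented for illustration
toy <- data.frame(
  koen     = c(1, 2, 2, 1, 2),
  foed_dag = as.Date(c("1990-05-01", "1975-01-20", "2001-09-09",
                       "1960-12-31", "1988-03-15"))
)

counts <- as.data.frame(table(koen = toy$koen))   # rows per category, like count(koen)
counts                                            # columns koen and Freq

toy[order(toy$foed_dag), ]                        # ascending, like arrange(foed_dag)
toy[order(toy$foed_dag, decreasing = TRUE), ]     # descending, like arrange(desc(foed_dag))
```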

Quick visualisations - for getting an overview, not for publication:

Replace bef_data with your own dataset name and column names with your own.

# Continuous variables:
hist(bef_data$alder)                            # histogram
boxplot(bef_data$alder)                         # box plot
boxplot(alder ~ koen, data = bef_data)          # box plot split by sex

# Categorical variables:
barplot(table(bef_data$koen))                   # bar chart

Next steps

You can now inspect a dataset. The next step is to learn which registers contain what:

Phase 8 - Find your registers: decision table: which register contains what
Functions: overview: filter(), select(), mutate(), left_join() and more
dataReporter: automatically generates an HTML report of all columns (distribution, missingness, outliers). Confirmed on DST.