Descriptive tables (Table 1)
Baseline characteristics of your study population with gtsummary
Under development. More examples (subgroups, more statistics) are coming.
A “Table 1” describes your study population: age, sex and other baseline variables, usually split by exposure. It is almost always the first table in a register-based paper. Here we use the gtsummary package, which builds a publication-ready table in a few lines.
The code examples use generic path and variable names. Adapt them to your project. gtsummary (and huxtable, if you want to export to Excel) must be installed in your R environment on DST; if they are not, contact your data manager.
Starting point
You start from your analysis-ready dataset with one row per person (Phase 12).
library(dplyr) # %>% (pipe)
library(gtsummary) # tbl_summary() + export functions
df <- readRDS("path/to/analysis.rds") # analysis-ready dataset, one row per person
Example
tbl_summary() summarises the variables you point it at. With by = you split the table into columns by a group (e.g. exposure).
table1 <- df %>%
  tbl_summary(
    by = exposure, # one column per group (omit for an overall table)
    include = c(age, sex, bmi), # variables to include
    statistic = list(
      all_continuous() ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1, # number of decimals for continuous variables
    label = list(age ~ "Age (years)", sex ~ "Sex", bmi ~ "BMI"), # nice variable names
    missing = "ifany", # show a row for missing values only if there are any
    missing_text = "Missing" # the text for that row
  ) %>%
  add_overall() %>% # add an "Overall" column: ALL participants combined (both groups)
  bold_labels() %>% # bold the variable names
  modify_header(label ~ "**Variable**") # text in the header of the variable column (your choice)
table1 # print the table (display it)
More detail
- tbl_summary() detects the variable type itself from the column’s contents and picks a suitable statistic:
  - factor and character variables, and numeric variables with fewer than 10 unique values, become categorical and are shown as {n} ({p}%) (count and percentage);
  - other numeric variables become continuous and are shown as {median} ({p25}, {p75}) (median and quartiles, i.e. IQR);
  - variables coded 0/1, yes/no or TRUE/FALSE become dichotomous and are shown with only the one (“yes”) row.
- statistic = is therefore optional. Omit the line entirely and you get exactly the default display above. It is only included to show how you change it: for mean instead of median, write all_continuous() ~ "{mean} ({sd})".
- In gtsummary the tilde ~ means “assign”: what ~ value. E.g. age ~ "Age (years)" gives the age variable the label “Age (years)”, and all_continuous() ~ "{mean} ({sd})" sets that statistic for all continuous variables.
- all_continuous() / all_categorical() mean “all continuous” and “all categorical” variables respectively, so the rule applies to all of them at once.
- digits controls the number of decimals (here 1 for the continuous variables).
- add_overall() adds an Overall column with all participants combined (both groups in one column), next to the per-group columns. This is handy for showing the total count and overall distribution. Omit it if you only want the groups.
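The points above can be tried out without register data. The following is a minimal sketch, assuming the trial example dataset that ships with gtsummary (its variables trt, age, marker and response are not part of your project). It sets a different statistic per variable and uses type = to override the automatic type detection, so that a 0/1 variable is shown with both rows instead of only the “yes” row.

```r
library(gtsummary)

# `trial` is an example dataset bundled with gtsummary (NOT register data);
# it only stands in for your own analysis-ready dataset.
table_demo <- tbl_summary(
  trial,
  by = trt,                              # one column per treatment group
  include = c(age, marker, response),    # variables to include
  statistic = list(
    age ~ "{mean} ({sd})",               # mean (SD) for age only
    marker ~ "{median} ({p25}, {p75})"   # median (quartiles) for marker
  ),
  type = list(response ~ "categorical"), # 0/1 variable: show both levels
  missing = "ifany"
)

table_demo # print the table
```

Naming a single variable on the left of the tilde, as here, applies the rule to that variable only; all_continuous() on the left would apply it to every continuous variable.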
Should you add p-values?
You can add a p-value column with add_p(). It compares the groups, so it only works on a table that was split with by =. You can either run it afterwards on the finished table1 (as here), or insert add_p() into the chain above right after tbl_summary(); the result is the same. Note that hypothesis tests in a plain baseline table are often discouraged: with large register populations even trivial differences become statistically significant, so a small p-value says little about whether a baseline imbalance matters. Use them deliberately.
table1 %>% # take the already-built table ...
  add_p() # ... and add a column of p-values
Crude incidence rate (per 1000 person-years)
Register papers often report the crude incidence rate per exposure group alongside Table 1: the number of events divided by the total follow-up time (person-years), scaled to per 1000 person-years. It needs event (1 = event, 0 = censored) and followup_years from your analysis-ready dataset (built in Phase 12).
df %>%
  group_by(exposure) %>% # one row per exposure group
  summarise(
    events = sum(event), # number of outcomes in the group
    person_years = sum(followup_years), # total follow-up time (person-years)
    rate_per_1000 = sum(event) / sum(followup_years) * 1000 # crude rate per 1000 person-years
  )
This is a descriptive rate with no adjustment, and it counts as aggregated output (check for small cells before export). Comparing the groups while controlling for confounding belongs to regression and time-to-event analysis, where a Cox model gives a hazard ratio.
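To see the arithmetic behind the rate, here is a small worked sketch on made-up data (six invented persons, not register data). The column names follow the text above; the values are chosen so the result is easy to check by hand.

```r
library(dplyr)

# Invented toy data: 3 unexposed and 3 exposed persons
toy <- data.frame(
  exposure       = c("unexposed", "unexposed", "unexposed",
                     "exposed", "exposed", "exposed"),
  event          = c(0, 1, 0, 1, 1, 0),            # 1 = event, 0 = censored
  followup_years = c(5.0, 2.5, 2.5, 1.0, 3.0, 4.0) # follow-up time per person
)

rates <- toy %>%
  group_by(exposure) %>%
  summarise(
    events        = sum(event),
    person_years  = sum(followup_years),
    rate_per_1000 = sum(event) / sum(followup_years) * 1000
  )

rates
# exposed:   2 events /  8 person-years * 1000 = 250 per 1000 person-years
# unexposed: 1 event  / 10 person-years * 1000 = 100 per 1000 person-years
```

With real data, a missing value in event or followup_years would make the sums NA, so it is worth checking for missing values before running the summary.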
Export the table (for output control)
Anything leaving DST must go through output control. A table like this is aggregated, but always check for small cells before exporting.
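One way to check for small cells is to count the persons behind each cell of the table and list the cells that fall below a limit. The sketch below uses made-up data, and the threshold min_cell is a hypothetical placeholder: use the limit that applies to your project's output rules, not the number shown here.

```r
library(dplyr)

min_cell <- 5 # HYPOTHETICAL threshold; replace with the limit that applies to you

# Invented toy data standing in for your analysis-ready dataset
df_toy <- data.frame(
  exposure = c(rep("exposed", 10), rep("unexposed", 12)),
  sex      = c(rep("F", 7), rep("M", 3), rep("F", 6), rep("M", 6))
)

# Number of persons in each exposure-by-sex cell
# (count() lists only combinations that actually occur in the data)
cell_counts <- df_toy %>%
  count(exposure, sex, name = "n")

# Cells below the threshold need attention before export
small_cells <- cell_counts %>%
  filter(n < min_cell)

small_cells # here: exposed / M with n = 3
```

Repeat the count for every categorical variable in the table. An empty small_cells means no cell in that variable is below the chosen limit.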
# To Excel: as_hux_xlsx() requires the huxtable package to be INSTALLED,
# together with openxlsx, which huxtable uses to write the file
# (you don't need to call library(huxtable) yourself - gtsummary uses it behind the scenes).
table1 %>% as_hux_xlsx(file = "table1.xlsx") # write the table to a .xlsx file
# To Word: requires the flextable package installed
table1 %>%
  as_flex_table() %>% # gives a flextable object
  flextable::save_as_docx(path = "table1.docx") # write it to a Word document
# Always works without extra packages: turn the table into a data frame and save as CSV (base R)
table1 %>%
  as_tibble() %>% # the table as a plain data frame
  write.csv("table1.csv", row.names = FALSE) # a CSV opens directly in Excel
If huxtable or flextable is not available on DST, use the CSV route (base R, no packages needed) or contact your data manager.
Alternative package
finalfit::summary_factorlist() builds a similar baseline table and is popular in epidemiology. gtsummary is used here because exporting to Excel and Word is straightforward.
Remember: anything leaving DST must go through output control - no small cells, only aggregated results. See Phase 14 - Export and repatriation.
Further depth in The Epidemiologist R Handbook: