6  Demographics Table With Table1

Authors: Lea Nehme, Zhayda Reilly and Melanie Freire

Affiliations: Florida International University; Robert Stempel College of Public Health and Social Work

#install.packages("tidyverse")
library(tidyverse)

6.1 Introduction

In most scientific journals, the first table of a paper is commonly referred to as “Table 1”. It presents descriptive statistics of the baseline characteristics of the study population, stratified by exposure. The table1 package makes it fairly straightforward to produce such a table in R. A Table 1 reports descriptive statistics for the total study sample; the rows (explanatory variables) are the key study variables, which are often also included in the final analysis [1]. The columns correspond to the outcome of interest (response variable). Cells show n (%) for categorical variables and the mean, SD, or median for continuous variables. A total column is usually included as well, which helps in assessing the overall sample.

6.2 Necessary Packages

The table1 package provides the table1() function used to create a Table 1. The htmlTable package is loaded as well; HTML tables make life easy when copying the table into a Word document.

The boot package was created to support bootstrap analyses. It ships with numerous data sets, including clinical trial data sets, for this purpose. However, no code book accompanies the data when it is downloaded as a csv file. A resource on GitHub explains each data set in the package [2].

#install.packages("htmlTable")
#install.packages("table1")
#install.packages("boot")

# Load libraries
library(htmlTable)
library(table1)
library(boot)

6.3 Data source and description

Today, we will be using the melanoma data set, which consists of measurements on patients with malignant melanoma. Each patient had their tumor surgically removed between 1962 and 1977 at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Each surgery consisted of the complete removal of the tumor together with about 2.5 cm of the surrounding skin. The thickness of the tumor was then recorded, along with whether or not it was ulcerated. Both are important prognostic indicators: patients with a thick and/or ulcerated tumor have an increased chance of death from melanoma.

data(melanoma, package = "boot")
melanoma_data <- melanoma

# Now that we loaded the raw data set, we will conduct a visual exploration
# before wrangling the data and applying any functions, while also considering
# the requirements involved in the construction of a table1.

summary(melanoma_data)
      time          status          sex              age             year     
 Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00   Min.   :1962  
 1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1968  
 Median :2005   Median :2.00   Median :0.0000   Median :54.00   Median :1970  
 Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46   Mean   :1970  
 3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00   3rd Qu.:1972  
 Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00   Max.   :1977  
   thickness         ulcer      
 Min.   : 0.10   Min.   :0.000  
 1st Qu.: 0.97   1st Qu.:0.000  
 Median : 1.94   Median :0.000  
 Mean   : 2.92   Mean   :0.439  
 3rd Qu.: 3.56   3rd Qu.:1.000  
 Max.   :17.42   Max.   :1.000  

6.4 Cleaning the data to create a model data frame

Let us now explore the type of variables within the data set.

typeof(melanoma_data$status) 
[1] "double"
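The call above checks a single column. As a minimal sketch, assuming the boot package is installed, every column can be inspected at once; the name column_types is only illustrative and is not used elsewhere in this chapter.

```r
# Load a fresh copy of the data so this sketch is self-contained
data(melanoma, package = "boot")

# Storage type of every column, returned as a named character vector
column_types <- sapply(melanoma, typeof)
column_types

# str() gives a compact overview of types and the first few values
str(melanoma)
```

Seeing all the types side by side makes it easier to decide which coded variables need to be converted to factors.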

We will first build a basic table1 to illustrate how the function works. Currently, all the variables are stored as numeric (double) values. For a basic table1, however, the dependent/response variable of interest must be converted to a categorical variable (factor).

Our main variable of interest (dependent/response) is status. According to the code book found on GitHub, status has three levels that indicate the patient’s status at the end of the study: level 1 indicates that they had died from melanoma, level 2 that they were still alive at the conclusion of the study, and level 3 that they had died from causes unrelated to their melanoma. With this in mind, let us convert status into a factor variable with three levels. For ease of analysis we will use 2 = “Alive” as the reference level. This can be done in two ways:

  1. Although more time consuming, it is highly recommended that beginners first call as.factor() and then recode_factor() (from the dplyr package, loaded with tidyverse), which helps minimize errors.

  2. Once you are more skilled and understand how the factor() function works, you can do everything in one step: factor() accepts both levels and labels in a single call, so the work does not need to be split across several functions.
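The one-step approach in option 2 can be sketched in base R as follows. The vector status_codes is a toy example using the same coding as the melanoma data, not the data set itself.

```r
# Toy status codes (hypothetical values): 1 = died from melanoma,
# 2 = alive, 3 = died from other causes
status_codes <- c(3, 3, 2, 3, 1, 1)

# levels gives the codes in the desired order (reference level first);
# labels gives the matching descriptive names in the same order
status_factor <- factor(
  status_codes,
  levels = c(2, 1, 3),
  labels = c("Alive", "Died from melanoma", "Non-Melanoma death")
)

levels(status_factor)
status_factor
```

Because levels and labels are matched by position, a mistake in either vector silently mislabels the groups, which is one reason the two-step approach is suggested for beginners.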

For our example we will use as.factor then recode_factor() using 2 = “Alive” as our reference group.

melanoma_data$status <-
  as.factor(melanoma_data$status)

# print the first six observations
head(melanoma_data$status)
[1] 3 3 2 3 1 1
Levels: 1 2 3
# Recode
melanoma_data$status <- recode_factor(
  melanoma_data$status, 
  "2" = "Alive", # this is the reference group
  "1" = "Died from melanoma",
  "3" = "Non-Melanoma death"
)

# Print the first six observations
head(melanoma_data$status)
[1] Non-Melanoma death Non-Melanoma death Alive              Non-Melanoma death
[5] Died from melanoma Died from melanoma
Levels: Alive Died from melanoma Non-Melanoma death

As the variable levels show, “Alive” is now the reference level because it was listed first in recode_factor(). Choosing the reference level deliberately matters: the order of the levels determines the order of the columns in the table, and it lets you highlight the outcome of interest for your hypothesis. This lays the foundation of a well-organized table.
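If a factor already exists and only the reference level needs to change, base R's relevel() is an alternative. This is a minimal sketch on a toy vector and is not the approach used in this chapter.

```r
# Toy factor with the default level order "1", "2", "3"
status_codes <- factor(c(3, 3, 2, 3, 1, 1))
levels(status_codes)

# Move level "2" to the front; the remaining levels keep their order
status_releveled <- relevel(status_codes, ref = "2")
levels(status_releveled)
```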

6.5 Creation of basic table 1

Now that our main variable of interest is a factor with three levels, we will run a basic table1 with the independent/explanatory variables of interest: sex, age, ulcer, and thickness.

Recall that the explanatory variables of interest are still stored as doubles. Conveniently, table1 lets us summarize coded variables by level before they are converted to labeled factors: we simply wrap them in factor() inside the formula. This only applies to independent variables stored as numbers in which each number represents a group. For instance, although 0 is stored as a number, we know it denotes a group, such as female.

For independent variables wrapped in factor(), the table reports the number of cases (observations) and the percentage in each level. For continuous variables, we get the mean, SD, median, minimum, and maximum.

basic_table1 <- table1( 
  ~ factor(sex) + age + factor(ulcer) + thickness | status, 
  data = melanoma_data
)

basic_table1
                     Alive               Died from melanoma  Non-Melanoma death  Overall
                     (N=134)             (N=57)              (N=14)              (N=205)
factor(sex)
  0                  91 (67.9%)          28 (49.1%)          7 (50.0%)           126 (61.5%)
  1                  43 (32.1%)          29 (50.9%)          7 (50.0%)           79 (38.5%)
age
  Mean (SD)          50.0 (15.9)         55.1 (17.9)         65.3 (10.9)         52.5 (16.7)
  Median [Min, Max]  52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]   54.0 [4.00, 95.0]
factor(ulcer)
  0                  92 (68.7%)          16 (28.1%)          7 (50.0%)           115 (56.1%)
  1                  42 (31.3%)          41 (71.9%)          7 (50.0%)           90 (43.9%)
thickness
  Mean (SD)          2.24 (2.33)         4.31 (3.57)         3.72 (3.63)         2.92 (2.96)
  Median [Min, Max]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]  1.94 [0.100, 17.4]

Note that the table1 package uses a familiar formula interface, where the variables to include in the table are separated by ‘+’ symbols, the “stratification” variable (which creates the columns) appears to the right of a “conditioning” symbol ‘|’, and the data argument specifies a data.frame that contains the variables in the formula.
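To make the structure of that formula concrete, the sketch below assembles it from character vectors using base R only. The names row_vars and strata_var are hypothetical helpers introduced for illustration; writing the formula by hand, as in the chapter, works just as well.

```r
# Variables that become the rows of the table
row_vars <- c("sex", "age", "ulcer", "thickness")

# Variable that defines the columns (strata)
strata_var <- "status"

# Build "~ sex + age + ulcer + thickness | status" as a one-sided formula
table1_formula <- as.formula(
  paste("~", paste(row_vars, collapse = " + "), "|", strata_var)
)

table1_formula

# List every variable the formula refers to
all.vars(table1_formula)
```

Building the formula this way can be convenient when the list of row variables is long or is generated by other code.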

If we do not wrap a grouped (coded) variable in factor(), the following happens:

wrong_table1 <- table1(
  ~ sex + age + ulcer + thickness | status, 
  data = melanoma_data
)

wrong_table1
                     Alive               Died from melanoma  Non-Melanoma death  Overall
                     (N=134)             (N=57)              (N=14)              (N=205)
sex
  Mean (SD)          0.321 (0.469)       0.509 (0.504)       0.500 (0.519)       0.385 (0.488)
  Median [Min, Max]  0 [0, 1.00]         1.00 [0, 1.00]      0.500 [0, 1.00]     0 [0, 1.00]
age
  Mean (SD)          50.0 (15.9)         55.1 (17.9)         65.3 (10.9)         52.5 (16.7)
  Median [Min, Max]  52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]   54.0 [4.00, 95.0]
ulcer
  Mean (SD)          0.313 (0.466)       0.719 (0.453)       0.500 (0.519)       0.439 (0.497)
  Median [Min, Max]  0 [0, 1.00]         1.00 [0, 1.00]      0.500 [0, 1.00]     0 [0, 1.00]
thickness
  Mean (SD)          2.24 (2.33)         4.31 (3.57)         3.72 (3.63)         2.92 (2.96)
  Median [Min, Max]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]  1.94 [0.100, 17.4]

As you can see above, the summaries for the coded explanatory variables are not appropriate. For example, for sex we expect to see the number of individuals who are male or female; instead we get the mean, which is not a proper descriptive statistic because sex is a categorical variable.
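The mean of a 0/1 variable is simply the proportion of observations coded 1, which is why it is not informative on its own. A minimal base R sketch, assuming the boot package is installed and using a fresh copy of the data:

```r
# Fresh copy of the data, with sex still coded as 0/1
data(melanoma, package = "boot")

# The mean of the 0/1 codes ...
mean_sex <- mean(melanoma$sex)

# ... equals the proportion of observations coded 1
prop_coded_1 <- sum(melanoma$sex == 1) / nrow(melanoma)
all.equal(mean_sex, prop_coded_1)

# Counts and percentages per level are the appropriate summary
sex_counts <- table(melanoma$sex)
sex_counts
round(100 * prop.table(sex_counts), 1)
```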

To avoid this issue, as well as problems in other procedures (like logistic regression), it is crucial to convert such variables to factors before running any function. The basic table shown earlier also does not look great, because we do not have nice labels for the variables and categories. To improve things, we can create factors with descriptive labels for the categorical variables (sex and ulcer), label each variable the way we want, and specify units for the continuous variables (age and thickness). According to the code book, sex is coded 1 = male, 0 = female, and ulcer is an indicator of ulceration: 1 = present, 0 = absent. We also specify that the overall column be labeled “Total” and positioned on the left, and we add a caption and a footnote:

melanoma_data$sex <- as.factor(melanoma_data$sex)

# print the first six observations
head(melanoma_data$sex)
[1] 1 1 1 0 1 1
Levels: 0 1
# Recode
melanoma_data$sex <- recode_factor(
  melanoma_data$sex, 
  "0" = "Female",
  "1" = "Male"
)

# Print the first six observations
head(melanoma_data$sex)
[1] Male   Male   Male   Female Male   Male  
Levels: Female Male
typeof(melanoma_data$ulcer)
[1] "double"
melanoma_data$ulcer <- as.factor(melanoma_data$ulcer)

# print the first six observations
head(melanoma_data$ulcer)
[1] 1 0 0 0 1 1
Levels: 0 1
# Recode
melanoma_data$ulcer <- recode_factor(
  melanoma_data$ulcer, 
  "0" = "Absent",
  "1" = "Present"
)

# Print the first six observations
head(melanoma_data$ulcer)
[1] Present Absent  Absent  Absent  Present Present
Levels: Absent Present

In addition, we need to add units to the two continuous variables, age and thickness. According to the code book, age is the patient’s age in years and thickness is the tumor’s thickness in millimeters (mm). The table1 package provides an easy way to attach this measurement information:

units(melanoma_data$age) <- "years"
units(melanoma_data$thickness) <- "mm"

Additionally, for visual and descriptive purposes, the table1 package lets us set the label shown for each variable in the final table using the label() function. We also store a title for the table in caption_char and the footnote text in footnote_char; these are later passed to the caption and footnote arguments of table1().

label(melanoma_data$sex) <- "Sex"
label(melanoma_data$age) <- "Age"
label(melanoma_data$ulcer) <- "Ulceration"
label(melanoma_data$thickness) <- "Thickness*"

caption_char <- "Table 1. Melanoma Dataset Descriptive Statistics"
footnote_char <- "*Also known as Breslow thickness"

Below is the final table1 layout. Note that factor() is no longer needed in the formula, because the variables were already converted to factors in the previous steps.

table1(
  ~ sex + age + ulcer + thickness | status, 
  data = melanoma_data,
  overall = c(left = "Total"), 
  caption = caption_char, 
  footnote = footnote_char
)
Table 1. Melanoma Dataset Descriptive Statistics
                     Total               Alive               Died from melanoma  Non-Melanoma death
                     (N=205)             (N=134)             (N=57)              (N=14)
Sex
  Female             126 (61.5%)         91 (67.9%)          28 (49.1%)          7 (50.0%)
  Male               79 (38.5%)          43 (32.1%)          29 (50.9%)          7 (50.0%)
Age (years)
  Mean (SD)          52.5 (16.7)         50.0 (15.9)         55.1 (17.9)         65.3 (10.9)
  Median [Min, Max]  54.0 [4.00, 95.0]   52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]
Ulceration
  Absent             115 (56.1%)         92 (68.7%)          16 (28.1%)          7 (50.0%)
  Present            90 (43.9%)          42 (31.3%)          41 (71.9%)          7 (50.0%)
Thickness* (mm)
  Mean (SD)          2.92 (2.96)         2.24 (2.33)         4.31 (3.57)         3.72 (3.63)
  Median [Min, Max]  1.94 [0.100, 17.4]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]

*Also known as Breslow thickness

6.6 Changing the table’s appearance

The default style of table1 uses an Arial font and resembles the booktabs style commonly used in LaTeX. While this default style is not ugly, inevitably there will be a desire to customize the visual appearance of the table (fonts, colors, gridlines, etc.). The package provides a limited number of built-in options for changing the style, while further customization can be achieved in R Markdown documents using CSS [3].

6.6.1 Using built-in styles

The package includes a limited number of built-in styles including:

  • zebra: alternating shaded and unshaded rows (zebra stripes)

  • grid: show all grid lines

  • shade: shade the header row(s) in gray

  • times: use a serif font

These styles can be selected using the topclass argument of table1. Some examples follow:

table1(~ sex + age + ulcer + thickness | status, 
       data = melanoma_data,
       overall = c(left = "Total"), 
       caption = caption_char, 
       footnote = footnote_char, 
       topclass="Rtable1-zebra"
)
Table 1. Melanoma Dataset Descriptive Statistics
                     Total               Alive               Died from melanoma  Non-Melanoma death
                     (N=205)             (N=134)             (N=57)              (N=14)
Sex
  Female             126 (61.5%)         91 (67.9%)          28 (49.1%)          7 (50.0%)
  Male               79 (38.5%)          43 (32.1%)          29 (50.9%)          7 (50.0%)
Age (years)
  Mean (SD)          52.5 (16.7)         50.0 (15.9)         55.1 (17.9)         65.3 (10.9)
  Median [Min, Max]  54.0 [4.00, 95.0]   52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]
Ulceration
  Absent             115 (56.1%)         92 (68.7%)          16 (28.1%)          7 (50.0%)
  Present            90 (43.9%)          42 (31.3%)          41 (71.9%)          7 (50.0%)
Thickness* (mm)
  Mean (SD)          2.92 (2.96)         2.24 (2.33)         4.31 (3.57)         3.72 (3.63)
  Median [Min, Max]  1.94 [0.100, 17.4]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]

*Also known as Breslow thickness
table1(~ sex + age + ulcer + thickness | status, 
       data = melanoma_data,
       overall = c(left = "Total"), 
       caption = caption_char, 
       footnote = footnote_char, 
       topclass="Rtable1-grid"
)
Table 1. Melanoma Dataset Descriptive Statistics
                     Total               Alive               Died from melanoma  Non-Melanoma death
                     (N=205)             (N=134)             (N=57)              (N=14)
Sex
  Female             126 (61.5%)         91 (67.9%)          28 (49.1%)          7 (50.0%)
  Male               79 (38.5%)          43 (32.1%)          29 (50.9%)          7 (50.0%)
Age (years)
  Mean (SD)          52.5 (16.7)         50.0 (15.9)         55.1 (17.9)         65.3 (10.9)
  Median [Min, Max]  54.0 [4.00, 95.0]   52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]
Ulceration
  Absent             115 (56.1%)         92 (68.7%)          16 (28.1%)          7 (50.0%)
  Present            90 (43.9%)          42 (31.3%)          41 (71.9%)          7 (50.0%)
Thickness* (mm)
  Mean (SD)          2.92 (2.96)         2.24 (2.33)         4.31 (3.57)         3.72 (3.63)
  Median [Min, Max]  1.94 [0.100, 17.4]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]

*Also known as Breslow thickness
table1(~ sex + age + ulcer + thickness | status, 
       data = melanoma_data,
       overall = c(left = "Total"), 
       caption = caption_char, 
       footnote = footnote_char, 
       topclass="Rtable1-grid Rtable1-shade Rtable1-times"
)
Table 1. Melanoma Dataset Descriptive Statistics
                     Total               Alive               Died from melanoma  Non-Melanoma death
                     (N=205)             (N=134)             (N=57)              (N=14)
Sex
  Female             126 (61.5%)         91 (67.9%)          28 (49.1%)          7 (50.0%)
  Male               79 (38.5%)          43 (32.1%)          29 (50.9%)          7 (50.0%)
Age (years)
  Mean (SD)          52.5 (16.7)         50.0 (15.9)         55.1 (17.9)         65.3 (10.9)
  Median [Min, Max]  54.0 [4.00, 95.0]   52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]
Ulceration
  Absent             115 (56.1%)         92 (68.7%)          16 (28.1%)          7 (50.0%)
  Present            90 (43.9%)          42 (31.3%)          41 (71.9%)          7 (50.0%)
Thickness* (mm)
  Mean (SD)          2.92 (2.96)         2.24 (2.33)         4.31 (3.57)         3.72 (3.63)
  Median [Min, Max]  1.94 [0.100, 17.4]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]

*Also known as Breslow thickness

Note that the style name needs to be preceded by the prefix Rtable1-. Multiple styles can be applied in combination by separating them with a space.
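When several styles are combined, the topclass string can also be assembled programmatically. This is a small base R sketch; style_names and topclass_value are illustrative names, not part of the table1 package.

```r
# Built-in style names, without the prefix
style_names <- c("grid", "shade", "times")

# Add the "Rtable1-" prefix to each name and join them with spaces
topclass_value <- paste0("Rtable1-", style_names, collapse = " ")
topclass_value
```

The resulting string can be passed as the topclass argument, exactly as in the last example above.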

6.7 Conclusion

In conclusion, Table 1 is one of the most widely used tools in scientific research, and understanding how to use the table1 package in R can benefit many. Note that this presentation is only a brief summary of what is possible with the package. For example, you can add extra columns to the table, beyond the descriptive statistics, by using the extra.col option. You can also collapse the response variable to highlight two of the responses, such as dead versus alive in our example.
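As a rough sketch of the last idea, the three status codes could be collapsed into two groups with base R before building the table. This assumes the boot package is installed; the column name vital_status is hypothetical and is one possible way to define the two groups.

```r
# Fresh copy of the data, with status still coded 1, 2, 3
data(melanoma, package = "boot")

# Code 2 means alive; codes 1 and 3 are both deaths
melanoma$vital_status <- factor(
  ifelse(melanoma$status == 2, "Alive", "Dead"),
  levels = c("Alive", "Dead")
)

# Number of patients in each of the two groups
vital_counts <- table(melanoma$vital_status)
vital_counts
```

The new column could then be placed to the right of the ‘|’ symbol in the table1() formula in place of status.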

References

1. Hayes-Larson E, Kezios KL, Mooney SJ, Lovasi G. Who is in this study, anyway? Guidelines for a useful Table 1. Journal of Clinical Epidemiology [Internet] 2019;114:125–32. Available from: http://dx.doi.org/10.1016/j.jclinepi.2019.06.011
2. Davison AC, Hinkley DV. Bootstrap methods and their application [Internet]. Cambridge: Cambridge University Press; 1997. Available from: doi:10.1017/CBO9780511802843
3. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, N.J: L. Erlbaum Associates; 1988.