3  Mosaic & Box/Violin Plots

Authors

Ashlee Perez & Michaela Larson

Affiliations

Florida International University

Robert Stempel College of Public Health and Social Work

3.1 Packages for this Lesson

# Installing Required Packages
# install.packages("public.ctn0094data")
# install.packages("tidyverse")
# install.packages("ggmosaic")

# Loading Required Packages
library(public.ctn0094data)
library(tidyverse)
library(ggmosaic)

3.2 Introduction to Mosaic and Box/Violin Plots

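Mosaic, box, and violin plots are useful for visualizing how observations are distributed across groups: mosaic plots show relative frequencies, while box and violin plots show summary statistics and distribution shape.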

A mosaic plot is a special type of stacked bar chart for two or more categorical variables. The width of each column is proportional to the number of observations in the corresponding level of the variable plotted on the horizontal (x) axis. Within each column, the height of each tile is proportional to the number of observations in the corresponding level of the second variable.
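To make this geometry concrete, the column widths and tile heights can be computed by hand from a two-way table. The sketch below uses base R only; the `toy` data frame and its values are made up for illustration and are not CTN data.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  project = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  race    = c("X", "X", "X", "Y", "X", "X", "X", "Y", "Y", "Y")
)

# Column widths: share of all observations in each level of the x variable
col_widths <- prop.table(table(toy$project))

# Tile heights: share of each race within each project (each column sums to 1)
tile_heights <- prop.table(table(toy$race, toy$project), margin = 2)

col_widths
tile_heights
```

In this toy example, project B has more observations than project A, so its column would be drawn wider, and the tiles inside each column are stacked in proportion to the within-project shares.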

Box and violin plots are used for a continuous variable summarized by group. A box plot displays the five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The height of the box, Q3 minus Q1, is the interquartile range (IQR). Note that geom_boxplot() by default draws each whisker only as far as the most extreme observation within 1.5 × IQR of the box and plots anything beyond that as an individual outlier point. A violin plot illustrates the distribution of numerical data for one or more levels of a categorical variable by showing a density curve for each level. The width of each curve at a given value reflects the relative frequency of data points near that value. A box plot is typically overlaid to provide summary statistics.
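The summary measures behind a box plot can be inspected directly in base R. The following is a minimal sketch on a made-up vector of ages (the values are hypothetical and chosen only so that one observation falls outside the whiskers).

```r
# Toy ages (hypothetical values, for illustration only)
age <- c(21, 24, 25, 27, 30, 33, 38, 41, 47, 52, 78)

# Quartiles as computed by quantile() with its default method
q <- quantile(age, probs = c(0.25, 0.5, 0.75))

# Interquartile range: Q3 - Q1, the height of the box
iqr <- IQR(age)

# boxplot.stats() returns the lower whisker end, lower hinge, median,
# upper hinge, and upper whisker end in $stats, and any points more than
# 1.5 * IQR beyond the hinges in $out
bs <- boxplot.stats(age)

q
iqr
bs$stats
bs$out
```

Comparing `bs$stats` with `range(age)` shows why the whisker ends of a box plot are not always the minimum and maximum of the data.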

3.3 Data Source and Description

The National Drug Abuse Treatment Clinical Trials Network (CTN) is a means by which medical and specialty treatment providers, treatment researchers, participating patients, and the National Institute on Drug Abuse cooperatively develop, validate, refine, and deliver new treatment options to patients. The demographics and everybody data sets from the public.ctn0094data package are used for the following visualizations. CTN-0094 is a comprehensive, harmonized, and normalized database of treatment data from CTN-0027, CTN-0030, and CTN-0051 that describes the experiences of individuals with opioid use disorder (OUD) who seek care.

3.4 Cleaning the Data to Create a Model Data Frame

The demographics and everybody data sets within the public.ctn0094data package are joined on the subject ID (the who variable). The variables age, race, is_male (gender), and project are then selected for the visualizations below.

# Creating model data frame to include age, race, project, and is_male
# from demographics and everybody data sets. Joined by subject ID (who)
demographics_df <- demographics %>% 
  left_join(everybody, by = "who")  %>%
  select(age, race, project, is_male)
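After a join, it is worth confirming that no rows were dropped or duplicated and checking how many values are missing. The sketch below illustrates the behavior of left_join() on two small stand-in tables; `demo_toy` and `everybody_toy` are hypothetical and do not come from the package.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical values, for illustration only)
demo_toy <- data.frame(who = c(1, 2, 3), age = c(25, 40, 31))
everybody_toy <- data.frame(who = c(1, 2, 4), project = c("27", "30", "51"))

joined_toy <- demo_toy %>%
  left_join(everybody_toy, by = "who")

# left_join() keeps every row of the left table
nrow(joined_toy) == nrow(demo_toy)

# Rows of the left table without a match get NA in the joined columns
sum(is.na(joined_toy$project))
```

The same two checks could be run on demographics_df to see whether any participant lacks a project assignment.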

3.5 Assumptions with Mosaic & Box/Violin Plots

In mosaic plots, two categorical variables are plotted along the horizontal (x) and vertical (y) axis. Each combination of categories forms a rectangle or tile within the plot.

In box and violin plots, a categorical variable is plotted along the horizontal (x) axis, while a continuous variable is plotted along the vertical (y) axis. Violin plots can be hard to compare when symmetry, skew, or other shape and variability characteristics differ between groups, because density curves do not lend themselves to precise comparison. For this reason, violin plots are typically rendered with another chart type overlaid, such as box plot quartiles.

3.6 Code to Run Mosaic & Box/Violin Plots & Output

3.6.1 Mosaic Plots

To create a mosaic plot, first specify the data object within the ggplot() function. Then set the aesthetic mappings within the geometric object layer geom_mosaic().

In geom_mosaic(), the following aesthetics can be specified:

  1. weight: a weighting variable.
  2. x: categorical variable for the x-axis.
    • Specified as x = product(var1, var2, ...)
    • The product() function wraps one or more categorical variables so that they can be passed together to a single aesthetic.
  3. alpha: a variable specifying transparency.
    • If the variable is not also used in x, it is added to the plot in the first position.
  4. fill: a variable specifying fill color.
    • If the variable is not also used in x, it is added to the plot after the optional alpha variable.
  5. conds: a variable specifying conditions.
    • Specified as conds = product(var1, var2, ...)

The order of the variables matters because the product plot is built hierarchically: each variable splits the tiles produced by the variables before it.
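Why the order matters can be seen without drawing anything: splitting by one variable first and then by the other yields different conditional proportions than the reverse order. The sketch below uses base R and a made-up table (the `toy` data and its values are hypothetical); it illustrates the idea only and does not reproduce the internals of geom_mosaic().

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  project = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  race    = c("X", "X", "X", "Y", "X", "X", "X", "Y", "Y", "Y")
)
tab <- table(race = toy$race, project = toy$project)

# Split by project first, then by race within each project
race_within_project <- prop.table(tab, margin = 2)

# Split by race first, then by project within each race
project_within_race <- prop.table(tab, margin = 1)

race_within_project
project_within_race
```

The same cell of the table gets a different proportion in each case, so the corresponding tile would have a different size depending on which variable is split first.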

3.6.1.1 Basic Mosaic Plot

In the following example of a basic mosaic plot, we visualize the distribution of race across CTN projects 27, 30, and 51.

# Basic Mosaic Plot
mosaic_basic <- demographics_df %>% 
  ggplot() +
  geom_mosaic(
    aes(
      # geom_mosaic() has no one-to-one mapping between a variable and the x-
      # or y-axis. Wrap the variable(s) in product() so that geom_mosaic() can
      # accept one or several variables through the single x aesthetic.
      x = product(project),
      fill = race
    )
  ) +
  labs(
    y = "Race",
    x = "Project",
    title = "Mosaic Plot of Race by CTN Project") +
  # Specifies default `geom_mosaic` aesthetics, e.g white panel background, 
  # removes grid lines, adjusts widths and heights of rows and columns to 
  # reflect frequencies
  theme_mosaic() +
  # Removes legend illustrating Race and respective fill colors
  # (the documented value is lowercase "none")
  theme(legend.position = "none")
  
mosaic_basic
Figure 3.1

3.6.1.2 More Advanced Mosaic Plot

In a more advanced version of a mosaic plot, we can visualize more than two categorical variables. The following example uses race, project, and gender (is_male) among CTN projects 27, 30, and 51.

# Advanced Mosaic Plot
mosaic_advanced <- demographics_df %>% 
  ggplot() +
  geom_mosaic(
    aes(
      x = product(race, project),
      fill = is_male
    )
  ) +
  labs(
    y = "Race",
    x = "Project",
    title = "Mosaic Plot of Race by Gender and CTN Project",
    fill = "Gender"
  ) +
  scale_fill_manual(
    labels = c("No" = "Female", "Yes" = "Male"),
    values = c("darkseagreen2", "darkslategray3", "grey")
  ) +
  theme_mosaic() +
  # Adjust axis tick labels to 60 degrees and justification to the right
  # with hjust (horizontal justification) and vjust (vertical justification)
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
  
mosaic_advanced
Figure 3.2

3.6.2 Box Plots

To create a box plot, first specify the data object within the ggplot() function. Then set the aesthetic mappings within the aes() or aesthetic layer. The geom_boxplot() layer draws the box plot.

The following aesthetics are understood by geom_boxplot(). Aesthetics 2 through 6 are computed from the data by default and only need to be supplied when plotting precomputed summaries with stat = "identity".

  1. x or y: Specifies the categorical variable along the x- or y-axis.
  2. lower or xlower: Specifies the 25th percentile/first quartile (one edge of the box).
  3. upper or xupper: Specifies the 75th percentile/third quartile (the other edge of the box).
  4. middle or xmiddle: Specifies the 50th percentile/second quartile/median.
  5. ymin or xmin: Specifies the end of the lower whisker.
  6. ymax or xmax: Specifies the end of the upper whisker.
  7. alpha: Specifies a variable to determine transparency.
  8. color: Assigns an outline color to respective levels of a specified categorical variable.
  9. fill: Assigns a fill color to respective levels of a specified categorical variable.
  10. group: Partitions data by a discrete variable when no other grouping variable is specified, or when the default grouping is not the one you intend.
  11. linetype: Specifies line type of box plot.
  12. linewidth: Specifies line width of box plot.
  13. shape: Specifies the shape of the (outlier) points.
  14. size: Specifies the size of the points and text.
  15. weight: Specifies a weight variable.
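The aesthetics lower, middle, upper, ymin, and ymax are normally calculated for you, but they can also be supplied directly when the summaries already exist. The following is a minimal sketch using stat = "identity"; the `box_stats` data frame and its values are hypothetical and are not derived from the CTN data.

```r
library(ggplot2)

# Precomputed summaries (hypothetical values, for illustration only)
box_stats <- data.frame(
  group  = c("A", "B"),
  ymin   = c(18, 22),  # end of lower whisker
  lower  = c(25, 30),  # first quartile
  middle = c(31, 41),  # median
  upper  = c(40, 50),  # third quartile
  ymax   = c(60, 66)   # end of upper whisker
)

box_precomputed <- ggplot(box_stats) +
  aes(
    x = group, ymin = ymin, lower = lower,
    middle = middle, upper = upper, ymax = ymax
  ) +
  # stat = "identity" draws the supplied values instead of computing them
  geom_boxplot(stat = "identity")

box_precomputed
```

This form is mainly useful when only summary statistics are available, for example from a published table.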

3.6.2.1 Basic Box Plot

The following is a basic box plot showing the relationship between one continuous and one categorical variable.

# Box Plot
box_basic <- demographics_df %>% 
  ggplot() +
  aes(x = race, y = age, color = race) +
  labs(
    x = "Race",
    y = "Age",
    title = "Box Plot of Race and Age",
    color = "Race"
  ) +
  # Using width to adjust the width of the boxes
  geom_boxplot(width = 0.5) +
  # Removes the legend (the documented value is lowercase "none")
  theme(legend.position = "none")

box_basic
Figure 3.3

3.6.2.2 More Advanced Box Plot

With geom_boxplot(), you can also specify an additional categorical variable (different from your x and y variables) to break up your plot by that variable. For example, the following plot takes the previous plot of race and age and places the boxes for each gender (is_male) side by side.

# Advanced Box Plot
box_advanced <- demographics_df %>% 
  ggplot() +
  aes(x = race, y = age, color = is_male) +
  # changing the labels for is_male, and specifying the colors we want
  scale_color_manual(
    labels = c("No" = "Female", "Yes" = "Male"),
    values = c("darkorchid4", "darkolivegreen4")
  ) +
  labs(
    x = "Race",
    y = "Age",
    title = "Box Plot of Age by Race and Gender",
    color = "Gender"
  ) +
  # Using width to adjust the width of the boxes
  geom_boxplot(width = 0.5)

box_advanced
Figure 3.4
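Each box in a grouped box plot corresponds to one combination of the two categorical variables, so the medians it displays can be cross-checked with a grouped summary. The sketch below uses base R on a made-up data frame (the `toy` values are hypothetical, not CTN data).

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  race    = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  is_male = c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
  age     = c(30, 34, 40, 44, 25, 27, 50, 52)
)

# Median age for every race-by-gender combination: one row per box
medians <- aggregate(age ~ race + is_male, data = toy, FUN = median)

medians
```

Running the same aggregate() call on demographics_df would give the median line of each box in the grouped plot, keeping in mind that the formula interface drops rows with missing values.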

3.6.3 Violin Plot

To create a violin plot, first specify the data object within the ggplot() function. Then set the aesthetic mappings within the aes() or aesthetic layer. The geom_violin() layer draws the violin plot. An additional call to geom_boxplot() overlays box quartiles on the violin plot to display summary statistics.

The following aesthetics are understood by geom_violin():

  1. x: Specifies the categorical variable along the x-axis.
  2. y: Specifies the continuous variable along the y-axis.
  3. alpha: Specifies a variable to determine transparency.
  4. color: Assigns an outline color to respective levels of a specified categorical variable.
  5. fill: Assigns a fill color to respective levels of a specified categorical variable.
  6. group: Partitions data by a discrete variable when no other grouping variable is specified, or when the default grouping is not the one you intend.
  7. linetype: Specifies line type of violin plot.
  8. linewidth: Specifies line width of violin plot.
  9. weight: Specifies a weight variable.

The following violin plot shows age by race, with a narrow box plot overlaid inside each violin.

# Violin Plot
violin_basic <- demographics_df %>% 
  ggplot() +
  aes(x = race, y = age, color = race) +
  scale_color_manual(
    values = c("coral1", "darkgreen", "deepskyblue2", "darkorchid2")
  ) +
  labs(
    x = "Race",
    y = "Age",
    title = "Violin Plot of Race and Age",
    subtitle = "With Summary Information",
    color = "Race"
  ) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  # Removes the legend (the documented value is lowercase "none")
  theme(legend.position = "none")

violin_basic
Figure 3.5
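The outline of a violin is a kernel density estimate of the continuous variable, mirrored around the group axis. The sketch below computes such an estimate with base R's density() on a deliberately two-clustered, made-up sample (the values are hypothetical) and contrasts it with the median that a box plot would report.

```r
# Toy sample with two clusters (hypothetical values, for illustration only)
age <- c(
  rep(c(22, 24, 25, 26, 28), times = 4),
  rep(c(45, 47, 48, 49, 51), times = 4)
)

# Kernel density estimate with the default bandwidth
dens <- density(age)

# Local maxima of the curve: grid points higher than both neighbours
is_peak <- diff(sign(diff(dens$y))) == -2
peak_ages <- dens$x[which(is_peak) + 1]

# The median falls between the two clusters, where there are no observations
median(age)
peak_ages
```

A box plot of this sample would place its median line in the gap between the clusters, while the density curve makes both clusters visible. This is the kind of structure a violin plot is meant to expose.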

3.7 Brief Interpretation

3.7.1 Mosaic Plot

  • Compared to Project 27 and Project 51, Project 30 had the highest proportion of participants who indicated that their race is ‘White’.
  • Compared to Project 30 and Project 51, Project 27 had the highest proportion of participants who indicated that their race is ‘Other’.
  • Compared to Project 27 and Project 51, Project 30 had the lowest proportion of participants who indicated that their race is ‘Other’.

3.7.2 Box Plot

  • Participants who indicated that their race is ‘Black’ exhibited the highest median age, at approximately 45 years old.
  • Participants who indicated that their race is ‘White’ exhibited the lowest median age, at approximately 31 years old.

3.7.3 Violin Plot

The violin plot shows more clearly than the box plot that age is bimodal among participants whose race is ‘Black’ or ‘Other’. It also shows the skewness of age among ‘White’ participants. Specifically:

  • The age distribution of participants who indicated that their race is ‘White’ peaked in the mid-20s, whereas that of participants who indicated that their race is ‘Black’ peaked in the late 40s.
  • Participants who indicated that their race is ‘White’ had the lowest median age at approximately 31 years old, whereas participants who indicated that their race is ‘Black’ had the highest median age at approximately 45 years old.

3.8 Conclusion

This lesson discusses three types of plots: mosaic plots for categorical variables, and box and violin plots for a continuous variable by group. Figure 3.1 is a basic mosaic plot that shows race by CTN project. In Figure 3.2, we added a third variable to the visualization: gender. In the box plots, Figure 3.3 and Figure 3.4, we plotted age (continuous) by race and by race and gender, respectively. Finally, Figure 3.5 shows a violin plot with an overlaid box plot for age by race in the CTN projects.

Mosaic plots are useful for visualizing the relative frequencies of the combinations of two or more categorical variables. Box and violin plots can be used to visualize a continuous variable across the levels of one categorical variable or, as in the grouped box plot, two. Violin plots build on box plots by showing the shape of the distribution, which quickly reveals potential multimodality and skewness of a continuous variable across categories, as we saw in Figure 3.5.