5  Skimr Package

Author

Shelly Sinclair and Alvonee Penn

Published

July 30, 2024

5.1 Introduction

Skimr is an R package that provides summary statistics for variables in data frames, tibbles, data tables, and vectors. Its main function, skim(), is customizable: you can add summary statistics that are not part of the default summary() function in R. Skimr lets us quickly assess data quality by variable and data type in a compact report. This is a critical step in data exploration, where understanding our data helps us generate hypotheses and determine which analyses are appropriate.

This presentation will cover the simplest and most effective ways to explore data in R.

5.1.1 Packages

To begin, we will load the packages necessary for the lesson:

  • readxl to import our data file (readr, which is loaded with the tidyverse, provides read_csv() for CSV files)
  • knitr, which houses the kable() function for constructing and customizing tables
  • tidyverse, which includes the dplyr package and assists with data manipulation and visualization
  • skimr, which provides a compact summary of the variables in a dataset
# install.packages("skimr")
# install.packages("knitr")
# install.packages("readxl")
# install.packages("tidyverse")

# load all the packages we will need to analyze the data and use the skim
#   function
library(skimr)
library(knitr)
library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

5.1.2 Census Data

For this assignment we will use the Census_2010 dataset. No codebook accompanies the data, which makes it difficult to describe the variables precisely. The data record United States population estimates for the years 2010-2015, along with related variables such as net population change, number of births, number of deaths, and international and domestic migration. The data frame contains 3,193 observations and 100 variables.

The data can be imported into R from the following link: https://fiudit-my.sharepoint.com/:x:/g/personal/ssinc013_fiu_edu/ESK1A13PstVGtf7HUwNNt68Bnh1YPfH8L-hnvMUxjBuCVw?e=CCwQU9

# import the data
# census_2010 <- read_csv("Data/census_2010.csv")
census_2010 <- readxl::read_xlsx("../data/01_census_2010.xlsx")

# what are the variables
colnames(census_2010) %>% 
  head(n = 10)
 [1] "SUMLEV"            "REGION"            "DIVISION"         
 [4] "STATE"             "COUNTY"            "STNAME"           
 [7] "CTYNAME"           "CENSUS2010POP"     "ESTIMATESBASE2010"
[10] "POPESTIMATE2010"  

5.2 The summary() Function

The closest base R counterpart to skim() is summary(), which quickly summarizes the values in a data frame or vector.

The following code applies summary() to a column of our data set and to a vector:

#| label: Summary-syntax-with-data

# Example using summary function with data
summary(census_2010$CENSUS2010POP)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
      82    11299    26424   193387    71404 37253956 
# Example using summary function with vector
# Define vector
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)

# Summarize values in vector
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3.00    6.50   10.50   12.75   20.25   26.00       1 

For a numeric vector, the summary() function automatically calculates the minimum, the 1st quartile (25th percentile), the median, the mean, the 3rd quartile (75th percentile), and the maximum. If the vector contains missing values (NA), summary() excludes them when calculating these statistics and reports how many there were in the NA's column.
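
The NA handling described above differs from most base R summary functions, which return NA unless told otherwise. A minimal sketch, reusing the vector x from the example (the object names are illustrative):

```r
# Same vector as above, including one missing value
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)

# mean() propagates the NA by default
mean_default <- mean(x)

# na.rm = TRUE drops the NA first, matching what summary() reports
mean_no_na <- mean(x, na.rm = TRUE)
median_no_na <- median(x, na.rm = TRUE)

# count the missing values, as summary() does in its NA's column
n_missing <- sum(is.na(x))

print(c(mean = mean_no_na, median = median_no_na, n_missing = n_missing))
```

The values agree with the summary(x) output shown earlier: a mean of 12.75, a median of 10.50, and one NA.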

Now, let’s see how skim() compares.

5.3 Skimr Package

The skim() function generates a summary of the variables in your dataset, including their data type, number of missing values, completion rate, minimum and maximum values, median, mean, standard deviation, and more (Waring et al. 2022).

The following code confirms that the object returned by skim() is a skim data frame (skim_df), which is what the other Skimr functions expect.

# is the summary data a skimr dataframe
skim(census_2010) %>% 
  is_skim_df() # TRUE
[1] TRUE
attr(,"message")
character(0)

We can explore the data as a tibble:

# use skim to get descriptive statistics of the data
skim(census_2010) %>% 
  head(n = 10)
Data summary
Name census_2010
Number of rows 3193
Number of columns 100
_______________________
Column type frequency:
character 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME                 0              1    4   20      0        51           0
CTYNAME                0              1    4   33      0      1927           0

Variable type: numeric

skim_variable      n_missing  complete_rate       mean          sd  p0    p25    p50    p75      p100  hist
SUMLEV                     0              1      49.84        1.25  40     50     50     50        50  ▁▁▁▁▇
REGION                     0              1       2.67        0.81   1      2      3      3         4  ▁▆▁▇▂
DIVISION                   0              1       5.19        1.97   1      4      5      7         9  ▂▇▅▆▃
STATE                      0              1      30.26       15.15   1     18     29     45        56  ▃▇▆▆▇
COUNTY                     0              1     101.92      107.63   0     33     77    133       840  ▇▁▁▁▁
CENSUS2010POP              0              1  193387.05  1176201.45  82  11299  26424  71404  37253956  ▇▁▁▁▁
ESTIMATESBASE2010          0              1  193396.87  1176244.25  82  11299  26446  71491  37254503  ▇▁▁▁▁
POPESTIMATE2010            0              1  193765.65  1178710.28  83  11275  26467  71721  37334079  ▇▁▁▁▁

Compared with summary(), skim() provides a cleaner and more detailed display of the results. In this example we show the first ten variables in our data set. The Data summary section reports the number of rows and columns, the frequency of each column type, and any group variables. The tables for each variable type add descriptive information such as the number of missing values, the completion rate, and, for character variables, the number of unique values.

This information is relevant for data cleaning as well as for understanding the distribution of each variable. Both are critical for determining which statistical analyses are most appropriate for a project.

5.4 Other Skimr Features

5.4.1 Separate dataframes by type

The data frames produced by skim() are wide and sparse, filled with columns that are mostly NA. For that reason, it can be convenient to work with “by type” subsets of the original data frame. These smaller subsets have their NA columns removed.

Features:

  • partition() - Creates a list of smaller data frames. Each entry in the list holds the summary for one data type from the original skim data frame.
  • bind() - Takes the list and rebuilds the original skim data frame.
  • yank() - Extracts the subtable for one particular type from a skim data frame.

The following code uses partition() to split the skim summary of census_2010 by variable type.

# split the character and numeric data
separate_df <- partition(skim(census_2010))
# check only the character data
separate_df$character

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME                 0              1    4   20      0        51           0
CTYNAME                0              1    4   33      0      1927           0
# create summary statistics for only numeric variables
numeric_separate_df <- separate_df[2]
# pull out the desired summary statistics in the nested list
head(numeric_separate_df$numeric["mean"]) %>% 
  kable(digits = 1) 
mean
49.8
2.7
5.2
30.3
101.9
193387.1

The following code uses bind() to combine the smaller character and numeric data frames back into a single skim data frame.

# combine the character and numeric data
head(bind(separate_df))
Data summary
Name census_2010
Number of rows 3193
Number of columns 100
_______________________
Column type frequency:
character 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME                 0              1    4   20      0        51           0
CTYNAME                0              1    4   33      0      1927           0

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
SUMLEV                 0              1  49.84   1.25  40   50   50   50    50  ▁▁▁▁▇
REGION                 0              1   2.67   0.81   1    2    3    3     4  ▁▆▁▇▂
DIVISION               0              1   5.19   1.97   1    4    5    7     9  ▂▇▅▆▃
STATE                  0              1  30.26  15.15   1   18   29   45    56  ▃▇▆▆▇
# confirm that the bound table is the same as the original skimmed table
identical(bind(separate_df), skim(census_2010)) 
[1] TRUE

The following code uses yank() to extract the subtable for a specific type (e.g., character) so that it can be examined on its own.

# Extract character data
yank(skim(census_2010), "character")

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME                 0              1    4   20      0        51           0
CTYNAME                0              1    4   33      0      1927           0
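
Because the census file may not be at hand, the three helpers above can also be tried on a dataset that ships with R. The following is a minimal sketch using the built-in iris data; the object names (skimmed, parts, numeric_part, rebuilt) are illustrative and not part of the lesson:

```r
library(skimr)

# iris ships with R, so no data file is needed
skimmed <- skim(iris)

# partition(): one smaller data frame per column type
parts <- partition(skimmed)
names(parts)

# yank(): pull out the subtable for a single type directly
numeric_part <- yank(skimmed, "numeric")
numeric_part

# bind(): rebuild the full summary table from the parts
rebuilt <- bind(parts)
nrow(rebuilt) == nrow(skimmed)
```

Here iris has four numeric columns and one factor column, so the partitioned list has one entry for each of those two types.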

5.4.2 Skimr with Dplyr

Skimr functions can be combined with dplyr functions to examine specific variables within the census dataset.

The first example below uses skim() with filter() to display the summary for the variable CENSUS2010POP. The second uses select() to keep only the data type and variable name columns.

# use dplyr functions on the statistics summary table
census_filter <- skim(census_2010) %>% 
  filter(skim_variable == "CENSUS2010POP")
census_filter
Data summary
Name census_2010
Number of rows 3193
Number of columns 100
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate    mean       sd  p0    p25    p50    p75      p100  hist
CENSUS2010POP          0              1  193387  1176201  82  11299  26424  71404  37253956  ▇▁▁▁▁
census_select <- skim(census_2010) %>% 
  select(skim_type, skim_variable)
head(census_select)
# A tibble: 6 × 2
  skim_type skim_variable
  <chr>     <chr>        
1 character STNAME       
2 character CTYNAME      
3 numeric   SUMLEV       
4 numeric   REGION       
5 numeric   DIVISION     
6 numeric   STATE        
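
The "Group variables" line in the data summary hints at one more dplyr combination: skim() also works on grouped data frames. The sketch below uses the built-in iris data rather than the census file, so the column names (Species, Sepal.Length) are illustrative only:

```r
library(dplyr)
library(skimr)

# summarize one numeric variable separately for each group
grouped_summary <- iris %>%
  select(Species, Sepal.Length) %>%
  group_by(Species) %>%
  skim()

grouped_summary
```

The result has one row per variable and group, with the grouping variable added as its own column.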

You can also narrow the output of skim() by naming the columns to summarize after the data argument. For example, skim(census_2010, CENSUS2010POP) summarizes a single variable, and selection helpers such as starts_with() pick several variables at once.
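
As a minimal sketch of restricting skim() to particular columns, the example below passes column selections after the data argument. It uses the built-in iris data so that it runs without the census file; the object names are illustrative:

```r
library(dplyr)   # provides the starts_with() selection helper
library(skimr)

# summarize a single column by name
one_column <- skim(iris, Sepal.Length)

# summarize every column whose name starts with "Sepal"
sepal_columns <- skim(iris, starts_with("Sepal"))

sepal_columns$skim_variable
```

Selecting columns this way avoids summarizing all 100 census variables when only a few are of interest.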

Using skim() in combination with mutate(), we will compute a new variable and include it in our skim data frame.

# create a new variable: the change in the number of births from 2010 to 2011
census_2010 %>% 
  # new variable
  mutate(net_birth = BIRTHS2011 - BIRTHS2010) %>% 
  # move the new variable so that it sits right after CENSUS2010POP
  relocate(net_birth, .after = CENSUS2010POP) %>% 
  # summary statistics table
  skim() %>% 
  # only the first fifteen variables
  head(n = 15) %>% 
  # change the formatting 
  kable(digits = 2)
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character STNAME 0 1 4 20 0 51 0 NA NA NA NA NA NA NA NA
character CTYNAME 0 1 4 33 0 1927 0 NA NA NA NA NA NA NA NA
numeric SUMLEV 0 1 NA NA NA NA NA 49.84 1.25 40 50 50 50 50 ▁▁▁▁▇
numeric REGION 0 1 NA NA NA NA NA 2.67 0.81 1 2 3 3 4 ▁▆▁▇▂
numeric DIVISION 0 1 NA NA NA NA NA 5.19 1.97 1 4 5 7 9 ▂▇▅▆▃
numeric STATE 0 1 NA NA NA NA NA 30.26 15.15 1 18 29 45 56 ▃▇▆▆▇
numeric COUNTY 0 1 NA NA NA NA NA 101.92 107.63 0 33 77 133 840 ▇▁▁▁▁
numeric CENSUS2010POP 0 1 NA NA NA NA NA 193387.05 1176201.45 82 11299 26424 71404 37253956 ▇▁▁▁▁
numeric net_birth 0 1 NA NA NA NA NA 1870.12 11792.85 -3 96 232 639 386443 ▇▁▁▁▁
numeric ESTIMATESBASE2010 0 1 NA NA NA NA NA 193396.87 1176244.25 82 11299 26446 71491 37254503 ▇▁▁▁▁
numeric POPESTIMATE2010 0 1 NA NA NA NA NA 193765.65 1178710.28 83 11275 26467 71721 37334079 ▇▁▁▁▁
numeric POPESTIMATE2011 0 1 NA NA NA NA NA 195251.40 1189647.76 90 11277 26417 72387 37700034 ▇▁▁▁▁
numeric POPESTIMATE2012 0 1 NA NA NA NA NA 196744.52 1200508.37 81 11195 26362 72496 38056055 ▇▁▁▁▁
numeric POPESTIMATE2013 0 1 NA NA NA NA NA 198200.69 1211123.45 89 11180 26519 72222 38414128 ▇▁▁▁▁
numeric POPESTIMATE2014 0 1 NA NA NA NA NA 199754.09 1222669.36 87 11121 26483 72257 38792291 ▇▁▁▁▁

5.4.3 Adding Variables

As mentioned, skim() displays a default set of statistics; however, you can use skim_with() to change the summary statistics that it returns.

skim_with() is a closure: a function that returns a new function. This lets you have several skimming functions in a single R session, but it also means that you need to assign the return value of skim_with() before you can use it.

You assign values within skim_with() by using the sfl() helper (skimr function list). An sfl lists the skimming functions you want to add, and it can also remove an existing function when you set it to NULL. Assign an sfl to each column type that you wish to modify.

Two other arguments of skim_with() are worth knowing:

  • base - An sfl that sets skimmers for all column types.
  • append - Whether the provided options should be in addition to the defaults already in skim. Default is TRUE.

For example, we will add the following statistics to the summary table: median, min, max, and IQR for numeric variables, and length for character variables.

my_skim <- skim_with(
  numeric = sfl(median, min, max, IQR),
  character = sfl(length), 
  append = TRUE
)

# add new variables into the summary table
census_2010 %>% 
  my_skim() %>% 
  head(n = 10)
Data summary
Name Piped data
Number of rows 3193
Number of columns 100
_______________________
Column type frequency:
character 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace  length
STNAME                 0              1    4   20      0        51           0    3193
CTYNAME                0              1    4   33      0      1927           0    3193

Variable type: numeric

skim_variable      n_missing  complete_rate       mean          sd  p0    p25    p50    p75      p100  hist   median  min       max    IQR
SUMLEV                     0              1      49.84        1.25  40     50     50     50        50  ▁▁▁▁▇      50   40        50      0
REGION                     0              1       2.67        0.81   1      2      3      3         4  ▁▆▁▇▂       3    1         4      1
DIVISION                   0              1       5.19        1.97   1      4      5      7         9  ▂▇▅▆▃       5    1         9      3
STATE                      0              1      30.26       15.15   1     18     29     45        56  ▃▇▆▆▇      29    1        56     27
COUNTY                     0              1     101.92      107.63   0     33     77    133       840  ▇▁▁▁▁      77    0       840    100
CENSUS2010POP              0              1  193387.05  1176201.45  82  11299  26424  71404  37253956  ▇▁▁▁▁   26424   82  37253956  60105
ESTIMATESBASE2010          0              1  193396.87  1176244.25  82  11299  26446  71491  37254503  ▇▁▁▁▁   26446   82  37254503  60192
POPESTIMATE2010            0              1  193765.65  1178710.28  83  11275  26467  71721  37334079  ▇▁▁▁▁   26467   83  37334079  60446
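
The example above only adds statistics. As a minimal sketch of the other two options, removing a default with NULL and replacing the defaults with append = FALSE, the code below uses the built-in iris data. The skimmer names (skim_no_hist, skim_minimal) and the statistic name iqr are illustrative choices, not part of the lesson:

```r
library(skimr)

# drop the histogram and add the interquartile range for numeric columns
skim_no_hist <- skim_with(numeric = sfl(hist = NULL, iqr = IQR))
no_hist <- skim_no_hist(iris)
names(no_hist)

# append = FALSE replaces the numeric defaults instead of adding to them
skim_minimal <- skim_with(
  numeric = sfl(mean = mean, sd = sd),
  append = FALSE
)
minimal <- skim_minimal(iris)
names(minimal)
```

Naming each function inside sfl(), as in iqr = IQR, controls the column name that appears in the summary table.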

5.5 Conclusion

Overall, Skimr is a useful package for quickly summarizing the variables in a dataset and gaining insights into its structure and content.

5.6 References

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/.