5 Skimr Package

Author

Shelly Sinclair and Alvonee Penn

Published

July 30, 2024

5.1 Introduction

Skimr is an R package that provides summary statistics about variables in data frames, tibbles, data tables, and vectors. Its main function, skim(), is customizable: you can add summary statistics that are not part of the default summary() function in R. Skimr lets us quickly assess data quality by variable and type in a single compact report. This is a critical step in data exploration, where understanding our data helps us generate hypotheses and determine which data analyses are appropriate.

This presentation will cover the simplest and most effective ways to explore data in R.

5.1.1 Packages

To begin, we will load the packages necessary for the lesson:

readxl to import our data file (an Excel workbook).
knitr, which houses the kable() function for constructing and customizing tables.
tidyverse, which includes dplyr for data manipulation and ggplot2 for visualization.
skimr, which provides a compact summary of the variables in a dataset.

# install.packages("skimr")
# install.packages("knitr")
# install.packages("readxl")
# install.packages("tidyverse")

# load all the packages we will need to analyze the data and use the skim
#   function
library(skimr)
library(knitr)
library(readxl)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

5.1.2 Census Data

For this assignment we will be using the Census_2010 dataset. There is no code book associated with the data, which makes it difficult to describe the variables accurately. The data record United States population estimates for the years 2010-2015, along with related variables such as net population change, number of births, number of deaths, and international and domestic migration. The data frame contains 3,193 observations and 100 variables.

The data can be imported into R from the following link: https://fiudit-my.sharepoint.com/:x:/g/personal/ssinc013_fiu_edu/ESK1A13PstVGtf7HUwNNt68Bnh1YPfH8L-hnvMUxjBuCVw?e=CCwQU9

# import the data
# census_2010 <- read_csv("Data/census_2010.csv")
census_2010 <- readxl::read_xlsx("../data/01_census_2010.xlsx")

# what are the variables
colnames(census_2010) %>% 
  head(n = 10)

 [1] "SUMLEV"            "REGION"            "DIVISION"         
 [4] "STATE"             "COUNTY"            "STNAME"           
 [7] "CTYNAME"           "CENSUS2010POP"     "ESTIMATESBASE2010"
[10] "POPESTIMATE2010"

5.2 The summary() Function

In base R, the function most similar to skim() is summary(). The summary() function can be used to quickly summarize the values in a data frame or vector.

This syntax shows examples of the summary function using both our data set, and a vector:

# Example using summary function with data
summary(census_2010$CENSUS2010POP)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
      82    11299    26424   193387    71404 37253956

# Example using summary function with vector
# Define vector
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)

# Summarize values in vector
summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3.00    6.50   10.50   12.75   20.25   26.00       1

The summary() function automatically calculates the minimum value, the 1st quartile (25th percentile), the median, the mean, the 3rd quartile (75th percentile), and the maximum value. If the vector contains missing values (NA), summary() excludes them when calculating these statistics and reports how many there were under NA's.
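To make the handling of missing values concrete, the sketch below reproduces the summary() output for the same vector by hand using base R only. The object names manual and n_missing are illustrative and not part of the original lesson.

```r
# Same vector as above, with one missing value
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)

# Without na.rm = TRUE, a single NA makes the mean NA
mean(x)

# summary() drops the NA before computing; reproduce that explicitly
manual <- c(
  quantile(x, probs = c(0, 0.25, 0.5), na.rm = TRUE),
  mean(x, na.rm = TRUE),
  quantile(x, probs = c(0.75, 1), na.rm = TRUE)
)
names(manual) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
manual

# The missing values are counted separately
n_missing <- sum(is.na(x))
n_missing
```

The values in manual should match the six statistics printed by summary(x), and n_missing should match the NA's entry.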

Now, let’s see how skim() compares.

5.3 Skimr Package

The skim() function generates a summary of the variables in your dataset, including their data type, number of missing values, completion rate, minimum and maximum values, median, mean, standard deviation, and more (Waring et al. 2022).

The following syntax confirms that the object returned by skim() is a skim data frame (skim_df), the format that the other Skimr functions expect.

# is the summary data a skimr dataframe
skim(census_2010) %>% 
  is_skim_df() # TRUE

[1] TRUE
attr(,"message")
character(0)

We can explore the data as a tibble:

# use skim to get descriptive statistics of the data
skim(census_2010) %>% 
  head(n = 10)

Data summary
Name	census_2010
Number of rows	3193
Number of columns	100
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STNAME	0	1	4	20	0	51	0
CTYNAME	0	1	4	33	0	1927	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SUMLEV	1	49.84	1.25	40	50	50	50	50	▁▁▁▁▇
REGION	1	2.67	0.81	1	2	3	3	4	▁▆▁▇▂
DIVISION	1	5.19	1.97	1	4	5	7	9	▂▇▅▆▃
STATE	1	30.26	15.15	1	18	29	45	56	▃▇▆▆▇
COUNTY	1	101.92	107.63	0	33	77	133	840	▇▁▁▁▁
CENSUS2010POP	1	193387.05	1176201.45	82	11299	26424	71404	37253956	▇▁▁▁▁
ESTIMATESBASE2010	1	193396.87	1176244.25	82	11299	26446	71491	37254503	▇▁▁▁▁
POPESTIMATE2010	1	193765.65	1178710.28	83	11275	26467	71721	37334079	▇▁▁▁▁

Compared with summary(), skim() provides a cleaner and more detailed display of the results. In this example we show the first ten variables in our data set. The data summary section at the top shows the number of rows and columns, the column type frequency, and any group variables. Each variable type then gets its own table with additional descriptive information, such as the number of missing values and the number of unique values.

This information is relevant for data cleaning as well as for understanding each variable's distribution. Both are critical for determining which statistical analyses are most appropriate for a project.
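As a small illustration of using the skim() output for data cleaning, the sketch below lists only the variables that contain missing values. It assumes the built-in airquality data rather than the census file (which has no missing values in the columns shown), and the name missing_report is illustrative.

```r
library(skimr)
library(dplyr)

# airquality ships with R and contains missing values
missing_report <- skim(airquality) %>%
  # convert to a plain tibble so ordinary dplyr verbs apply
  tibble::as_tibble() %>%
  # keep only the variables with at least one missing value
  filter(n_missing > 0) %>%
  # keep the columns that describe completeness
  select(skim_variable, n_missing, complete_rate)

missing_report
```

A report like this points directly at the variables that need attention before any analysis is chosen.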

5.4 Other Skimr Features

5.4.1 Separate dataframes by type

The data frames produced by skim() are wide and sparse, filled with columns that are mostly NA. For that reason, it can be convenient to work with “by type” subsets of the original data frame. These smaller subsets have their NA columns removed.

Features:

partition() - Creates a list of smaller data frames. Each entry in the list holds the summary for one data type from the original skim data frame.
bind() - Takes the list and rebuilds the original skim data frame.
yank() - Extracts the subtable for a particular type from a skim data frame.

The following syntax uses partition() to split the skim summary of census_2010 by variable type.

# split the character and numeric data
separate_df <- partition(skim(census_2010))
# check only the character data
separate_df$character

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STNAME	0	1	4	20	0	51	0
CTYNAME	0	1	4	33	0	1927	0

# create summary statistics for only numeric variables
numeric_separate_df <- separate_df[2]
# pull out the desired summary statistics in the nested list
head(numeric_separate_df$numeric["mean"]) %>% 
  kable(digits = 1)

mean
49.8
2.7
5.2
30.3
101.9
193387.1

The following syntax uses bind() to combine the smaller character and numeric tables back into a single skim data frame.

# combine the character and numeric data
head(bind(separate_df))

Data summary
Name	census_2010
Number of rows	3193
Number of columns	100
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STNAME	0	1	4	20	0	51	0
CTYNAME	0	1	4	33	0	1927	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
SUMLEV	1	49.84	1.25	40	50	50	50	50	▁▁▁▁▇
REGION	1	2.67	0.81	1	2	3	3	4	▁▆▁▇▂
DIVISION	1	5.19	1.97	1	4	5	7	9	▂▇▅▆▃
STATE	1	30.26	15.15	1	18	29	45	56	▃▇▆▆▇

# confirm that the bound table is the same as the original skimmed table
identical(bind(separate_df), skim(census_2010))

[1] TRUE

The following syntax uses yank() to extract a single table, e.g. the character table, to examine.

# Extract character data
yank(skim(census_2010), "character")

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STNAME	0	1	4	20	0	51	0
CTYNAME	0	1	4	33	0	1927	0
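The three helpers can also be tried without the census file. The sketch below assumes the built-in iris data, which has numeric and factor columns; the object names skimmed, parts, rebuilt, and factor_summary are illustrative.

```r
library(skimr)

skimmed <- skim(iris)

# partition(): one smaller table per column type
parts <- partition(skimmed)
names(parts)

# bind(): rebuild the full summary from the parts
rebuilt <- bind(parts)
identical(rebuilt, skimmed)

# yank(): pull out the table for a single type directly
factor_summary <- yank(skimmed, "factor")
factor_summary
```

Because iris contains a factor rather than a character column, the type name passed to yank() is "factor" here instead of "character".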

5.4.2 Skimr with Dplyr

Skimr functions can be combined with dplyr functions to examine specific variables within the census dataset.

The following example uses skim() with filter() to display only the variable CENSUS2010POP. A second example uses select() to keep only the data type and name of each variable.

# use dplyr functions on the statistics summary table
census_filter <- skim(census_2010) %>% 
  filter(skim_variable == "CENSUS2010POP")
census_filter

Data summary
Name	census_2010
Number of rows	3193
Number of columns	100
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
CENSUS2010POP	0	1	193387	1176201	82	11299	26424	71404	37253956	▇▁▁▁▁

census_select <- skim(census_2010) %>% 
  select(skim_type, skim_variable)
head(census_select)

# A tibble: 6 × 2
  skim_type skim_variable
  <chr>     <chr>        
1 character STNAME       
2 character CTYNAME      
3 numeric   SUMLEV       
4 numeric   REGION       
5 numeric   DIVISION     
6 numeric   STATE

You can also limit skim() to specific variables by listing them after the data argument, using the same selection syntax as select(). The summary statistics themselves are customized with skim_with(), described in Section 5.4.3.

Using skim() in combination with mutate(), we will compute a new variable and include it in our skim data frame.
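As a minimal sketch of narrowing the report, skim() accepts column names or tidyselect helpers after the data argument. The example assumes the built-in iris data rather than the census file, and the names two_cols and sepal_cols are illustrative.

```r
library(skimr)
library(dplyr)

# skim only two named columns
two_cols <- skim(iris, Sepal.Length, Species)
two_cols

# skim every column whose name starts with "Sepal"
sepal_cols <- skim(iris, starts_with("Sepal"))
sepal_cols
```

Selecting columns inside skim() avoids summarizing all 100 census variables when only a few are of interest.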

# create a new variable: the change in the number of births from 2010 to 2011
census_2010 %>% 
  # new variable
  mutate(net_birth = BIRTHS2011 - BIRTHS2010) %>% 
  # move the new variable to just after CENSUS2010POP
  relocate(net_birth, .after = CENSUS2010POP) %>% 
  # summary statistics table
  skim() %>% 
  # only the first fifteen variables
  head(n = 15) %>% 
  # round the output to two decimal places
  kable(digits = 2)

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	STNAME	1	4	20	0	51	0	NA	NA	NA	NA	NA	NA	NA	NA
character	CTYNAME	1	4	33	0	1927	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	SUMLEV	1	NA	NA	NA	NA	NA	49.84	1.25	40	50	50	50	50	▁▁▁▁▇
numeric	REGION	1	NA	NA	NA	NA	NA	2.67	0.81	1	2	3	3	4	▁▆▁▇▂
numeric	DIVISION	1	NA	NA	NA	NA	NA	5.19	1.97	1	4	5	7	9	▂▇▅▆▃
numeric	STATE	1	NA	NA	NA	NA	NA	30.26	15.15	1	18	29	45	56	▃▇▆▆▇
numeric	COUNTY	1	NA	NA	NA	NA	NA	101.92	107.63	0	33	77	133	840	▇▁▁▁▁
numeric	CENSUS2010POP	1	NA	NA	NA	NA	NA	193387.05	1176201.45	82	11299	26424	71404	37253956	▇▁▁▁▁
numeric	net_birth	1	NA	NA	NA	NA	NA	1870.12	11792.85	-3	96	232	639	386443	▇▁▁▁▁
numeric	ESTIMATESBASE2010	1	NA	NA	NA	NA	NA	193396.87	1176244.25	82	11299	26446	71491	37254503	▇▁▁▁▁
numeric	POPESTIMATE2010	1	NA	NA	NA	NA	NA	193765.65	1178710.28	83	11275	26467	71721	37334079	▇▁▁▁▁
numeric	POPESTIMATE2011	1	NA	NA	NA	NA	NA	195251.40	1189647.76	90	11277	26417	72387	37700034	▇▁▁▁▁
numeric	POPESTIMATE2012	1	NA	NA	NA	NA	NA	196744.52	1200508.37	81	11195	26362	72496	38056055	▇▁▁▁▁
numeric	POPESTIMATE2013	1	NA	NA	NA	NA	NA	198200.69	1211123.45	89	11180	26519	72222	38414128	▇▁▁▁▁
numeric	POPESTIMATE2014	1	NA	NA	NA	NA	NA	199754.09	1222669.36	87	11121	26483	72257	38792291	▇▁▁▁▁

5.4.3 Adding Variables

As mentioned, skim() displays a default set of statistics. You can change the summary statistics that are returned by building a customized skimming function with skim_with().

skim_with() is a closure: a function that returns a new function. This lets you have several skimming functions in a single R session, but it also means that you need to assign the return value of skim_with() before you can use it.

You assign values within skim_with() by using the sfl() helper (skimr function list). It lists the skimming functions you want to add, and it also identifies the ones you want to remove, by setting them to NULL. Assign an sfl to each column type that you wish to modify.

In addition to the per-type arguments, skim_with() accepts:

base - An sfl that sets skimmers for all column types.
append - Whether the provided options should be in addition to the defaults already in skim. Default is TRUE.

For example, we will add the following statistics to the summary: median, min, max, and IQR for numeric variables, and length for character variables.

my_skim <- skim_with(
  numeric = sfl(median, min, max, IQR),
  character = sfl(length), 
  append = TRUE
)

# add the new statistics to the summary table
census_2010 %>% 
  my_skim() %>% 
  head(n = 10)

Data summary
Name	Piped data
Number of rows	3193
Number of columns	100
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace	length
STNAME	0	1	4	20	0	51	0	3193
CTYNAME	0	1	4	33	0	1927	0	3193

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist	median	min	max	IQR
SUMLEV	1	49.84	1.25	40	50	50	50	50	▁▁▁▁▇	50	40	50	0
REGION	1	2.67	0.81	1	2	3	3	4	▁▆▁▇▂	3	1	4	1
DIVISION	1	5.19	1.97	1	4	5	7	9	▂▇▅▆▃	5	1	9	3
STATE	1	30.26	15.15	1	18	29	45	56	▃▇▆▆▇	29	1	56	27
COUNTY	1	101.92	107.63	0	33	77	133	840	▇▁▁▁▁	77	0	840	100
CENSUS2010POP	1	193387.05	1176201.45	82	11299	26424	71404	37253956	▇▁▁▁▁	26424	82	37253956	60105
ESTIMATESBASE2010	1	193396.87	1176244.25	82	11299	26446	71491	37254503	▇▁▁▁▁	26446	82	37254503	60192
POPESTIMATE2010	1	193765.65	1178710.28	83	11275	26467	71721	37334079	▇▁▁▁▁	26467	83	37334079	60446
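The same mechanism can remove or replace default statistics. The sketch below assumes the built-in iris data; the function names no_hist_skim and minimal_skim are illustrative. It drops the inline histogram by setting it to NULL, and then uses append = FALSE so that only the listed numeric statistics are computed.

```r
library(skimr)

# drop the inline histogram and add a named IQR column for numeric variables
no_hist_skim <- skim_with(numeric = sfl(hist = NULL, iqr = IQR))
no_hist <- no_hist_skim(iris)
names(no_hist)

# append = FALSE: replace the numeric defaults instead of extending them
minimal_skim <- skim_with(
  numeric = sfl(mean = mean, sd = sd),
  append = FALSE
)
minimal <- minimal_skim(iris)
names(minimal)
```

Naming each entry in sfl() (for example iqr = IQR) controls the column name that appears in the summary table.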

5.5 Conclusion

Overall, Skimr is a useful package for quickly summarizing the variables in a dataset and gaining insights into its structure and content.

5.6 References

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/.