4  How to Create a Scatterplot

Author
Affiliations

Patrice Lewis, Julissa Martinez, & Ivan Pachon

Florida International University

Robert Stempel College of Public Health and Social Work

library(tidyverse)
# Contains colour palette for ggplot
library(viridis)
# Gapminder dataset
library(dslabs)

4.1 Introduction to Scatterplots

Scatterplots display the relationship between two variables using dots to represent the values for each numeric variable. This presentation will examine the relationship between GDP per capita and Fertility over time using ggplot with facets.

Hypothesis: A negative relationship exists between GDP per capita and fertility i.e. as GDP per capita increases, fertility decreases.

4.2 Gapminder data description

Data was obtained from the dslabs package and comes from Gapminder a Swedish non-profit organization. The Gapmidner data set has health and income outcomes for 184 countries from 1960 to 2016. Gapminder aims to promote a fact-based worldview by providing accessible and understandable global development data. The dataset covers a wide range of variables, including economic, social, and health-related indicators like GDP, infant mortality, life expectancy, fertility, as well as population, making it a valuable resource for understanding global trends and patterns over time. Countries and territories with missing information were not excluded from the data set as the lack of information can also be looked into and shed light on why data was not collected or provided. To determine whether a country’s health and income outcomes are influenced by population sizes and GDP per capita, the data will be used to create a series of graphs to view different trends.

4.3 Cleaning the data to create a model data frame

A tibble was created from the gapminder dataset, and a new column was created to measure GDP per capita. Overall, using tibbles enhances the readability, usability, and compatibility of your code within the tidyverse ecosystem.

# Creating gapminder dataset\tibble
gapminder_df <-
  as_tibble(gapminder) %>%
  mutate(gdp_per_capita = gdp / population)

4.4 Components of ggplot2

ggplot2 is a package used to create graphs and visualize data. The main three components of ggplot2 are the data, aesthetics and geom layers.

  • The data layer - states what data will be used to graph

  • The aesthetics layer - specifies the variables that are being mapped

  • The geom layer - specifies the type of graph to be produced

4.5 Code to run interpretable scatterplots and create facets

In order to create a scatter-plot using ggplot, you must specify what data you will be using, state which variables will be mapped and how under aesthetics. What differentiates the scatter-plot from any other type of graph will be specified under the geom layer. For the scatter-plot, geom_point will be used.

In this example, we will analyze the relationship between fertility rates and gdp per capita for each country in 2011.

fig_bubble_2011 <-
  ggplot(data = filter(gapminder_df, year == 2011)) +
  aes(x = gdp_per_capita, y = fertility) +
  geom_point()

fig_bubble_2011 
Figure 4.1: Association between fertility rates and gdp per capita for each country in 2011

In the example above, we have mapped out fertility as our y-axis and gdp per capita as our x-axis. However, at it’s very basic level, there is not enough information provided to accurately analyze the relationship between the two. For this reason, we can add additional layers that will provide more information to properly analyze the scatter-plot.

fig_bubble_pretty_2011 <-
  ggplot(data = filter(gapminder_df, year == 2011)) +
  aes(
    x = gdp_per_capita,
    y = fertility,
    # will change the size of the point based on population size 
    size = population, 
    # will assign colors based on the continent the country is in 
    color = continent
  ) +
  # gives a range as to how big or small the points of population should be
  scale_size(range = c(1, 20)) + 
  # removes N/A from the legend and titles it Continent 
  scale_colour_discrete(na.translate = F, name = "Continent") +
  # removes population size from the legend 
  guides(size = "none") +
  scale_x_continuous(
    name = "GDP per Capita",
    trans = "log10",
    # transforms numbers from scientific notation to regular number 
    labels = scales::comma
  ) +
  labs(
    title = "Fertility rate descreases as GDP per capita increases in 2011",
    y = "Fertility rates",
    caption = "Source: Gapminder"
  ) +
  # the ylim was set based on the fertility, lowest was near 1 & highest was above 7
  ylim(1.2, 8.0) +
  # alpha increases transparency of the points to ensure they can all be seen
  geom_point(alpha = 0.5) 

fig_bubble_pretty_2011
Figure 4.2: Association between fertility rates and gdp per capita for each country, grouped by continent, in 2011

Figure 4.2 builds on the previous scatterplot of Fertility Rates (y axis) against GDP per capita (x axis) for 2011. The bubble size depicts respective country populations, and continents are coded by colors according to the key. This figure displays a negative relationship between GDP per capita and Fertility Rates. It supports the Hypothesis which states that as GDP per capita increases, Fertility Rates decreases. This trend can be confirmed for all continents, however, the degree to which fertility rates drop between continents varies. Most European country appear below a fertility rate of 2 babies per woman. The Americas appear to follow closely behind (under 4), followed by Oceania and Asia. A significant number of African countries still maintained higher fertility rates with lower GDP per capita for 2011.

This is an example of wanting to create four separate graphs to see the relationship between fertility rates and GDP per capita based on the years 1960, 1975, 1990 and 2005. In this example we omitted the facet argument.

fig_bubble_multiple <-
  ggplot(data = filter(gapminder_df, year %in% c(1960, 1975, 1990, 2005))) +
  aes(
    x = gdp_per_capita,
    y = fertility,
    size = population,
    color = continent
  ) +
  scale_x_continuous(
    name = "GDP per Capita",
    trans = "log10",
    labels = scales::comma
  ) +
  scale_size(range = c(0, 20)) +
  guides(size = "none") +
  scale_colour_discrete(na.translate = FALSE, name = "Continent") +
  labs(
    title = "Fertility continues to decrease as GDP per capita increases",
    caption = "Source: Gapminder",
    y = "Fertility rates"
  ) +
  geom_point(alpha = 0.3) 

fig_bubble_multiple
Figure 4.3: Association between fertility rates and GDP per capita based on the years 1960, 1975, 1990 and 2005

Without having used the facet argument, all points of all four years have been included into one graph. This graph does not provide us with the information we were looking for.

fig_bubble_multiple_facet <-
  ggplot(data = filter(gapminder_df, year %in% c(1960, 1975, 1990, 2005))) +
  aes(
    x = gdp_per_capita,
    y = fertility,
    size = population,
    color = continent
  ) +
  scale_x_continuous(
    name = "GDP per Capita",
    trans = "log10",
    labels = scales::comma
  ) +
  scale_size(range = c(0, 20)) +
  guides(size = "none") +
  scale_colour_discrete(na.translate = FALSE , name = "Continent") +
  labs(
    title = "Fertility continues to decrease as GDP per capita increases",
    caption = "Source: Gapminder",
    y = "Fertility rates"
  ) +
  geom_point(alpha = 0.3) +
  # specifiying we want the graphs split based on year
  facet_wrap(~ year)

fig_bubble_multiple_facet
Figure 4.4: Association between fertility rates and GDP per capita based on the years 1960, 1975, 1990 and 2005, using facet

Now that we’ve specified the facet argument, we now have four separate graphs that can be properly analysed. In Figure 4.4 we see an increasingly negative relationship between the two variables over time. This observation is congruent with the hypothesis that as GDP per capita increases, fertility decreases.

This global trend can be attributed to the increasing proportion of women in the workforce in the mid to late 20th century. As a result of World War II (1939-1945), women took on roles outside the home to compensate for men at war. Despite increased GDP per capita, this may have contributed to reduced fertility (babies per woman) over time. In 1960, a clear disparity among continents is seen. Most European countries’ fertility rates fell below 5, while their GDP per capita increased. Most African countries maintained high fertility rates above 5, but little change is seen in GDP per capita. The Asian continent shows the most variation among countries during that year. Some smaller Asian countries continued to maintain high fertility rates as GDP per capita increased in 1960. However, others displayed a drastic decrease in fertility rates by 1960. The Americas followed a steady decline over the years. By 2005, an overall negative relationship can be seen with most countries’ fertility rates below 5 babies per woman.

fig_bubble_row_2011 <-
  ggplot(data = filter(gapminder_df, year == 2011)) +
  aes(
    x = gdp_per_capita,
    y = fertility
  ) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ continent, nrow = 1)

fig_bubble_row_2011 
Figure 4.5: Association between fertility rates and GDP per capita based on continent

In the graph above, we see an example of separating the single graph into graphs based on continent. It has also been specified to have all graphs appear in one single row through the nrow argument. Very importantly however, this graph is unclear and cannot be used to compare the relationship between fertility and gdp per capita.

fig_bubble_facet_2011 <-
  ggplot(data = filter(gapminder_df, year == 2011)) +
  aes(
    x = gdp_per_capita,
    y = fertility
  ) +
  scale_x_continuous(
    name = "GDP per Capita",
    trans = "log10",
    labels = scales::comma
  )  +
  geom_point(alpha = 0.5) +
  facet_wrap(~ continent)

fig_bubble_facet_2011 
Figure 4.6: Association between fertility rates and GDP per capita based on continent not using nrow

In the next example above, we removed the nrow argument and the system automatically separated the graphs into three columns with two rows. Additionally, we changed the x-axis to a log scale to better interpret gdp per capita. There is a way to determine a relationship between fertility and gdp per capita by continent.

4.6 Public Health Interpretation

A global negative trend is depicted between GDP per capita and fertility over time. Such changes were due to wars as well as social, cultural and economic changes that incentivize smaller families especially in Asian countries. Most European, American and Asian countries depicted significant decreases in fertility rates over time as GDP per capita increased. On the other hand, African countries remain in the top rank for fertility over the years. These differences are depicted in the population pyramid changes of developed vs developing countries. Public health policies can be tailored to incentivizing increased fertility in developed countries to ensure generation continuity, and effective family planning strategies in developing countries.

4.7 Conclusion

In this lesson, the basic functions of ggplot2 package were shown, which can create a scatterplot. There are three layers to the code to make a plot in R: data, aesthetic, and geometric. Within the aesthetic layer, functions can be added such as size and color to analyze more variables. Additionally, facets can split up graphs over a categorical variable, adding another potential variable to analyze in the plot.