Tag: data visualization

Writing Functions to Automate Repetitive Plotting Tasks in ggplot2

Introduction

There are often situations when you need to perform repetitive plotting tasks. For example, you’d like to plot the same kind of data (e.g. the same economic indicator) for several states, provinces, or cities. Here are some ways you can address this:

But what if the data is too complex to fit into a single plot? Or maybe there are just too many levels in your grouping variable – for example, if you try to plot family income data for all 50 U.S. states, a plot made up of 50 facets would be virtually unreadable. Same goes for a plot with all 50 states on its X axis.

Yet another example of a repetitive plotting task is when you’d like to use your own custom plot theme for your plots.

Both use cases – making multiple plots on the same subject, and using the same theme for multiple plots – require the same R code to run over and over again. Of course, you can simply duplicate your code (with necessary changes), but this is tedious and not optimal, putting it mildly. In case of plotting data for all 50 U.S. states, would you copy and paste the same chunk of code 50 times?

Fortunately, there is a much better way – simply write a function that will iteratively run the code as many times as you need.

Making Multiple Plots on the Same Subject

Lets’ start with a more complex use case – making multiple plots on the same subject. To illustrate this, I will be using the ‘education’ dataset that contains education levels of people aged 25 to 64, broken down by gender, according to 2016 Canadian Census. You may consider this post to be a continuation of Part 6 of the Working with Statistics Canada Data in R series.

You can find the code that retrieves the data using the specialized cancensus package here and here. If you are not interested in Statistics Canada data, you can simply download the dataset and read it into R:

education <- read.csv("education.csv", stringsAsFactors = TRUE)

Let’s take a look at the first 20 lines of the ‘education’ dataset (all data for the ‘Canada’ region):

> head(education, 20)
# A tibble: 20 x 5
   region vector        count gender level                        
   <fct>  <fct>         <dbl> <fct>  <fct>                        
 1 Canada v_CA16_5100 1200105 Male   None                         
 2 Canada v_CA16_5101  969690 Female None                         
 3 Canada v_CA16_5103 2247025 Male   High school or equivalent    
 4 Canada v_CA16_5104 2247565 Female High school or equivalent    
 5 Canada v_CA16_5109 1377775 Male   Apprenticeship or trades     
 6 Canada v_CA16_5110  664655 Female Apprenticeship or trades     
 7 Canada v_CA16_5118 1786060 Male   College or equivalent        
 8 Canada v_CA16_5119 2455920 Female College or equivalent        
 9 Canada v_CA16_5121  240035 Male   University below bachelor    
10 Canada v_CA16_5122  340850 Female University below bachelor    
11 Canada v_CA16_5130  151210 Male   Cert. or dipl. above bachelor
12 Canada v_CA16_5131  211250 Female Cert. or dipl. above bachelor
13 Canada v_CA16_5127 1562155 Male   Bachelor's degree            
14 Canada v_CA16_5128 2027925 Female Bachelor's degree            
15 Canada v_CA16_5133   74435 Male   Degree in health**           
16 Canada v_CA16_5134   78855 Female Degree in health**           
17 Canada v_CA16_5136  527335 Male   Master's degree              
18 Canada v_CA16_5137  592850 Female Master's degree              
19 Canada v_CA16_5139  102415 Male   Doctorate*                   
20 Canada v_CA16_5140   73270 Female Doctorate*

Our goal is to plot education levels as percentages for both genders, and for all regions. This is a good example of a repetitive plotting task, as we’ll be making one plot for each region. Overall, there are 6 regions, so we’ll be making 6 plots:

> levels(education$region)
[1] "Canada"  "Halifax"  "Toronto"  "Calgary"  "Vancouver"  "Whitehorse"

Ideally, our plot should also reflect the hierarchy of education levels.

Preparing the Data

The data, as retrieved from Statistics Canada in Part 5 of the Working with Statistics Canada Data in R series, is not yet ready for plotting: it doesn’t have percentages, only counts. Also, education levels are almost, but not quite, in the correct order: the ‘Cert. or dipl. above bachelor’ is before ‘Bachelor’s degree’, while it should of course follow the Bachelor’s degree.

So let’s apply some final touches to our dataset, after which it will be ready for plotting. First, lets load tidyverse:

library(tidyverse)

Then let’s calculate percentages and re-level the ‘levels’ variable:

# prepare 'education' dataset for plotting
education <- education %>% 
  group_by(region) %>% 
  mutate(percent = round(count/sum(count)*100, 1)) %>% 
  mutate(level = factor(level, # put education levels in logical order
                        levels = c("None", 
                                   "High school or equivalent", 
                                   "Apprenticeship or trades", 
                                   "College or equivalent", 
                                   "University below bachelor", 
                                   "Bachelor's degree", 
                                   "Cert. or dipl. above bachelor", 
                                   "Degree in health**", 
                                   "Master's degree", 
                                   "Doctorate*")))

Note that we needed to group the data by the ‘region’ variable to make sure our percentages get calculated correctly, i.e. by region. If you are not sure if the dataset has been grouped already, you can check this with the dplyr::is_grouped_df() function.

Writing Functions to Generate Multiple Plots

Now our data is ready to be plotted, so let’s write a function that will sequentially generate our plots – one for each region. Pay attention to the comments in the code:

## plot education data

# a function for sequential graphing of data by region
plot.education <- function(x = education) {
  
  # a vector of names of regions to loop over
  regions <- unique(x$region)
  
  # a loop to produce ggplot2 graphs 
  for (i in seq_along(regions)) {
    
    # make plots; note data = args in each geom
    plot <- x %>% 
      ggplot(aes(x = level, fill = gender)) +
      geom_col(data = filter(x, region == regions[i], 
                             gender == "Male"), 
               aes(y = percent)) +
      geom_col(data = filter(x, region == regions[i], 
                             gender == "Female"), 
               # multiply by -1 to plot data left of 0 on the X axis
               aes(y = -1*percent)) +  
      geom_text(data = filter(x, region == regions[i], 
                              gender == "Male"), 
                aes(y = percent, label = percent), 
                hjust = -.1) +
      geom_text(data = filter(x, region == regions[i], 
                              gender == "Female"), 
                aes(y = -1*percent, label = percent), 
                hjust = 1.1) +
      expand_limits(y = c(-17, 17)) +
      scale_y_continuous(breaks = seq(-15, 15, by = 5),
                         labels = abs) + # axes labels as absolute values
      scale_fill_manual(name = "Gender",
                        values = c("Male" = "deepskyblue2",
                                   "Female" = "coral1")) +
      coord_flip() +
      theme_bw() +
      theme(plot.title = element_text(size = 14, face = "bold",
                                      hjust = .5,
                                      margin = margin(t = 5, b = 15)),
            plot.caption = element_text(size = 12, hjust = 0, 
                                        margin = margin(t = 15)),
            panel.grid.major = element_line(colour = "grey88"),
            panel.grid.minor = element_blank(),
            legend.title = element_text(size = 13, face = "bold"),
            legend.text = element_text(size = 12),
            axis.text = element_text(size = 12, color = "black"),
            axis.title.x = element_text(margin = margin(t = 10),
                                        size = 13, face = "bold"),
            axis.title.y = element_text(margin = margin(r = 10),
                                        size = 13, face = "bold")) +
      labs(x = "Education level", 
           y = "Percent of population", 
           fill = "Gender",
           title = paste0(regions[i], ": ", "Percentage of Population by Highest Education Level, 2016"),
           caption = "* Doesn’t include honorary doctorates.\n** A degree in medicine, dentistry, veterinary medicine, or optometry.\nData: Statistics Canada 2016 Census.")
    
    # create folder to save the plots to
    if (dir.exists("output")) { } 
      else {dir.create("output")}
    
    # save plots to the 'output' folder
    ggsave(filename = paste0("output/",
                             regions[i],
                             "_plot_education.png"),
           plot = plot,
           width = 11, height = 8.5, units = "in")
    
    # print each plot to screen
    print(plot)
  }
}

Let’s now look in detail at the key sections of this code. First, we start with creating a vector of regions’ names for our function to loop over, and then we follow with a simple for-loop: for (i in seq_along(regions)). We put our plotting code inside the loop’s curly brackets { }.

Note the data = argument in each geom: region == regions[i] tells ggplot() to take the data that corresponds to each element of the ‘regions’ vector, for each new iteration of the for-loop.

Since we want our plot to reflect the hierarchy of education levels and to show the data by gender, the best approach would be to plot the data as a pyramid, with one gender being to the left of the center line, and the other – to the right. This is why each geom is plotted twice, with the dplyr::filter() function used to subset the data.

The y = -1*percent argument to the aes() function tells the geom to plot the data to the left of the 0 center line. It has to be accompanied by labels = abs argument to scale_y_continuous(), which tells this function to use absolute values for the Y axis labels, since you obviously can’t have a negative percentage of people with a specific education level.

Note also the expand_limits(y = c(-17, 17)), which ensures that axis limits stay the same in all plots generated by our function. This is one of those rare cases when expand_limits() is preferable to coord_flip(), since with expand_limits() axis limits stay the same in all auto-generated plots. However, keep in mind that expand_limits() trims observations outside of the set range from the data, so it should be used with caution. More on this here and here.

Next, coord_flip() converts a bar plot into a pyramid, so that education levels are on the Y axis, and percentages are on the X axis.

Finally, note how our for-loop uses regions[i] inside the labs() function to iteratively add the names of the regions to the plots’ titles, and to correctly name each file when saving our plots with ggsave().

To generate the plots, run:

plot.education()

Here is one of our plots:

If you did everything correctly, there should be five more graphics like this one in your “output” folder – one for each region in our dataset.

Making Custom Plot Themes

The other way how you can simplify repetitive plotting tasks, is by making your own custom plot themes. Since every plot theme in ggplot2 is a function, you can easily save your favorite theme settings as a custom-made function. Making a theme is easier than writing functions to generate multiple plots, as you won’t have to write any loops.

Suppose, you’d like to save the theme of our education plots, and to use it in other plots. To do this, simply wrap theme settings in function():

## Save custom theme as a function ##
theme_custom <- function() {
  theme_bw() + # note ggplot2 theme is used as a basis
  theme(plot.title = element_text(size = 14, face = "bold",
                                  hjust = .5,
                                  margin = margin(t = 5, b = 15)),
        plot.caption = element_text(size = 11, hjust = 0, 
                                    margin = margin(t = 15)),
        panel.grid.major = element_line(colour = "grey88"),
        panel.grid.minor = element_blank(),
        legend.title = element_text(size = 13, face = "bold"),
        legend.text = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title.x = element_text(margin = margin(t = 10),
                                    size = 13, face = "bold"),
        axis.title.y = element_text(margin = margin(r = 10),
                                    size = 13, face = "bold"))
}

Note that this code takes one of ggplot2 themes as a basis, and then alters some of its elements to our liking. You can change any theme like this: a ggplot2 theme, a custom theme from another package such as ggthemes, or your own custom theme.

Let’s now use the saved theme in a plot. Usually it doesn’t matter what kind of data we are going to visualize, as themes tend to be rather universal. Note however, that sometimes the data and the type of visualization do matter. For example, our theme_custom() won’t work for a pie chart, because our theme has grid lines and labelled X and Y axes.

To illustrate how this theme fits an entirely different kind of data, let’s plot some data about penguins. Why penguins? Because I love Linux!

The data was recently released as the palmerpenguins package containing various measurements of 3 species of penguins (discovered via @allison_horst). The package is quite educational: for example, I learned that Gentoo is not only a Linux, but also a penguin!

palmerpenguins is not yet on CRAN, so you’ll need to install it from GitHub:

devtools::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)

Let’s now make a scatterplot showing the relationship between the bill length and body mass in the three species of penguins from palmerpenguins. Let’s also add regression lines with 95% confidence intervals to our plot, and apply our custom-made theme:

## Plot penguins data with a custom theme
plot_penguins <- 
  penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g,
             color = species)) +
  geom_point(size = 2, na.rm = TRUE) +
  geom_smooth(aes(fill = species), 
              formula = y ~ x, # optional: removes message
              method = "lm", 
              alpha = .3, # alpha level for conf. interval
              na.rm = TRUE) + 
  # Note that you need identical name, labels, and values 
  # in both manual scales to avoid legend duplication:
  # this merges two legends into one.
  scale_color_manual(name = "Species",
                     values = c("Adelie" = "orange2",
                                "Chinstrap" = "dodgerblue",
                                "Gentoo" = "orchid")) +
  scale_fill_manual(name = "Species",
                    values = c("Adelie" = "orange2",
                               "Chinstrap" = "dodgerblue",
                               "Gentoo" = "orchid")) +
  theme_custom() + # here is our custom theme
  labs(x = "Bill length, mm",
       y = "Body mass, grams",
       title = "Body Mass to Bill Length in Adelie, Chinstrap, and Gentoo Penguins",
       caption = "Data: Gorman, Williams, and Fraser 2014")

As usual, let’s save the plot to the ‘output’ folder and print it to screen:

# save plot to 'output' folder
ggsave("output/plot_penguins.png", 
       plot_penguins,
       width = 11, height = 8.5, units = "in")

# print plot to screen
print(plot_penguins)

Here it is:

Updating Plot Themes

Now, suppose your organization uses a green-colored theme for their website and reports, so your penguin data plot needs to fit the overall style. Fortunately, updating a custom theme is very easy: you re-assign those theme elements you’d like to change, e.g. to use a different color:

# further change some elements of our custom theme
theme_custom_green <- function() {
  theme_custom() +
  theme(plot.title = element_text(color = "darkgreen"),
        plot.caption =  element_text(color = "darkgreen"),
        panel.border = element_rect(color = "darkgreen"),
        axis.title = element_text(color = "darkgreen"),
        axis.text = element_text(color = "darkgreen"),
        axis.ticks = element_line(color = "darkgreen"),
        legend.title = element_text(color = "darkgreen"),
        legend.text = element_text(color = "darkgreen"),
        panel.grid.major = element_line(color = "#00640025"))
}

Upd.: Note the use of an 8-digit hex color code in the last line: it is the hex value for the “darkgreen” color with an alpha-level of 25. This is how you can change the transparency of grid lines so that they don’t stand out too much, since element_line() doesn’t take the alpha argument. Keep in mind that if you want to use a color other than “gray” (or “grey”) for grid lines, you’d have to use actual hex values, not color names. Setting element_line(color = “darkgreen25”) would throw an error. You can find more about hex code colors with alpha values here. Thanks to @PhilSmith26 for the tip!

Then simply replace theme_custom() in the code above with theme_custom_green(). No other changes needed!

And last but not least, here is the citation for the penguins data:

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081.

Working with Statistics Canada Data in R, Part 6: Visualizing Census Data

Back to Working with Statistics Canada Data in R, Part 5.

Introduction

In the previous part of the Working with Statistics Canada Data in R series, we have retrieved the following key labor force indicators from the 2016 Canadian census:

  • labor force participation rate, employment rate, and unemployment rate,
  • percent of workers by work situation: full time vs part time, by gender, and
  • education levels of people aged 25 to 64, by gender,

… for Canada as a country and for the largest metropolitan areas in each of Canada’s five geographic regions.

Now we are going to plot the labor force participation rates and the percent of workers by work situation. And in the next post, I’ll show how to write functions to automate repetitive plotting tasks using the 2016 Census education data as an example.

As always, let’s start with loading the required packages. Note ggrepel package, which helps to prevent overlapping of data points and text labels in our graphics.

# load packages
library(tidyverse)
library(ggrepel)

Ordered Bar Plot: Labor Force Involvement Rates

Why the bar plot for this data? Well, the bar plot is one of the simplest and thus easiest to interpret plots, and the data – labor force involvement rates – fits this type of plot nicely. We will plot the rates for all our regions in the same graphic, and we are going to order regions by unemployment rate.

Creating an Ordering Vector

In the previous part of this series, we retrieved 2016 Census data for labor force involvement rates, did some preparatory work required to plot the data with ggplot2 package, and saved the data as the ‘labor’ dataframe. There is one more step we need to complete before we can plot this data: we need to create an ordering vector with unemployment numbers and append this vector to ‘labor’.

# prepare 'labor' dataset for plotting: 
# create an ordering vector to set the order of regions in the plot
labor <- labor %>%
  group_by(region) %>%  # groups data by region
  filter(indicator == "unemployment rate") %>%
  select(-indicator) %>%
  rename(unemployment = rate) %>% 
  left_join(labor, by = "region") %>% 
  mutate(indicator = factor(indicator, 
                            levels = c("participation rate",
                                       "employment rate",
                                       "unemployment rate")))

Note the left_join call, which joins the result of manipulating the ‘labor’ dataframe back onto ‘labor’. If it seems confusing, take a look at this code, which returns the same output:

# alt. (same output):
labor_order <- labor %>%
  filter(indicator == "unemployment rate") %>%
  select(-indicator) %>%
  rename(unemployment = rate)

labor <- labor %>%
left_join(labor_order, by = "region") %>%
mutate(indicator = factor(indicator,
                          levels = c("participation rate",
                                     "employment rate",
                                     "unemployment rate")))

Also note the mutate call that manually re-assigns factor levels of the ‘indicator’ variable, so that labor force indicators are plotted in the logical order: first labor force participation rate, then employment rate, and finally the unemployment rate. Remember that ggplot2 plots categorical variables in the order of factor levels.

Making an Ordered Bar Plot

# plot data
plot_labor <- 
  labor %>% 
  ggplot(aes(x = reorder(region, unemployment), 
             y = rate, 
             fill = indicator)) +
  geom_col(width = .6, position = "dodge") +
  geom_text(aes(label = rate),
            position = position_dodge(width = .6),
            show.legend = FALSE,
            size = 3.5,
            vjust = -.4) +
  scale_y_continuous(name = "Percent", 
                     breaks = seq(0, 80, by = 10)) +
  scale_x_discrete(name = NULL) +
  scale_fill_manual(name = "Indicator:",
                    values = c("participation rate" = "deepskyblue2",
                               "employment rate" = "olivedrab3",
                               "unemployment rate" = "tomato")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = .5, size = 14, 
                                  face = "bold"),
        plot.subtitle = element_text(hjust = .5, 
                                     size = 13, 
                                     margin = margin(b = 15)),
        panel.grid.major = element_line(colour = "grey88"),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold",
                                    margin = margin(r = 8)),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 12),
        legend.position = "bottom",
        plot.caption = element_text(size = 11, hjust = 0,
                                    margin = margin(t = 15))) +
  labs(title = "Labor Force Indicators in Canada's Geographic Regions' Largest Cities in 2016",
       subtitle = "Compared to Canada, Ordered by Unemployment Rate",
       caption = "Data: Statistics Canada 2016 Census.")

Note the x = reorder(region, unemployment) inside the aes call: this is where we order the plot’s x axis by unemployment rates. Remember that we have grouped our data by region so that we could put regions on the X axis.

Note also the scale_fill_manual function, where we manually assign colors to the plot’s fill aesthetic (hence scale_fill_manual).

Saving the Plot

Now that we have made the plot, let’s create the directory where we will be saving our graphics, and save our plot to it:

# save plot to a specific folder
dir.create("output") # creates folder
ggsave("output/plot_labor.png", 
       plot_labor,
       width = 11, height = 8.5, units = "in")

Finally, let’s print the plot to screen:

# print plot to screen
print(plot_labor)

Faceted Plot: Full Time vs Part Time Workers, by Gender

This will be a more complex task compared to plotting labor force participation rates. Here we have the data that is broken down by work situation (full-time vs part-time), and by gender, and also by region. And ideally, we also want the total numbers for full-time and part-time workers to be presented in the same plot. This is too complex to be visualized as a simple bar plot like the one we’ve just made.

To visualize all these data in a single plot, we’ll use faceting: breaking down one plot into multiple sub-plots. And I suggest a donut chart – a variation on a pie chart that has a round hole in the center. Note that generally speaking, pie charts have a well-deserved bad reputation, which boils down to two facts: humans have difficulty visually comparing angles, and if you have many categories in your data, pie charts become an unreadable mess. Here and here you can read more about pie charts’ shortcomings, and which plots can best replace pie charts.

So why an I using a pie chart? Well, three reasons, really. First, we’ll only have four categories inside the chart, so it won’t be messy. Second, it is technically a donut chart, not a pie chart, and it is the empty space inside each donut where I will put the total numbers for full- and part-time workers. And third, I’d like to show how to make donut charts with ggplot2 in case you ever need this.

Preparing the Data for Plotting

In the previous post, we have retrieved the 2016 Census data on the percentage of full-time and part-time workers, by gender, and saved it in the ‘work’ dataframe. Let’s now prepare the data for plotting. For that, we’ll need to add three more variables. ‘type_gender’ will be a categorical variable that combines work type and gender – currently these are two different variables. ‘percent’ will contain percentages for each combination of work type and gender, by region. And ‘percent_type’ will contain total percentages for full-time and part-time workers, by region.

# prepare 'work' dataset for plotting: 
work <- work %>% 
  group_by(region) %>% 
  mutate(type_gender = str_c(type, gender, sep = " ")) %>% 
  # percent of workers by region, work type, and gender
  mutate(percent = round(count/sum(count)*100, 1)) %>% 
  # percent of workers by work type, total
  group_by(region, type) %>% 
  mutate(percent_type = sum(percent))

Making a Faceted Donut Plot

Now the dataset is ready for plotting, so let’s make a faceted plot. Since ggplot2 doesn’t like pie-charts (of which a donut chart is a variant), there is no ‘pie’ geom, and we’ll have to get a bit hacky with the code. Pay close attention to the in-code comments.

# plot work data (as a faceted plot)
plot_work <-
  work %>% 
  ggplot(aes(x = "", 
             y = percent, 
             fill = type_gender)) +
  geom_col(color = "white") + # sectors' separator color
  coord_polar(theta = "y") +
  geom_text_repel(aes(label = percent),
                  # put text labels inside corresponding sectors:
                  position = position_stack(vjust = .5), 
                  # repelling force:
                  force = .02, 
                  size = 4.5) + 
  geom_label_repel(data = distinct(select(work, c("region",
                                                  "type",
                                                  "percent_type"))),
                   aes(x = 0, # turns pie chart into donut chart
                       y = percent_type, 
                       label = percent_type, 
                       fill = type),
                   size = 4.5,
                   fontface = "bold",
                   force = .02, # repelling force
                   show.legend = FALSE) +  
  scale_fill_manual(name = "Work situation",
                    labels = c("full time" = "all full-time",
                               "part time" = "all part-time"),
                    values = c("full time male" = "olivedrab4",
                               "full time female" = "olivedrab1",
                               "part time male" = "tan4",
                               "part time female" = "tan1",
                               "full time" = "green3",
                               "part time" = "orange3")) +
  facet_wrap(~ region) +
  guides(fill = guide_legend(nrow = 3)) + 
  theme_void() +
  theme(plot.title = element_text(size = 14, face = "bold",
                                  margin = margin(t = 10, b = 20),
                                  hjust = .5),
        strip.text = element_text(size = 12, face = "bold"), 
        plot.caption = element_text(size = 11, hjust = 0,
                                    margin = margin(t = 20, b = 10)),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 12),
        # change size of symbols (colored squares) in legend:
        legend.key.size = unit(1.1, "lines"), 
        legend.position = "bottom") +
  labs(title = "Percentage of Workers, by Work Situation & Gender, 2016",
       caption = "Note: Percentages may not add up to 100% due to values rounding.\nData source: Statistics Canada 2016 Census.")

Here is our plot:

There are a number of things in the plot’s code that I’d like to draw your attention to. First, a ggplot2 pie chart is a stacked bar chart (geom_col) made in the polar coordinate system: coord_polar(theta = “y”). For geom_col, position = “stack” is the default, so it is not specified in the code. Note also that geom_col needs the X aesthetic, but a pie chart doesn’t have an X coordinate. So I used x = “” to trick geom_col into thinking it has the X aesthetic, otherwise it would have thrown an error: “geom_col requires the following missing aesthetics: x”.

But how do you turn a pie chart into a donut chart? To do this, I set x = 0 inside the ggrepel::geom_label_repel aes call. Try passing different values to x to see how it works: for example, x = 1 turns the plot into a standard pie chart, while x = -1 turns a donut into a ring.

In order to prevent labels overlap, I used ggrepel::geom_text_repel and ggrepel::geom_label_repel to add text labels to our plot instead of ggplot2::geom_text and ggplot2::geom_label. And position = position_stack(vjust = .5) inside geom_text_repel puts text labels in the middle of their respective sectors of the donut plot.

The data = distinct(select(work, c(“region”, “type”, “percent_type”)) argument to geom_label_repel prevents the duplication of labels containing total numbers for full-time and part-time workers.

The scale_fill_manual is used to manually assign colors and names to our plot’s legend items, and guides(fill = guide_legend(nrow = 3)) changes the order of legend items.

Finally, facet_wrap(~ region) creates a faceted plot, by region.

And just as we did with the previous plot, let’s save our plot to the ‘output’ folder and print it to screen:

# save plot to 'output' folder
ggsave("output/plot_work.png", 
       plot_work,
       width = 11, height = 8.5, units = "in")

# print work plot to screen
print(plot_work)

In the next post, I will show how to write functions to automate repetitive plotting tasks.

COVID-19 Canada Data Explorer

Click picture to open the app

Update: the COVID-19 Canada Data Explorer was endorsed by the Macdonald-Laurier Institute – one of Canada’s leading public policy think tanks!

Update 2: My public policy brief Designing COVID-19 Data Tools, co-authored with Ken Coates and Carin Holroyd, was published by the Johnson Shoyama Graduate School of Public Policy, University of Saskatchewan. The brief uses COVID-19 Canada Data Explorer as an example, and makes a comparative overview of COVID-19 data analysis and visualization tools available from Canada’s federal and provincial governments.

Update 3: eRum2020::CovidR
This is a more or less final version, so here is the source code for the app. And here is the code used to retrieve and pre-process geospatial data.

* * *

In the times of the pandemic, the data community can help in many ways, including by developing instruments to track and break down the data on the spread of the dreaded coronavirus disease. The COVID-19 Canada Data Explorer app was built with R, including Shiny and Leaflet, to process the official dataset available from the Government of Canada. They do have their own data visualization tool, but it is very basic. You can do so much more with the available data!

I hope that my app will help public health professionals, policymakers, and really anyone to stay informed about the course of SARS-CoV-2 epidemic in Canada.

Some Things to Keep in Mind (Technical Stuff Here)

The app runs on a free version of Shiny Server, which doesn’t allow to serve Shiny applications using SSL/TLS encryption. Hence HTTP connection instead of HTTPS – your browser may say that “connection is not secure”. I am aware that there may be a way to force connection through HTTPS using reverse proxy as described here. However, these instructions are well over 4 years old, and the required library libapache2-mod-proxy-html is no longer available; no replacement for Ubuntu 18.04 either. If you know how to do this on Apache (not Nginx), I’d very much appreciate your advice (please use this contact form to reach me).

The data is downloaded from Canada.ca at 6-hour intervals to minimize the load on the data repository – there is likely a high demand due to the pandemic. This means that there may be a delay of up to six hours from the time Canada.ca update their data to the moment it is updated on my server.

Working with Statistics Canada Data in R, Part 3: Visualizing CANSIM Data

Back to Working with Statistics Canada Data in R, Part 2.
Forward to Working with Statistics Canada Data in R, Part 4.

Exploring Data

In the previous part of this series, we have retrieved CANSIM data on the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in Saskatchewan, and the CPI data for the same province. We then used the CPI data to adjust the wages for inflation, and saved the results as wages_0370 dataset. To get started, let’s take a quick look at the dataset, what types of variables it contains, which should be considered categorical, and what unique values categorical variables have:

# explore wages_0370 before plotting
View(wages_0370)
map(wages_0370, class)
map(wages_0370, unique)

The first two variables – year and group – are of the type “character”, and the rest are numeric of the type “double” (“double” simply means they are not integers, i.e. can have decimals).

Also, we can see that wages_0370 dataset is already in the tidy format, which is an important prerequisite for plotting with ggplot2 package. Since ggplot2 is included into tidyverse, there is no need to install it separately.

Preparing Data for Plotting

At this point, our data is almost ready to be plotted, but we need to make one final change. Looking at the unique values, we can see that the first two variables (year and group) should be numeric (integer) and categorical respectively, while the rest are continuous (as they should be).

In R, categorical variables are referred to as factors. It is important to expressly tell R which variables are categorical, because mapping ggplot aesthetics – things that go inside aes() – to a factor variable makes ggplot2 use a discrete colour scale (distinctly different colors) for different categories (different factor levels in R terms). Otherwise, values would be plotted to a gradient, i.e. different hues of the same color. There are several other reasons to make sure you expressly identify categorical variables as factors if you are planning to visualize your data. I understand that this might be a bit too technical, so if you are interested, you can find more here and here. For now, just remember to convert your categorical variables to factors if you are going to plot your data. Ideally, do it always – it is a good practice to follow.

( ! ) It is a good practice to always convert categorical variables to factors.

So, let’s do it: convert year to an integer, and group to a factor. Before doing so, let’s remove the word “population” from “Non-Aboriginal population” category, so that our plot’s legend takes less space inside the plot. We can also replace accented “é” with ordinary “e” to make typing in our IDE easier. Note that the order is important: first we edit the string values of a “character” class variable, and only then convert it to a factor. Otherwise, our factor will have missing levels.

( ! ) Converting a categorical variable to a factor should be the last step in cleaning your dataset.

wages_0370 <- wages_0370 %>% 
  mutate(group = str_replace_all(group, 
                                 c(" population" = "", "é" = "e"))) %>% 
  mutate_at("year", as.integer) %>% 
  mutate_if(is.character, as_factor)

Note: if you only need to remove string(s), use str_remove or str_remove_all:

mutate(group = str_remove(group, " population"))

Plotting with ggplot2

Finally, we are ready to plot the data with ggplot2!

# plot data
plot_wages_0370 <- 
  wages_0370 %>% 
  ggplot(aes(x = year, y = dollars_2007, 
             color = group)) +
  geom_point(size = 2.5) +
  geom_line(size = 1.2) +
  geom_label(aes(label = round(dollars_2007)),
  # alt: use geom_label_repel() # requires ggrepel
             fontface = "bold",
             label.size = .5, # label border thickness
             size = 3.5, # label size
             # force = .005, # repelling force: requires ggrepel 
             show.legend = FALSE) +
  coord_cartesian(ylim = c(650, 1000)) + # best practice to set scale limits
  scale_x_continuous(breaks = 2007:2018) +
  scale_y_continuous(name = "2007 dollars",
                     breaks = seq(650, 1000, by = 50)) + 
  scale_color_manual(values = c("First Nations" = "tan3",
                                "Non-Aboriginal" = "royalblue",
                                "Metis" = "forestgreen")) +
  theme_bw() +
  theme(plot.title = element_text(size = 12, 
                                  face = "bold", 
                                  hjust = .5,
                                  margin = margin(b = 10)),
        plot.caption = element_text(size = 11),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_line(colour = "grey85"),
        axis.text = element_text(size = 11),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12, face = "bold",
                                    margin = margin(r = 10)),
        legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 11, face = "bold")) +
  labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
       caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")

This is what the resulting graphic looks like:

Let’s now look at what this code does. We start with feeding our data object wages_0370 into ggplot function using pipe.

( ! ) Note that ggplot2 internal syntax differs from tidyverse pipe-based syntax, and uses + instead of %>% to join code into blocks.

Next, inside ggplot(aes()) call we assign common aesthetics for all layers, and then proceed to choosing the geoms we need. If needed, we can assign additional/override common aesthetics inside individual geoms, like we did when we told geom_label to use dollars_2007 variable values (rounded to a dollar) as labels. If you’d like to find out more about what layers are, and about ggplot2 grammar of graphics, I recommend this article by Hadley Wickham.

Choosing Plot Type

Plot type in each layer is determined by geom_* functions. This is where geoms are:

geom_point(size = 2.5) +
geom_line(size = 1.2) +
geom_label(aes(label = round(dollars_2007)),
# alt: use geom_label_repel() # requires ggrepel
           fontface = "bold",
           label.size = .5, # label border thickness
           size = 3.5, # label size
           # force = .005, # repelling force: requires ggrepel 
           show.legend = FALSE) +

Choosing plot type is largely a judgement call, but you should always make sure to choose the type of graphic that would best suite the data you have. In this case, our goal is to reveal the dynamics of wages in Saskatchewan over time, hence our choice of geom_line. Note that the lines in our graphic are for visual aid only – to make it easier for an eye to follow the trend. They are not substantively meaningful like they would be, for example, in a regression plot. geom_point is also there primarily for visual purposes – to make the plot’s legend more visible. Note that unlike the lines, the points in this plot are substantively meaningful, i.e. they are exactly where our data is (but are covered by labels). If you don’t like the labels in the graphic, you can use points instead.

Finally, geom_label plots our substantive data. Note that I am using show.legend = FALSE argument – this is simply because I don’t like the look of geom_label legend symbols, and prefer a combined line+point symbol instead. If you prefer geom_label symbols in the plot’s legend, remove show.legend = FALSE argument from geom_label call, and add it to geom_line and geom_point.

Preventing Overlaps with ggrepel

You have noticed some commented lines in the ggplot call. You may also have noticed that some labels in our graphic overlap slightly. In this case the overlap is minute and can be ignored. But what if there are a lot of overlapping data points, enough to affect readability?

Fortunately, there is a package to solve this problem for the graphics that use text labels: ggrepel. It has *_repel versions of ggplot2::geom_label and ggplot2::geom_text functions, which repel the labels away from each other and away from the data points.

install.packages("ggrepel")
library("ggrepel")

ggrepel functions can take the same arguments as corresponding ggplot2 functions, and also take the force argument that defines repelling force between overlapping text labels. I recommend setting it to a small value, as the default 1 seems way too strong.

Here is what our graphic looks like now. Note that the nearby labels no longer overlap:

Axes and Scales

This is where axes and scales are defined:

coord_cartesian(ylim = c(650, 1000)) + 
scale_x_continuous(breaks = 2007:2018) +
scale_y_continuous(name = "2007 dollars",
                   breaks = seq(650, 1000, by = 50)) + 
scale_color_manual(values = c("First Nations" = "tan3",
                              "Non-Aboriginal" = "royalblue",
                              "Metis" = "forestgreen")) +

coord_cartesian is the function I’d like to draw your attention to, as it is the best way to zoom the plot, i.e. to get rid of unnecessary empty space. Since we don’t have any values less than 650 or more than 950 (approximately), starting our Y scale at 0 would result in a less readable plot, where most space would be empty, and the space where we actually have data would be crowded. If you are interested in why coord_cartesian is the best way to set axis limits, there is an in-depth explanation.

( ! ) It is a good practice to use coord_cartesian to change axis limits.

Plot Theme, Title, and Captions

Next, we edit our plot theme:

theme_bw() +
theme(plot.title = element_text(size = 12, 
                                face = "bold", 
                                hjust = .5,
                                margin = margin(b = 10)),
      plot.caption = element_text(size = 11),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(colour = "grey85"),
      axis.text = element_text(size = 11),
      axis.title.x = element_blank(),
      axis.title.y = element_text(size = 12, face = "bold",
                                  margin = margin(r = 10)),
      legend.title = element_blank(),
      legend.position = "bottom",
      legend.text = element_text(size = 11, face = "bold")) +

First I selected a simple black-and-white theme theme_bw, and then overrode some of the theme’s default settings in order to improve the plot’s readability and overall appearance. Which theme and settings to use is up to you, just make sure that whatever you do makes the plot easier to read and comprehend at a glance. Here you can find out more about editing plot theme.

Finally, we enter the plot title and plot captions. Captions are used to provide information about the sources of our data. Note the use of \n (new line symbol) to break strings into multiple lines:

labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
     caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")

Saving Your Plot

The last step is to save the plot so that we can use it externally: insert into reports and other publications, publish online, etc.

# save plot
ggsave("plot_wages_0370.svg", plot_wages_0370)

ggsave takes various arguments, but only one is mandatory: file name as a string. The second argument plot defaults to the last plot displayed, but it is advisable to name the plot expressly to make sure the right one gets saved. You can find out more about how ggsave works here.

My favourite format to save graphics is SVG, which is being used in the code example above. SVG stands for Scalable Vector Graphics – an extremely lightweight vectorized format that ensures the graphic stays pixel-perfect at any resolution. Note that SVG is not really a pixel image like JPEG or PNG, but a bunch of XML code, which entails certain security implications when using SVG files online.

ggsave is dependent on the settings (e.g. aspect ratio, size, etc.) of your system’s graphics device, which can be inconsistent between sessions. If you notice inconsistencies in how the saved graphics look, consider using the export package – a specialized package for saving graphics and statistical output made in R. Its functions offer more customizable saving options, more formats (file types), and with the right settings ensure consistent output between sessions.

This was the last of the three articles about working with CANSIM data. In the next article in the “Working with Statistics Canada Data in R” series, I’ll move on to working with the national census data.