This is the second part of the Correlation Analysis in R series. In this post, I will provide an overview of some of the packages and functions used to perform correlation analysis in R, and will then address reporting and visualizing correlations as text, tables, and correlation matrices in online and print publications.
There are multiple packages that allow you to perform basic correlation analysis and provide sufficiently detailed output (by which I mean more detailed than that of stats::cor()). Of those, I prefer rstatix and correlation (the latter is part of the easystats ecosystem). Both have a cor_test() function, and both are better than stats::cor.test() because they work in a pipe and return their output as a dataframe.
Let’s illustrate the use of cor_test() from both packages with the data collected by Gorman, Williams, and Fraser (2014), which is available in the palmerpenguins package. First, let’s install and load the packages, then get data for one penguin species:
# install packages
install.packages("rstatix")
install.packages("correlation")
install.packages("palmerpenguins")
# load packages
library(dplyr)
library(tidyr) # for drop_na()
library(rstatix)
library(correlation)
library(palmerpenguins)
# select Adelie penguins
adelie <- penguins %>%
filter(species == "Adelie") %>%
select(c(2, 3, 6)) %>% # keep only relevant data
drop_na()
Advantages of rstatix::cor_test():

- it is a pipe-friendly wrapper around stats::cor.test(), so it returns the same values,
- it returns a tibble, unlike stats::cor.test(), which returns a list, and unlike correlation::cor_test(), which returns a dataframe but not a tibble,
- unlike with correlation::cor_test(), you don’t have to remember to put variable names in quotes.

Disadvantages of rstatix::cor_test():

- it does not calculate CIs for Spearman’s rho \(\rho\) and Kendall’s tau \(\tau\), unlike correlation::cor_test(), which can,
- it supports fewer correlation methods than correlation::cor_test(), which also offers the "auto" method, where R tries to guess the best method for you, and
- it has fewer options overall than correlation::cor_test().

Let’s illustrate:
# rstatix::cor_test()
rstatix::cor_test(adelie, bill_length_mm, body_mass_g, method = "spearman")
#> # A tibble: 1 x 6
#> var1 var2 cor statistic p method
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.55 258553. 2.77e-13 Spearman
# correlation::cor_test()
correlation::cor_test(adelie, x = "bill_length_mm", y = "body_mass_g", method = "spearman")
#> Parameter1 | Parameter2 | rho | 95% CI | S | p | Method | n_Obs
#> -----------------------------------------------------------------------------------------
#> bill_length_mm | body_mass_g | 0.55 | [0.42, 0.65] | 2.59e+05 | < .001 | Spearman | 151
Most R packages, including stats, rstatix, and correlation, use Pearson’s correlation coefficient \(r\) as the default method for correlation analysis, so you’ll need to expressly set the method argument if you need to compute a different coefficient.
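For example, once you set the method argument, either function will compute Kendall’s \(\tau\) instead:

```r
# request a non-default method explicitly
rstatix::cor_test(adelie, bill_length_mm, body_mass_g, method = "kendall")
correlation::cor_test(adelie, x = "bill_length_mm", y = "body_mass_g",
                      method = "kendall")
```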
Both rstatix::cor_test() and correlation::cor_test() support directional hypothesis testing, even though in the latter case the directional option is not documented in the help returned by ?correlation::cor_test:
rstatix::cor_test(adelie, bill_length_mm, body_mass_g,
alternative = "greater")
#> # A tibble: 1 x 8
#> var1 var2 cor statistic p conf.low conf.high method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.55 8.01 1.48e-13 0.447 1 Pearson
correlation::cor_test(adelie, x = "bill_length_mm", y = "body_mass_g",
alternative = "greater")
#> Parameter1 | Parameter2 | r | 95% CI | t(149) | p | Method | n_Obs
#> --------------------------------------------------------------------------------------
#> bill_length_mm | body_mass_g | 0.55 | [0.45, 1.00] | 8.01 | < .001 | Pearson | 151
Even if your analysis does not immediately return a p-value or a CI for your chosen method, the correlation package provides two functions that can calculate them for nearly any method in existence: cor_to_p() and cor_to_ci(). These functions take a correlation coefficient, the sample size (n), and the method (see ?correlation::cor_to_ci for a full list of supported methods).

Let’s illustrate using the values returned by our analysis of the correlation between bill length and body mass in Adelie penguins, for Spearman’s \(\rho\) coefficient:
# p-value
correlation::cor_to_p(.55, n = 151, method = "spearman")
#> $p
#> [1] 2.581667e-13
#>
#> $statistic
#> [1] 8.038661
# CI with default confidence level
correlation::cor_to_ci(.55, n = 151, method = "spearman")
#> $CI_low
#> [1] 0.4239604
#>
#> $CI_high
#> [1] 0.6551406
# CI with 99% confidence level
correlation::cor_to_ci(.55, n = 151, ci = 0.99, method = "spearman")
#> $CI_low
#> [1] 0.3802826
#>
#> $CI_high
#> [1] 0.683883
As of the time of writing, cor_to_p() and cor_to_ci() do not support directional hypothesis testing.
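If you need a directional p-value anyway, a common workaround (not part of the correlation API, and valid only when the observed coefficient lies in the hypothesized direction) is to halve the two-sided p-value:

```r
# sketch: one-sided p-value derived from the two-sided one;
# only valid if the observed coefficient is in the hypothesized direction
res <- correlation::cor_to_p(.55, n = 151, method = "spearman")
p_one_sided <- res$p / 2
```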
A correlation matrix is simply a table containing correlation coefficients for pairs of variables. It is useful when you need to report coefficients (and sometimes their p-values too) for more than two variables. Here is what it looks like:
# clean up missing data
penguins <- drop_na(penguins)
# make correlation matrix
cmat <- rstatix::cor_mat(penguins, names(select_if(penguins, is.numeric)))
cmat
#> # A tibble: 5 x 6
#> rowname bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bill_length_mm 1 -0.23 0.65 0.59 0.033
#> 2 bill_depth_mm -0.23 1 -0.580 -0.47 -0.048
#> 3 flipper_length_mm 0.65 -0.580 1 0.87 0.15
#> 4 body_mass_g 0.59 -0.47 0.87 1 0.022
#> 5 year 0.033 -0.048 0.15 0.022 1
You can reorder a correlation matrix by coefficient:
# correlation matrix, ordered by coefficient
rstatix::cor_reorder(cmat)
#> # A tibble: 5 x 6
#> rowname bill_depth_mm year bill_length_mm flipper_length_mm body_mass_g
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bill_depth_mm 1 -0.048 -0.23 -0.580 -0.47
#> 2 year -0.048 1 0.033 0.15 0.022
#> 3 bill_length_mm -0.23 0.033 1 0.65 0.59
#> 4 flipper_length_mm -0.580 0.15 0.65 1 0.87
#> 5 body_mass_g -0.47 0.022 0.59 0.87 1
It is also possible to extract significance levels from the correlation matrix with rstatix::cor_get_pval(), which returns a table of numeric p-values:
# matrix of p-values
rstatix::cor_get_pval(cmat)
#> # A tibble: 5 x 6
#> rowname bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bill_length_mm 0. 2.53e- 5 7.21e- 42 1.54e- 32 0.553
#> 2 bill_depth_mm 2.53e- 5 0. 4.78e- 31 7.02e- 20 0.381
#> 3 flipper_length_mm 7.21e-42 4.78e-31 0. 3.13e-105 0.00574
#> 4 body_mass_g 1.54e-32 7.02e-20 3.13e-105 0. 0.691
#> 5 year 5.53e- 1 3.81e- 1 5.74e- 3 6.91e- 1 0
You can also get a correlation matrix with both coefficients (as numbers) and p-values (as symbols). By default, the symbols and their meanings are: **** \(\leq\) .0001, *** \(\leq\) .001, ** \(\leq\) .01, * \(\leq\) .05, no symbol \(=\) not significant. You can assign your own symbols and significance cut-off points with the symbols and cutpoints arguments, respectively.
rstatix::cor_mark_significant(cmat)
#> rowname bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> 1 bill_length_mm
#> 2 bill_depth_mm -0.23****
#> 3 flipper_length_mm 0.65**** -0.58****
#> 4 body_mass_g 0.59**** -0.47**** 0.87****
#> 5 year 0.033 -0.048 0.15** 0.022
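For instance, to drop the four-star level and keep the conventional three, you could pass your own cutpoints and symbols (a sketch using the documented arguments):

```r
# custom significance cutpoints and symbols
rstatix::cor_mark_significant(cmat,
                              cutpoints = c(0, 0.001, 0.01, 0.05, 1),
                              symbols = c("***", "**", "*", ""))
```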
If you don’t like the matrix format, you can pivot the matrix into a long-format dataframe of paired variables with rstatix::cor_gather(). The returned table shows both the coefficients and the p-values as numbers. Note that the table might get quite long, depending on the number of correlated variables:
rstatix::cor_gather(cmat)
#> # A tibble: 25 x 4
#> var1 var2 cor p
#> <chr> <chr> <dbl> <dbl>
#> 1 bill_length_mm bill_length_mm 1 0.
#> 2 bill_depth_mm bill_length_mm -0.23 2.53e- 5
#> 3 flipper_length_mm bill_length_mm 0.65 7.21e-42
#> 4 body_mass_g bill_length_mm 0.59 1.54e-32
#> 5 year bill_length_mm 0.033 5.53e- 1
#> 6 bill_length_mm bill_depth_mm -0.23 2.53e- 5
#> 7 bill_depth_mm bill_depth_mm 1 0.
#> 8 flipper_length_mm bill_depth_mm -0.580 4.78e-31
#> 9 body_mass_g bill_depth_mm -0.47 7.02e-20
#> 10 year bill_depth_mm -0.048 3.81e- 1
#> # … with 15 more rows
The opposite function, rstatix::cor_spread(), spreads a long correlation dataframe back into a correlation matrix:
cmat_long <- cor_gather(cmat)
rstatix::cor_spread(cmat_long)
#> # A tibble: 5 x 6
#> rowname bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 bill_length_mm 1 -0.23 0.65 0.59 0.033
#> 2 bill_depth_mm -0.23 1 -0.580 -0.47 -0.048
#> 3 flipper_length_mm 0.65 -0.580 1 0.87 0.15
#> 4 body_mass_g 0.59 -0.47 0.87 1 0.022
#> 5 year 0.033 -0.048 0.15 0.022 1
Reporting correlation coefficients is pretty easy: you just have to say how big they are and what their significance value is. When reporting, keep in mind the recommendations in Field, Miles, and Field (2012, 241).
For example, the results of this test:
correlation::cor_test(adelie, x = "bill_length_mm", y = "body_mass_g")
#> Parameter1 | Parameter2 | r | 95% CI | t(149) | p | Method | n_Obs
#> --------------------------------------------------------------------------------------
#> bill_length_mm | body_mass_g | 0.55 | [0.43, 0.65] | 8.01 | < .001 | Pearson | 151
Can be reported as follows:
Our research shows a highly significant positive correlation between bill length and body mass among Adelie penguins: \(t\) = 8.01, \(p\) < .001, Pearson’s \(r\)(149) = .55, \(n\) = 151, 95% CI [0.43, 0.65].
Reporting Spearman’s or Kendall’s correlation coefficients would be similar, but without degrees of freedom.^{1}
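To avoid copying numbers by hand, you can also assemble the report string programmatically from the cor_test() output; the format string below is my own sketch, not a package feature:

```r
# build a report string from the rstatix::cor_test() output (sketch)
res <- rstatix::cor_test(adelie, bill_length_mm, body_mass_g)
n <- nrow(adelie)
sprintf("Pearson's r(%d) = %.2f, p = %.2e, n = %d, 95%% CI [%.2f, %.2f]",
        n - 2, res$cor, res$p, n, res$conf.low, res$conf.high)
```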
You have no doubt noticed that the results of statistical models are often reported as nicely formatted tables in peer-reviewed journals. So far, our correlations have been reported as plain-text tables in a monospaced font, which means they look a bit ugly. Fortunately, R has a multitude of packages designed to format tables. You can find a brief overview of most (although certainly not all) of them here.
My criteria for choosing the best packages to format tables are simple. First and foremost, the package should be fully compatible with R Markdown, which I use for nearly all my writing (and you should too, because of how much better it is than MS Word, a legacy software with a horrible UI).^{2} This means that when you knit your Rmd, the table should render correctly in at least the following formats: HTML, PDF, Word, PowerPoint, and ideally also OpenDocument and LaTeX. The package should also be easy to use and well-documented.
Upon some research, I think that the best options are:

- huxtable – supports most formats and is well-documented,
- flextable – the best documented, and
- gtsummary – the simplest.

Although gtsummary renders natively as HTML only, its output can be converted to huxtable or flextable objects, which in turn can be rendered as pretty much anything. Also, huxtable and flextable are highly versatile and can be used to format any tables, regardless of their contents, while gtsummary is primarily intended to format the output of commonly used statistical models.

I personally prefer huxtable because it supports the largest number of formats (for some to work, you may still need flextable to be installed) and has a simple, straightforward syntax.
# install packages for table formatting
install.packages("huxtable")
install.packages("flextable")
Avoid loading huxtable and flextable at the same time, as there will be conflicts between some of their functions. If you have to, call their functions using the packagename:: syntax.
Let’s now demonstrate huxtable in action by formatting the results of our correlation analysis:
# load huxtable
library(huxtable)
# make huxtable
adelie_ht <- rstatix::cor_test(adelie,
bill_length_mm, body_mass_g,
method = "spearman") %>%
as_huxtable() %>%
set_all_padding(row = everywhere, col = everywhere, value = 6) %>%
set_bold(1, everywhere) %>%
set_top_border(1, everywhere, value = 0.8) %>%
set_bottom_border(1, everywhere, value = 0.4) %>%
set_caption("Correlation between Body Mass and Bill Length in Adelie Penguins")
# render huxtable
adelie_ht
In some situations, you might need to render a huxtable object as an image, e.g. to combine it with a ggplot object or for other purposes. For example, I had to do this for compatibility with the goodpress package, which for some reason can’t process huxtable HTML output. To render your huxtable as an image, you’ll first need to convert it to a flextable object with huxtable::as_flextable(), and then render it with flextable::as_raster().^{3}
# render huxtable as an image
adelie_ht %>%
as_flextable() %>%
flextable::as_raster(.)
Refer to the huxtable documentation for the details of what these functions do. Note that some table formatting options work only for specific output types. For example, the higher visual weight of the top border (value = 0.8) renders correctly in PDF, but in HTML both borders render with equal weight. Not sure if this is a bug or a feature.
As a more advanced example, let’s format our correlation matrix cmat. First, let’s reorder it by correlation coefficient, and then render the coefficients in different font colors depending on each coefficient’s sign and magnitude:
cmat_ht <- rstatix::cor_reorder(cmat) %>%
as_huxtable() %>%
set_all_padding(row = everywhere, col = everywhere, value = 6) %>%
set_bold(1, everywhere) %>%
set_background_color(evens, everywhere, "grey92") %>%
map_text_color(-1, -1, by_colorspace("red4", "darkgreen")) %>%
set_caption("Correlation Matrix for Pygoscelis Penguins") %>%
set_col_width(everywhere, value = c(.16, .15, .11, .2, .2, .2)) %>%
set_width(1.02) %>% # note how sum of col widths == total table width
theme_article() # yes, there are themes!
cmat_ht %>%
as_flextable() %>%
flextable::as_raster(.)
If you are using R Markdown for your reporting and are rendering to PDF, keep in mind that you won’t be able to format table captions with HTML tags, as in set_caption("<b>Correlation Matrix for Pygoscelis Penguins</b>"): the tags will be rendered literally (as “<b>” and “</b>”). In HTML output, however, this works.
Also keep in mind that the best YAML settings for PDF output would be:
output:
pdf_document:
latex_engine: xelatex
This will work well for complex or unusual LaTeX syntax, which may otherwise cause a “Unicode character … not set up for use with LaTeX” error when knitting to PDF.
Just to illustrate how PDF output would look, here is this post rendered to PDF from R Markdown. Looks nice, doesn’t it?
rstatix has a function to visualize correlation matrices: cor_plot(). However, rstatix::cor_plot() does not return a ggplot object, and thus:

- you can’t apply ggplot2 themes or custom theme objects to it, and
- its output can’t be further customized or annotated using ggplot2-based packages such as ggpubr.^{4}
Therefore, I would instead recommend ggcorrplot::ggcorrplot(), which returns a ggplot object that can be altered, customized, or annotated using the broad ecosystem of ggplot2-based packages. Another great package is latex2exp. It lets you render LaTeX expressions inside plot objects, which is very handy if you’d like to use special symbols or Greek letters in your plot’s text elements.
# install ggcorrplot, ggpubr, and latex2exp
install.packages("ggcorrplot")
install.packages("ggpubr")
install.packages("latex2exp")
# load ggcorrplot and ggpubr
library(ggcorrplot)
library(ggpubr)
There are two main ways to visualize a correlation matrix: as a square plot, where correlations are duplicated (remember that \(COR_{xy} = COR_{yx}\)) and self-correlations (\(r = 1\)) are included, and as a half-square plot, where correlation coefficients are not duplicated and self-correlations are excluded. Optionally, you can also add correlation coefficients to the plot, mark statistically non-significant correlations or exclude them completely, change the plot’s color scheme, etc. Since ggcorrplot() returns a ggplot2 object, it can be further altered (e.g. by adding a subtitle, annotations, captions, etc.) with ggpubr::ggpar(), ggpubr::annotate_figure(), and similar functions.
# ggcorrplot - basic
ggcorrplot(cmat, title = "Penguins Correlated")
Let’s now customize the plot by removing self-correlations, leaving non-significant coefficients blank, assigning a different color palette, reordering the plot by correlation coefficient, choosing plot theme, and adding subtitle and caption. Pay attention to comments in the code:
# ggcorrplot - customized
ggcorrplot(cmat, # takes correlation matrix
title = "Penguins Correlated",
ggtheme = theme_classic, # takes ggplot2 and custom themes
colors = c("red", "white", "forestgreen"), # custom color palette
hc.order = TRUE, # reorders matrix by corr. coeff.
type = "upper", # prevents duplication; also try "lower"
lab = TRUE, # adds corr. coeffs. to the plot
insig = "blank", # wipes non-significant coeffs.
lab_size = 3.5) %>%
# add subtitle and caption; note rendering LaTeX symbols in ggplot objects
ggpubr::ggpar(subtitle = latex2exp::TeX("Significant correlations only (p$\\leq$.05)",
output = "text"),
caption = "Data: Gorman, Williams, and Fraser 2014")
Note how you can render LaTeX symbols inside ggplot objects with the latex2exp::TeX() function.
Hopefully, I’ve managed to provide some useful tips on performing and reporting correlation analysis. The next post in this series will be dedicated to robust methods for correlation analysis.
This post is also available as a PDF.
Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. 1st ed. London, Thousand Oaks, New Delhi, Singapore: SAGE Publications.
Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (Genus Pygoscelis).” PLoS ONE 9 (3). https://doi.org/10.1371/journal.pone.0090081.
Sim, Julius, and Norma Reid. 1999. “Statistical inference by confidence intervals: Issues of interpretation and utilization.” Physical Therapy 79 (2): 186–95. https://doi.org/10.1093/ptj/79.2.186.
Also note that the magnitude of Spearman’s correlation is usually very close to Pearson’s, but Kendall’s is not. For small samples, Kendall’s \(\tau\) gives a more accurate estimate of the correlation in the population, particularly when your ranked data (ranked because it is a non-parametric test) has a lot of tied ranks (Field, Miles, and Field 2012, 225). More on this later in the series.︎
For example, all posts in my blog are written in R Markdown and deployed to WordPress directly from R using the goodpress
package. Changing a single setting in the post’s YAML header (which takes a few seconds) can turn it into a nicely formatted HTML page, PDF article, MS Word or LibreOffice document, etc.︎
You’ll also need to have packages webshot
and magick
installed, along with their system dependencies.︎
If you try, it will return “Error in ggpubr…: Can’t handle an object of class matrix”.︎
There are probably tutorials and posts on all aspects of correlation analysis, including on how to do it in R. So why more?
When I was learning statistics, I was surprised by how few learning materials I personally found to be clear and accessible. This might be just me, but I suspect I am not the only one who feels this way. Also, everyone’s brain works differently, and different people would prefer different explanations. So I hope that this will be useful for people like myself – social scientists and economists – who may need a simpler and more hands-on approach.
This series is based on my notes and summaries of what I personally consider some of the best textbooks and articles on basic statistics, combined with R code to illustrate the concepts and give practical examples. Likely there are people out there whose cognitive processes are similar to mine, and who will hopefully find this series useful.
Why correlation analysis specifically, you might ask? Understanding correlation is the basis for most other statistical models; in fact, these things are directly related, as you will see just a few paragraphs further down in this post.
Although this series will go beyond the basic explanation of what a correlation coefficient is and will thus include several posts, it is not intended to be a comprehensive source on the subject. Some topics won’t be covered because I do not know much about them, or because I am not planning to include them (for example, Bayesian correlations). Since I will be focusing on performing correlation analysis in R, I won’t be addressing basic statistical concepts such as variance, standard deviation, etc.
I was inspired to write this series by The Feynman Technique of Learning.
A variable that is related to another variable is a covariate. Covariates share some of their variance (hence co-variance). But variance depends on the scale of measurement, and is thus sensitive to changes in the scale. Therefore, when you measure covariance, it is not a very useful number, in the sense that you can’t tell whether the variables share a lot or only a little bit of their variance simply by looking at the number.^{1}
To overcome the problem of covariance being dependent on the measurement scale, we need a unit of measurement into which any scale of measurement can be converted. This unit is the standard deviation (SD). Converting units of measurement into SD units is known as standardization. In this case we are converting covariance, and it is done by simply dividing the covariance by the product of standard deviations of the covariates.
This gives us the Pearson product-moment correlation coefficient, more often referred to as the Pearson correlation coefficient, or simply Pearson’s \(r\): \[ r = {COV_{xy} \over SD_x \, SD_y} \]
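You can verify this definition directly in R; a quick sketch using the built-in iris data:

```r
# Pearson's r "by hand": covariance divided by the product of SDs
covxy <- cov(iris$Sepal.Length, iris$Sepal.Width)
r_manual <- covxy / (sd(iris$Sepal.Length) * sd(iris$Sepal.Width))
all.equal(r_manual, cor(iris$Sepal.Length, iris$Sepal.Width))
#> [1] TRUE
```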
So what does the correlation coefficient do in practical terms? Suppose, you have collected your data and now have a scatterplot in front of you. As such, you don’t know much about how the variables are related in your data yet. So you’ll have to start by somehow describing and summarizing your data.
The logical first step would be to find \(mean_x\) and \(mean_y\),^{2} and then mark a point where they intersect – the point of averages. After we have found the point of averages, we can measure the spread of the data points using \(SDx\) and \(SDy\),^{3} which let us know how spread out the data is horizontally and vertically. But if we knew only the point of averages and the standard deviations of our variables, we still wouldn’t know if and how strongly the variables are associated. This is exactly what the correlation coefficient tells us!
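In code, for the iris sepal measurements plotted below (a sketch; x is sepal width, y is sepal length):

```r
# the point of averages and the spread of the data
c(mean_x = mean(iris$Sepal.Width), mean_y = mean(iris$Sepal.Length))
c(sd_x = sd(iris$Sepal.Width), sd_y = sd(iris$Sepal.Length))
```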
The correlation coefficient is the measure of linear association between variables. “If there is a strong association between two variables, then knowing one helps a lot in predicting the other. But when there is a weak association, information about one doesn’t help much in guessing the other” (Freedman, Pisani, and Purves 1998, 121).
What does the “measure of linear association” mean? It simply means that the relationship between the variables can be graphically summarized with a straight line, and the correlation coefficient measures how closely the data points are clustered around that line. Further in this series, I will use the letter \(r\) interchangeably with the term “correlation coefficient”, unless I am speaking about a specific type of coefficient such as Kendall’s tau \(\tau\) or Spearman’s rho \(\rho\).
Let’s illustrate these concepts using a scatterplot:
# load {tidyverse} for convenience
library(tidyverse)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
facet_wrap(~ Species, scales = "free_x") +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
ggpubr::stat_cor(aes(label = tolower(..r.label..)), label.y = 8.1) +
theme_classic() +
theme(panel.spacing = unit(1, "lines")) +
labs(x = "Sepal Width",
y = "Sepal Length",
title = "Sepal Length vs. Sepal Width in Irises",
subtitle = "Grouped by Species")
The closer \(r\) is to 1 or -1, the tighter are the data points grouped around the line and the stronger is the association between the variables; the closer \(r\) is to 0, the looser is the grouping and the weaker is the association. If the coefficient is positive, the variables are positively associated (as \(x\) deviates from the mean, \(y\) deviates in the same direction), and the line goes up. If it is negative, they are negatively associated (as \(x\) deviates from the mean, \(y\) deviates in the opposite direction), and the line goes down.
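A quick sketch with simulated data illustrates how the sign and magnitude of \(r\) reflect the association:

```r
# sign and magnitude of r (simulated data)
set.seed(1)
x <- 1:100
cor(x, x + rnorm(100, sd = 10))   # strong positive association
cor(x, -x + rnorm(100, sd = 10))  # strong negative association
cor(x, rnorm(100))                # little to no association
```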
There are some important things to remember about correlation; see Freedman, Pisani, and Purves (1998, 126, 128, 144–45).
The correlation coefficient has some interesting features that stem from the fact that it is a pure number, i.e. it does not depend on the units in which the variables are measured. I will illustrate these using the iris dataset, but first I will make my own version iris_l (l for “local”) with shorter variable names, so that the code is more concise and readable:
# shorten var names for convenience
iris_l <- select(iris, sl = Sepal.Length, sw = Sepal.Width)
iris_l <- iris_l %>%
mutate(sl_plus = sl + 7.25, sw_plus = sw + 7.25) %>%
mutate(sl_minus = sl - 7.25, sw_minus = sw - 7.25)
# r is unchanged by adding or subtracting a constant
cor(iris_l$sl, iris_l$sw)
cor(iris_l$sl_plus, iris_l$sw_plus)
cor(iris_l$sl_minus, iris_l$sw_minus)
cor(iris_l$sl, iris_l$sw_plus)
cor(iris_l$sl_minus, iris_l$sw)
In all of these cases, \(r\) stays the same: -0.1175698
# r is also unchanged by multiplying one or both variables
# by a positive constant
iris_l <- mutate(iris_l, sl_mpos = sl * 3.5, sw_mpos = sw * 3.5)
cor(iris_l$sl, iris_l$sw_mpos)
#> [1] -0.1175698
cor(iris_l$sl_mpos, iris_l$sw_mpos)
#> [1] -0.1175698
# multiplying both variables by a negative constant leaves r unchanged,
# but multiplying only one of them flips the sign of r
iris_l <- mutate(iris_l, sl_mneg = sl * -3.5, sw_mneg = sw * -3.5)
cor(iris_l$sl_mneg, iris_l$sw_mneg)
#> [1] -0.1175698
cor(iris_l$sl, iris_l$sw_mneg)
#> [1] 0.1175698
Keep these features in mind when converting units in which your variables are measured, e.g. if you are converting temperature from Fahrenheit to Celsius.
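For instance, converting temperature from Fahrenheit to Celsius is a linear transformation with a positive slope, so \(r\) is unchanged (the sales variable below is a made-up example):

```r
temp_f <- c(50, 59, 68, 77, 86)
temp_c <- (temp_f - 32) * 5/9 # Fahrenheit to Celsius
sales <- c(2, 4, 3, 6, 8)     # hypothetical second variable
all.equal(cor(temp_f, sales), cor(temp_c, sales))
#> [1] TRUE
```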
Also, here are some important (and hopefully, unnecessary) reminders:\[-1 \leq r \leq 1\\correlation \ne causation\\correlation \ne no\:causation\]
Although the meaning of the phrase “measure of association between variables” is hopefully now clear, the concept may still seem a bit abstract. How can it help me in my analysis? At least for a linear association, the answer is quite straightforward: \(r\) predicts by how much \(y\) will change upon the change in \(x\) – in standard deviation units, not in the original units of measurement. Fortunately, since we know the SD, we can then easily convert SD units into the original units in our model.
This is known as the regression method, which can be formulated as follows: associated with each change of one SD in \(x\), there is a change of \(r\) * SD in \(y\), on average (Freedman, Pisani, and Purves 1998, 160).
Let’s illustrate how the regression method works using my favorite dataset about penguins (Gorman, Williams, and Fraser 2014), because penguins are cool. First, let’s install the package with data:
# install palmerpenguins
install.packages("palmerpenguins")
Then, let’s get data for one penguin species:
# load data
library(palmerpenguins)
# select gentoo penguins
gentoo <- penguins %>%
filter(species == "Gentoo") %>%
select(c(2, 3, 6)) %>% # keep only relevant data
drop_na()
Then, let’s explore how bill length correlates with body mass in Gentoo penguins and calculate SDs for both variables. We’ll be needing the results of these calculations, so I am saving them as separate data objects. Let’s also take a look at the mean values of these variables:
gentoo_r <- cor(gentoo$bill_length_mm, gentoo$body_mass_g)
sdx <- sd(gentoo$bill_length_mm)
sdy <- sd(gentoo$body_mass_g)
gentoo_r
#> [1] 0.6691662
sdx
#> [1] 3.081857
sdy
#> [1] 504.1162
mean(gentoo$bill_length_mm)
#> [1] 47.50488
mean(gentoo$body_mass_g)
#> [1] 5076.016
In this example \(r\) = 0.67 (approx.). Remember that \(r\) shows by how much \(y\) changes in SD units when \(x\) changes by 1 SD unit. Since we know the SDs for bill length and body mass of Gentoo penguins, we can predict the body mass of a Gentoo penguin using the regression method:
First, let’s find out how much body mass will change when bill length changes by ±1 SD. \[0.67 * SD(body\:mass) = 0.67 * 504 = 337.68\]
Thus, when bill length changes by ±3.08 mm, i.e. by 1 \(SD(bill\:length)\), body mass changes by ±337.68 grams in the same direction (because \(r\) is positive). This means that a Gentoo penguin with a bill length of 50.59 mm (approx. 1 SD above the mean) will have the predicted body mass of: \[mean(body\:mass) + r * SD(body\:mass) = 5076 + 337.68 = 5413.68\:grams\]
I should stress that this is the predicted value. One can also think about it as the most likely value of \(y\) at the corresponding value of \(x\). Actual values, of course, vary. Note also how SD units got converted into actual measurement units when we did our calculations.
Correlation should not be confused with regression, since \(r\) is standardized, and the regression equation is unit-specific. But does \(r\) matter for the regression equation? You bet! It is the key component of the formula for the slope of the regression line:^{4} \[r \times {SD_y \over SD_x}\]
And the regression equation itself is: \(\hat{y} = intercept + x \times slope\),^{5} or: \[\hat{y} = intercept + x \times (r \times {SD_y \over SD_x})\]
The basic concepts should be more or less clear by now, so let’s translate them into actual calculations. Using the regression equation (of which our correlation coefficient gentoo_r is an important part), let us predict the body mass of three Gentoo penguins whose bills are 45 mm, 50 mm, and 55 mm long, respectively.
First, let’s find the intercept:
gentoo_lm <- lm(body_mass_g ~ bill_length_mm, data = gentoo)
intercept <- as.numeric(gentoo_lm$coefficients[1])
And then plug the values (bill lengths, the correlation coefficient gentoo_r, and the standard deviations sdx and sdy) into the regression equation. This produces the predicted body mass values in grams:
x <- c(45, 50, 55)
intercept + x * (gentoo_r * sdy/sdx)
#> [1] 4801.834 5349.130 5896.426
Note that this is not how you would run a predictive linear model in R. Above, I just wrote the regression equation as close to its textbook form as the R code allows. In practice, you’d use the predict() function, which is far more convenient:
x <- data.frame(bill_length_mm = c(45, 50, 55))
predict(gentoo_lm, x)
#> 1 2 3
#> 4801.834 5349.130 5896.426
Keep in mind that predict() requires a specific syntax, not only inside predict() but also inside lm():

- inside lm(), use the data argument to refer to the dataset; do not subset with $ (i.e. do not write the formula as gentoo$body_mass_g ~ gentoo$bill_length_mm),
- use the same variable name for the predictor throughout (here, bill_length_mm inside both the lm() function and the x dataframe).

We can also visually check the output of our regression equation and of the predict() function against a plot:
gentoo %>%
ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
theme_bw()
You can play around with assigning different values to x and re-running the calculations to predict the weights of different penguins. Of course, always keep in mind that the values produced by a statistical model are just that: predicted values, which may or may not make sense. For example, if you set bill length to 1 meter, you’ll get an enormous penguin that weighs 109.3 kg, which should not be possible without advanced genetic engineering. Or if you set it to negative 50 mm, you’ll get a penguin with negative mass, which might exist somewhere in the realms of theoretical physics but certainly not in the real world. Always use common sense when building your model.
First, a few quotes and definitions (emphasis mine):
The observed significance level is the chance of getting a test statistic as extreme as, or more extreme than, the observed one. The chance is computed on the basis that the null hypothesis is right. The smaller this chance is, the stronger the evidence against the null (Freedman, Pisani, and Purves 1998, 481).
The P-value of a test is the chance of getting a big test statistic – assuming the null hypothesis to be right. P is not the chance of the null hypothesis being right (Freedman, Pisani, and Purves 1998, 482).
Scientists test hypotheses using probabilities. In the case of a correlation coefficient … if we find that the observed coefficient was very unlikely to happen if there was no effect [correlation] in the population, then we can gain confidence that the relationship that we have observed is statistically meaningful (Field, Miles, and Field 2012, 210).
I will not be providing a more in-depth breakdown of the concept of statistical significance here, as there are some great explanations in textbooks (including the ones listed in the Bibliography section of this post) and online (for example, here and here). I particularly recommend chapters 26 “Tests of Significance” and 29 “A Closer Look at Tests of Significance” in Freedman, Pisani, and Purves (1998), where statistical significance is explained very simply and clearly. Most importantly, they address common misunderstandings and misuses of significance testing and p-values, accompanied by detailed examples.
Briefly defining a confidence interval (CI) is much harder, as this deceptively simple topic can be very easily misinterpreted. I thus strongly recommend reading at least chapter 21 “The Accuracy of Percentages” in Freedman, Pisani, and Purves (1998), as well as an open-access article by Sim and Reid (1999). After you have read these, take a look at this simulation for a nice visual aid. Here I will only provide the formal and informal definitions, and a short but important quote:
Over infinite repeated sampling, and in the absence of selection, information, and confounding bias, the \(α\)-level confidence interval will include the true value in \(α\)% of the samples for which it is calculated (Naimi and Whitcomb 2020).
If we were to draw repeated samples from a population and calculate a 95% CI for the mean of each of these samples, the population mean would lie within 95% of these CIs. Thus, in respect of a particular 95% CI, we can be 95% confident that this interval is, of all such possible intervals, an interval that includes the population mean rather than an interval that does not include the population mean. It does not … express the probability that the interval in question contains the population mean, as this [probability] must be either 0% or 100% (Sim and Reid 1999).^{6}
The chances are in the sampling procedure, not in the parameter (Freedman, Pisani, and Purves 1998, 384).
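The repeated-sampling idea in these quotes is easy to demonstrate with a small simulation (a sketch with an arbitrary normal population; the exact numbers are illustrative):

```r
# draw many samples from a population with a known mean, compute a
# 95% CI for the mean of each sample, and count how often the
# interval covers the true value
set.seed(42)
true_mean <- 100
covered <- replicate(10000, {
  s <- rnorm(30, mean = true_mean, sd = 15)
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)  # close to 0.95
```

Any single interval either covers the true mean or it does not; the 95% describes the sampling procedure over many samples, not any one interval.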
So, let’s illustrate the correlation coefficient’s CI and p-value using the rstatix::cor_test()
function. The same can be done with R’s default stats::cor.test()
, but I prefer rstatix
because it returns output as a dataframe instead of a list, as well as for other reasons to be addressed in detail in the next post in this series.
install.packages("rstatix")
First, let’s test the two-directional hypothesis that body mass is correlated with bill length in Gentoo penguins. It is called two-directional because our test includes two possibilities: that a higher bill length is correlated with a higher body mass, and that a higher bill length is correlated with a lower body mass. The null-hypothesis (\(H_0\)) is that they are uncorrelated, i.e. that bill length and body mass change independently of each other.
# Two-directional test
rstatix::cor_test(gentoo, bill_length_mm, body_mass_g)
#> # A tibble: 1 x 8
#> var1 var2 cor statistic p conf.low conf.high method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.67 9.91 2.68e-17 0.558 0.757 Pearson
As we see, bill_length_mm
and body_mass_g
are positively correlated: \(r\) = 0.67, and the results are highly statistically significant due to an extremely low p-value: 2.68e-17, which allows us to reject the null hypothesis. In other words, we can say that bill length and body mass are likely correlated not just in our sample, but in the whole population of Gentoo penguins. The output also gives us the lower and upper limits of the 95% CI (default) for the correlation coefficient. You can set a lower or a higher confidence level for the CI with the conf.level
argument. For example:
# 99% CI - see how CI changes
rstatix::cor_test(gentoo,
bill_length_mm, body_mass_g,
conf.level = 0.99)
#> # A tibble: 1 x 8
#> var1 var2 cor statistic p conf.low conf.high method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.67 9.91 2.68e-17 0.518 0.780 Pearson
Try assigning a lower confidence level (e.g. 90%) and see how it affects the CI.
Let’s now test if higher bill length correlates with a higher body mass (this would be a common sense assumption to make):
# Directional test: greater
rstatix::cor_test(gentoo,
bill_length_mm, body_mass_g,
alternative = "greater")
#> # A tibble: 1 x 8
#> var1 var2 cor statistic p conf.low conf.high method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.67 9.91 1.34e-17 0.578 1 Pearson
Based on the outcome of this test, we can say that a higher bill length is likely correlated with a higher body mass not just in our sample, but in the whole population of Gentoo penguins.
Finally, let’s test if a higher bill length correlates with a lower body mass, i.e. if it could be that the longer the beak, the smaller the penguin. Admittedly, this does not sound like a particularly plausible idea, but sometimes our research can produce unexpected results – this is when it can be most fun:
# Directional test: less
rstatix::cor_test(gentoo,
bill_length_mm, body_mass_g,
alternative = "less")
#> # A tibble: 1 x 8
#> var1 var2 cor statistic p conf.low conf.high method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 bill_length_mm body_mass_g 0.67 9.91 1 -1 0.744 Pearson
This time no surprises – this particular result was not significant at all at \(p = 1\), which is the highest a p-value can ever be. This means that the data is fully consistent with \(H_0\).^{7}
Note how in all our tests \(r\) remained the same, but p-value and CI changed depending on the hypothesis we were testing.
The coefficient of determination, \(R^2\), is a measure of the amount of variability in one variable that is shared by the other variable (Field, Miles, and Field 2012, 222).^{8} “When we want to know if two variables are related to each other, we … want to be able to … explain some of the variance in the scores on one variable based on our knowledge of the scores on a second variable” (Urdan 2011, 87). In other words, when variables are correlated, they share a certain proportion^{9} of their variance, which is known as the explained variance or shared variance. \(R^2\) is a very valuable measure, as it tells us the proportion (or the percentage if we multiply \(R^2\) by 100) of variance in one variable that is shared by the other variable. Calculating \(R^2\) is very simple:
\[R^2 = r^2\]
If the concept of shared variance is still not entirely clear, take a look at this visualization, where \(R^2\) is explained graphically as a Venn diagram.
Note that the term coefficient of determination may be somewhat misleading. Correlation by itself does not signify causation, so there is no reason why it would magically become causation when you square the correlation coefficient. Sometimes people refer to \(R^2\) as “the variance in one variable explained by the other”, but we should remember that this does not imply causality. This is also why I think the term shared variance is preferable to explained variance.
Finally, let’s calculate \(R^2\) in R using bill length and body mass of Gentoo penguins as our covariates:
# expressed as a proportion:
cor(gentoo$bill_length_mm, gentoo$body_mass_g)^2
#> [1] 0.4477834
# expressed as a percentage:
cor(gentoo$bill_length_mm, gentoo$body_mass_g)^2 * 100
#> [1] 44.77834
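As a cross-check, the same quantity appears as Multiple R-squared in the summary of the corresponding regression (assuming the gentoo data used above; for a simple one-predictor model, R-squared equals the squared Pearson correlation):

```r
# R-squared from the model summary equals cor(...)^2 above,
# i.e. about 0.448
summary(lm(body_mass_g ~ bill_length_mm, data = gentoo))$r.squared
```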
This concludes the basic theory and definitions behind correlation analysis. In the next post, I will focus on performing and reporting correlation analysis in R.
This post is also available as a PDF.
Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. 1st ed. London, Thousand Oaks, New Delhi, Singapore: SAGE Publications.
Freedman, David, Robert Pisani, and Roger Purves. 1998. Statistics. 3rd ed. New York, London: W.W. Norton & Company.
Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (Genus Pygoscelis).” PLoS ONE 9 (3). https://doi.org/10.1371/journal.pone.0090081.
Naimi, Ashley I, and Brian W Whitcomb. 2020. “Can Confidence Intervals Be Interpreted?” American Journal of Epidemiology 189 (7): 631–33. https://doi.org/10.1093/aje/kwaa004.
Sim, Julius, and Norma Reid. 1999. “Statistical inference by confidence intervals: Issues of interpretation and utilization.” Physical Therapy 79 (2): 186–95. https://doi.org/10.1093/ptj/79.2.186.
Urdan, Timothy C. 2011. Statistics in Plain English. New York: Routledge, Taylor & Francis Group. https://doi.org/10.4324/9780203851173.
Not long ago, I published a post “Three ggplot2 Themes Optimized for the Web”, where I made some tweaks to my three favorite ggplot2
themes – theme_bw()
, theme_classic()
, and theme_void()
– to make them more readable and generally look better in graphics posted online, particularly in blog posts and similar publications.
I was happy to see that some people liked those and suggested that I should make a package. I tended to view packages as large collections of code and functions, but as Sébastien Rochette wisely put it, “If you have one function, create a package! If this simplifies your life, why not?” And since I will be frequently using these themes in subsequent posts, I’d like to make it as convenient as possible for the reader to install and use them.
So here is the ggwebthemes
package! It has the same three themes, which I have tweaked and improved some more.
The package is not yet on CRAN. You can install ggwebthemes
from GitLab:
# option 1: install using devtools
# install.packages("devtools")
devtools::install_gitlab("peterbar/ggwebthemes")
# option 2: install using remotes
# install.packages("remotes")
remotes::install_gitlab("peterbar/ggwebthemes")
# option 3: build from source
# use if you get error: package 'ggplot2' was built under R version...
install.packages("https://gitlab.com/peterbar/ggwebthemes/-/raw/master/tar/ggwebthemes_0.1.1.tar.gz",
repos = NULL, type = "source")
# load ggwebthemes
library(ggwebthemes)
Please report bugs and/or submit feature requests here. Since I am currently using WordPress (but thinking about switching to a static site), I am particularly interested in how these themes would behave on Hugo and in blogs created with R Markdown/blogdown, so if you have any feedback, it will be most appreciated.
You can find the package’s reference manual here.
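A minimal usage sketch (the theme function name below is my assumption; check the reference manual for the exact exported names):

```r
library(ggplot2)
library(ggwebthemes)

# apply a web-optimized theme like any other ggplot2 theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_web_bw()  # hypothetical name; see the reference manual
```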
Note: To avoid confusing the readers, I will be removing the original post “Three ggplot2 Themes Optimized for the Web”, which contains early versions of these themes. You can still find it on R-Bloggers in case you need it.
There are often situations when you need to perform repetitive plotting tasks. For example, you’d like to plot the same kind of data (e.g. the same economic indicator) for several states, provinces, or cities. Here are some ways you can address this:
But what if the data is too complex to fit into a single plot? Or maybe there are just too many levels in your grouping variable – for example, if you try to plot family income data for all 50 U.S. states, a plot made up of 50 facets would be virtually unreadable. Same goes for a plot with all 50 states on its X axis.
Yet another example of a repetitive plotting task is when you’d like to use your own custom plot theme for your plots.
Both use cases – making multiple plots on the same subject, and using the same theme for multiple plots – require the same R code to run over and over again. Of course, you can simply duplicate your code (with necessary changes), but this is tedious and suboptimal, to put it mildly. In the case of plotting data for all 50 U.S. states, would you copy and paste the same chunk of code 50 times?
Fortunately, there is a much better way – simply write a function that will iteratively run the code as many times as you need.
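Before the full census example below, here is a minimal sketch of the pattern using the built-in mtcars data, with the number of cylinders standing in for the grouping variable (the wt and mpg aesthetics are hardcoded for brevity):

```r
library(ggplot2)

# one plot per level of the grouping variable
plot_by_group <- function(df, group_var) {
  for (g in unique(df[[group_var]])) {
    p <- ggplot(df[df[[group_var]] == g, ],
                aes(x = wt, y = mpg)) +
      geom_point() +
      ggtitle(paste("Group:", g))
    print(p)
  }
}

plot_by_group(mtcars, "cyl")  # draws three plots, one per cylinder count
```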
Let’s start with a more complex use case – making multiple plots on the same subject. To illustrate this, I will be using the ‘education’ dataset that contains education levels of people aged 25 to 64, broken down by gender, according to the 2016 Canadian Census. You may consider this post to be a continuation of Part 6 of the Working with Statistics Canada Data in R series.
You can find the code that retrieves the data using the specialized cancensus
package here. If you are not interested in Statistics Canada data, you can simply download the dataset and read it into R:
download.file(url = "https://dataenthusiast.ca/wp-content/uploads/2020/12/education.csv",
destfile = "education.csv")
education <- read.csv("education.csv", stringsAsFactors = TRUE)
Let’s take a look at the first 20 lines of the ‘education’ dataset (all data for the ‘Canada’ region):
head(education, 20)
#> # A tibble: 20 x 5
#> region vector count gender level
#> <fct> <fct> <dbl> <fct> <fct>
#> 1 Canada v_CA16_5100 1200105 Male None
#> 2 Canada v_CA16_5101 969690 Female None
#> 3 Canada v_CA16_5103 2247025 Male High school or equivalent
#> 4 Canada v_CA16_5104 2247565 Female High school or equivalent
#> 5 Canada v_CA16_5109 1377775 Male Apprenticeship or trades
#> 6 Canada v_CA16_5110 664655 Female Apprenticeship or trades
#> 7 Canada v_CA16_5118 1786060 Male College or equivalent
#> 8 Canada v_CA16_5119 2455920 Female College or equivalent
#> 9 Canada v_CA16_5121 240035 Male University below bachelor
#> 10 Canada v_CA16_5122 340850 Female University below bachelor
#> 11 Canada v_CA16_5130 151210 Male Cert. or dipl. above bachelor
#> 12 Canada v_CA16_5131 211250 Female Cert. or dipl. above bachelor
#> 13 Canada v_CA16_5127 1562155 Male Bachelor's degree
#> 14 Canada v_CA16_5128 2027925 Female Bachelor's degree
#> 15 Canada v_CA16_5133 74435 Male Degree in health**
#> 16 Canada v_CA16_5134 78855 Female Degree in health**
#> 17 Canada v_CA16_5136 527335 Male Master's degree
#> 18 Canada v_CA16_5137 592850 Female Master's degree
#> 19 Canada v_CA16_5139 102415 Male Doctorate*
#> 20 Canada v_CA16_5140 73270 Female Doctorate*
Our goal is to plot education levels (as percentages) for both genders, and for all regions. This is a good example of a repetitive plotting task, as we’ll be making one plot for each region. Overall, there are 6 regions, so we’ll be making 6 plots:
levels(education$region)
#> [1] "Canada" "Halifax" "Toronto" "Calgary" "Vancouver" "Whitehorse"
Ideally, our plot should also reflect the hierarchy of education levels.
The data, as retrieved from Statistics Canada in Part 5 of the Working with Statistics Canada Data in R series, is not yet ready for plotting: it doesn’t have percentages, only counts. Also, education levels are almost, but not quite, in the correct order: the ‘Cert. or dipl. above bachelor’ is before ‘Bachelor’s degree’, while it should of course follow the Bachelor’s degree.
So let’s apply some final touches to our dataset, after which it will be ready for plotting. First, let’s load the tidyverse:
library(tidyverse)
Then let’s calculate percentages and re-level the levels
variable:
# prepare 'education' dataset for plotting
education <- education %>%
group_by(region) %>%
mutate(percent = round(count/sum(count)*100, 1)) %>%
mutate(level = factor(level, # put education levels in logical order
levels = c("None",
"High school or equivalent",
"Apprenticeship or trades",
"College or equivalent",
"University below bachelor",
"Bachelor's degree",
"Cert. or dipl. above bachelor",
"Degree in health**",
"Master's degree",
"Doctorate*")))
Note that we needed to group the data by the region
variable to make sure our percentages get calculated correctly, i.e. by region. If you are not sure if the dataset has been grouped already, you can check this with the dplyr::is_grouped_df()
function.
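To see why the grouping matters, here is a toy example: with group_by(), percentages are computed within each region and sum to 100 per group; without it, they would be computed over the whole dataset.

```r
library(dplyr)

d <- tibble(region = c("A", "A", "B", "B"),
            count  = c(1, 3, 2, 2))

d %>%
  group_by(region) %>%
  mutate(percent = count / sum(count) * 100)
# region A: 25 and 75; region B: 50 and 50
```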
Now our data is ready to be plotted, so let’s write a function that will sequentially generate our plots – one for each region. Pay attention to the comments in the code:
## plot education data
# a function for sequential graphing of data by region
plot.education <- function(x = education) {
# a vector of names of regions to loop over
regions <- unique(x$region)
# a loop to produce ggplot2 graphics
for (i in seq_along(regions)) {
# make plots; note data = args in each geom
plot <- x %>%
ggplot(aes(x = level, fill = gender)) +
geom_col(data = filter(x,
region == regions[i],
gender == "Male"),
aes(y = percent)) +
geom_col(data = filter(x,
region == regions[i],
gender == "Female"),
# multiply by -1 to plot data left of 0 on the X axis
aes(y = -1*percent)) +
geom_text(data = filter(x,
region == regions[i],
gender == "Male"),
aes(y = percent, label = percent),
hjust = -.1) +
geom_text(data = filter(x,
region == regions[i],
gender == "Female"),
aes(y = -1*percent, label = percent),
hjust = 1.1) +
expand_limits(y = c(-17, 17)) +
scale_y_continuous(breaks = seq(-15, 15, by = 5),
labels = abs) + # axes labels as absolute values
scale_fill_manual(name = "Gender",
values = c("Male" = "deepskyblue2",
"Female" = "coral1")) +
coord_flip() +
theme_bw() +
theme(plot.title = element_text(size = 14, face = "bold",
hjust = .5,
margin = margin(t = 5, b = 15)),
plot.caption = element_text(size = 12, hjust = 0,
margin = margin(t = 15)),
panel.grid.major = element_line(colour = "grey88"),
panel.grid.minor = element_blank(),
legend.title = element_text(size = 13, face = "bold"),
legend.text = element_text(size = 12),
axis.text = element_text(size = 12, color = "black"),
axis.title.x = element_text(margin = margin(t = 10),
size = 13, face = "bold"),
axis.title.y = element_text(margin = margin(r = 10),
size = 13, face = "bold")) +
labs(x = "Education level",
y = "Percent of population",
fill = "Gender",
title = paste0(regions[i], ": ", "Percentage of Population by Highest Education Level, 2016"),
caption = "* Doesn’t include honorary doctorates.\n** A degree in medicine, dentistry, veterinary medicine, or optometry.\nData: Statistics Canada 2016 Census.")
# create folder to save the plots to
if (!dir.exists("output")) {dir.create("output")}
# save plots to the 'output' folder
ggsave(filename = paste0("output/",
regions[i],
"_plot_education.png"),
plot = plot,
width = 11, height = 8.5, units = "in")
# print each plot to screen
print(plot)
}
}
Let’s now look in detail at the key sections of this code. First, we start with creating a vector of regions’ names for our function to loop over, and then we follow with a simple for-loop: for (i in seq_along(regions))
. We put our plotting code inside the loop’s curly brackets { }
.
Note the data =
argument in each geom: region == regions[i]
tells ggplot()
to take the data that corresponds to each element of the regions
vector, for each new iteration of the for-loop.
Since we want our plot to reflect the hierarchy of education levels and to show the data by gender, the best approach would be to plot the data as a pyramid, with one gender being to the left of the center line, and the other – to the right. This is why each geom is plotted twice, with the dplyr::filter()
function used to subset the data.
The y = -1*percent
argument to the aes()
function tells the geom to plot the data to the left of the 0 center line. It has to be accompanied by labels = abs
argument to scale_y_continuous()
, which tells this function to use absolute values for the Y axis labels, since you obviously can’t have a negative percentage of people with a specific education level.
Note also the expand_limits(y = c(-17, 17))
, which ensures that axis limits stay the same in all plots generated by our function. This is preferable to setting limits through the scale (e.g. scale_y_continuous(limits = ...)), because scale limits remove observations outside of the set range from the data, while expand_limits()
only widens the limits and never drops data, which makes it a safe choice for auto-generated plots. More on this here and here.
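The difference between the ways of controlling axis limits can be summarized in a short sketch (using the built-in mtcars data):

```r
library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

p + scale_y_continuous(limits = c(15, 25))  # drops points outside 15-25
p + coord_cartesian(ylim = c(15, 25))       # zooms in; keeps all data
p + expand_limits(y = c(0, 50))             # only widens the limits
```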
Next, coord_flip()
converts bar plot into a pyramid, so that education levels are on the Y axis, and percentages are on the X axis.
Finally, note how our for-loop uses regions[i]
inside the labs()
function to iteratively add the names of the regions to the plots’ titles, and to correctly name each file when saving our plots with ggsave()
.
To generate the plots, run:
plot.education()
Here is one of our plots:
If you did everything correctly, there should be five more graphics like this one in your “output” folder – one for each region in our dataset.
The other way you can simplify repetitive plotting tasks is by making your own custom plot themes. Since every plot theme in ggplot2
is a function, you can easily save your favorite theme settings as a custom-made function. Making a theme is easier than writing functions to generate multiple plots, as you won’t have to write any loops.
Suppose you’d like to save the theme of our education plots and use it in other plots. To do this, simply wrap the theme settings in function():
## Save custom theme as a function ##
theme_custom <- function() {
theme_bw() + # note ggplot2 theme is used as a basis
theme(plot.title = element_text(size = 10, face = "bold",
hjust = .5,
margin = margin(t = 5, b = 15)),
plot.caption = element_text(size = 8, hjust = 0,
margin = margin(t = 15)),
panel.grid.major = element_line(colour = "grey88"),
panel.grid.minor = element_blank(),
legend.title = element_text(size = 9, face = "bold"),
legend.text = element_text(size = 9),
axis.text = element_text(size = 8),
axis.title.x = element_text(margin = margin(t = 10),
size = 9, face = "bold"),
axis.title.y = element_text(margin = margin(r = 10),
size = 9, face = "bold"))
}
Note that this code takes one of ggplot2
themes as a basis, and then alters some of its elements to our liking. You can change any theme like this: a ggplot2
theme, a custom theme from another package such as ggthemes
, or your own custom theme.
Let’s now use the saved theme in a plot. Usually it doesn’t matter what kind of data we are going to visualize, as themes tend to be rather universal. Note however, that sometimes the data and the type of visualization do matter. For example, our theme_custom()
won’t work for a pie chart, because our theme has grid lines and labelled X and Y axes.
To illustrate how this theme fits an entirely different kind of data, let’s plot some data about penguins. Why penguins? Because I love Linux!
The data was originally presented in (Gorman, Williams, and Fraser 2014) and recently released as the palmerpenguins
package. It contains various measurements of 3 species of penguins (discovered via @allison_horst). The package is quite educational: for example, I learned that Gentoo is not only a Linux, but also a penguin!
Let’s now make a scatterplot showing the relationship between the bill length and body mass in the three species of penguins from palmerpenguins
. Let’s also add regression lines with 95% confidence intervals to our plot, and apply our custom-made theme:
## Plot penguins data with a custom theme
plot_penguins <-
penguins %>%
group_by(species) %>%
ggplot(aes(x = bill_length_mm,
y = body_mass_g,
color = species)) +
geom_point(size = 1, na.rm = TRUE) +
geom_smooth(aes(fill = species),
formula = y ~ x, # optional: removes message
method = "lm",
alpha = .3, # alpha level for conf. interval
na.rm = TRUE) +
# Note that you need identical name, values, and labels (if any)
# in both manual scales to avoid legend duplication:
# this merges two legends into one.
scale_color_manual(name = "Species",
values = c("Adelie" = "orange2",
"Chinstrap" = "dodgerblue",
"Gentoo" = "orchid")) +
scale_fill_manual(name = "Species",
values = c("Adelie" = "orange2",
"Chinstrap" = "dodgerblue",
"Gentoo" = "orchid")) +
theme_custom() + # here is our custom theme
labs(x = "Bill length, mm",
y = "Body mass, grams",
title = "Body Mass to Bill Length in Adelie, Chinstrap, and Gentoo Penguins",
caption = "Data: Gorman, Williams, and Fraser 2014")
As usual, let’s save the plot to the ‘output’ folder and print it to screen:
ggsave("output/plot_penguins.png",
plot_penguins,
width = 11, height = 8.5, units = "in")
print(plot_penguins)
Now, suppose your organization uses a green-colored theme for their website and reports, so your penguin data plot needs to fit the overall style. Fortunately, updating a custom theme is very easy: you re-assign those theme elements you’d like to change, e.g. to use a different color:
# further change some elements of our custom theme
theme_custom_green <- function() {
theme_custom() +
theme(plot.title = element_text(color = "darkgreen"),
plot.caption = element_text(color = "darkgreen"),
panel.border = element_rect(color = "darkgreen"),
axis.title = element_text(color = "darkgreen"),
axis.text = element_text(color = "darkgreen"),
axis.ticks = element_line(color = "darkgreen"),
legend.title = element_text(color = "darkgreen"),
legend.text = element_text(color = "darkgreen"),
panel.grid.major = element_line(color = "#00640025"))
}
Upd.: Note the use of an 8-digit hex color code in the last line: it is the hex value for the “darkgreen” color with an alpha-level of 25. This is how you can change the transparency of grid lines so that they don’t stand out too much, since element_line()
doesn’t take the alpha
argument. Keep in mind that if you want to use a color other than “gray” (or “grey”) for grid lines, you’d have to use actual hex values, not color names. Setting element_line(color = “darkgreen25”)
would throw an error. You can find more about hex code colors with alpha values here. Thanks to @PhilSmith26 for the tip!
Then simply replace theme_custom()
in the code above with theme_custom_green()
. No other changes needed!
And last but not least, here is the citation for the penguins data:
Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (Genus Pygoscelis).” PLoS ONE 9 (3). https://doi.org/10.1371/journal.pone.0090081.
In the previous part of the Working with Statistics Canada Data in R series, we retrieved key labor force indicators (the labor force participation rate, the employment rate, and the unemployment rate) from the 2016 Canadian census for Canada as a country and for the largest metropolitan areas in each of Canada’s five geographic regions.
Now we are going to plot the labor force participation rates and the percent of workers by work situation. And in the next post, I’ll show how to write functions to automate repetitive plotting tasks using the 2016 Census education data as an example.
As always, let’s start with loading the required packages. Note the ggrepel
package, which helps to prevent overlapping of data points and text labels in our graphics.
Why the bar plot for this data? Well, the bar plot is one of the simplest and thus easiest to interpret plots, and the data – labor force involvement rates – fits this type of plot nicely. We will plot the rates for all our regions in the same graphic, and we are going to order regions by unemployment rate.
In the previous part of this series, we retrieved 2016 Census data for labor force involvement rates, did some preparatory work required to plot the data with ggplot2
package, and saved the data as the labor
dataframe. There is one more step we need to complete before we can plot this data: we need to create an ordering vector with unemployment numbers and append this vector to labor
.
# prepare 'labor' dataset for plotting:
# create an ordering vector to set the order of regions in the plot
labor <- labor %>%
group_by(region) %>% # groups data by region
filter(indicator == "unemployment rate") %>%
select(-indicator) %>%
rename(unemployment = rate) %>%
left_join(labor, by = "region") %>%
mutate(indicator = factor(indicator,
levels = c("participation rate",
"employment rate",
"unemployment rate")))
Note the left_join()
call, which joins the result of manipulating the labor
dataframe back onto labor
. If it seems confusing, take a look at this code, which returns the same output:
# alt. (same output):
labor_order <- labor %>%
filter(indicator == "unemployment rate") %>%
select(-indicator) %>%
rename(unemployment = rate)
labor <- labor %>%
left_join(labor_order, by = "region") %>%
mutate(indicator = factor(indicator,
levels = c("participation rate",
"employment rate",
"unemployment rate")))
Also note the mutate()
call that manually re-assigns factor levels of the indicator
variable, so that labor force indicators are plotted in the logical order: first labor force participation rate, then employment rate, and finally the unemployment rate. Remember that ggplot2
plots categorical variables in the order of factor levels.
# plot data
plot_labor <-
labor %>%
ggplot(aes(x = reorder(region, unemployment),
y = rate,
fill = indicator)) +
geom_col(width = .6, position = "dodge") +
geom_text(aes(label = rate),
position = position_dodge(width = .6),
show.legend = FALSE,
size = 2.5,
vjust = -.4) +
coord_cartesian(ylim = c(0, 82)) + # expand Y axis to prevent labels overlap
scale_y_continuous(name = "Percent",
breaks = seq(0, 80, by = 10)) +
scale_x_discrete(name = NULL) +
scale_fill_manual(name = "Indicator:",
values = c("participation rate" = "deepskyblue2",
"employment rate" = "olivedrab3",
"unemployment rate" = "tomato")) +
theme_bw() +
theme(plot.title = element_text(hjust = .5,
size = 10,
face = "bold"),
plot.subtitle = element_text(hjust = .5,
size = 9,
margin = margin(b = 15)),
panel.grid.major = element_line(colour = "grey88"),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 8, face = "bold"),
axis.title.y = element_text(size = 8,
face = "bold",
margin = margin(r = 8)),
legend.title = element_text(size = 8, face = "bold"),
legend.text = element_text(size = 8),
legend.position = "bottom",
plot.caption = element_text(size = 8,
hjust = 0,
margin = margin(t = 15))) +
labs(title = "Labor Force Indicators in Canada's Geographic Regions' Largest Cities in 2016",
subtitle = "Compared to Canada, Ordered by Unemployment Rate",
caption = "Data: Statistics Canada 2016 Census.")
Note the x = reorder(region, unemployment)
inside the aes()
call: this is where we order the plot’s X axis by unemployment rates. Remember that we have grouped our data by region so that we could put regions on the X axis.
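If reorder() is new to you, a tiny self-contained example shows what it does: it re-orders the levels of a factor by (by default, the mean of) a second variable.

```r
f <- factor(c("a", "b", "c"))
v <- c(3, 1, 2)
levels(reorder(f, v))  # "b" "c" "a": levels sorted by the values of v
```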
Note also the scale_fill_manual()
function, where we manually assign colors to the plot’s fill
aesthetic.
Now that we have made the plot, let’s create the directory where we will be saving our graphics, and save our plot to it:
dir.create("output") # creates folder
ggsave("output/plot_labor.png",
plot_labor,
width = 11, height = 8.5, units = "in")
Finally, let’s print the plot to screen:
print(plot_labor)
This will be a more complex task compared to plotting labor force participation rates. Here we have the data that is broken down by work situation (full-time vs part-time), and by gender, and also by region. And ideally, we also want the total numbers for full-time and part-time workers to be presented in the same plot. This is too complex to be visualized as a simple bar plot like the one we’ve just made.
To visualize all these data in a single plot, we’ll use faceting: breaking down one plot into multiple sub-plots. And I suggest a donut chart – a variation on a pie chart that has a round hole in the center. Note that generally speaking, pie charts have a well-deserved bad reputation, which boils down to two facts: humans have difficulty visually comparing angles, and if you have many categories in your data, pie charts become an unreadable mess. Here and here you can read more about pie charts’ shortcomings, and which plots can best replace pie charts.
So why an I using a pie chart? Well, three reasons, really. First, we’ll only have four categories inside the chart, so it won’t be messy. Second, it is technically a donut chart, not a pie chart, and it is the empty space inside each donut where I will put the total numbers for full- and part-time workers. And third, I’d like to show how to make donut charts with ggplot2
in case you ever need this, which is not as straightforward as with most other charts, since ggplot2
doesn’t have a ‘donut’ geom.
In the previous post, we retrieved the 2016 Census data on the percentage of full-time and part-time workers, by gender, and saved it in the work
dataframe. Let’s now prepare the data for plotting. For that, we’ll need to add three more variables. type_gender
will be a categorical variable that combines work type and gender – currently these are two different variables. percent
will contain percentages for each combination of work type and gender, by region. And percent_type
will contain total percentages for full-time and part-time workers, by region.
# prepare 'work' dataset for plotting
work <- work %>%
group_by(region) %>%
mutate(type_gender = str_c(type, gender, sep = " ")) %>%
# percent of workers by region, work type, and gender:
mutate(percent = round(count/sum(count)*100, 1)) %>%
# percent of workers by work type, total:
group_by(region, type) %>%
mutate(percent_type = sum(percent))
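The grouped percentage calculation can be sanity-checked on toy data (the region names and counts below are made up):

```r
library(dplyr)

toy <- tibble(region = rep(c("East", "West"), each = 2),
              count  = c(30, 70, 25, 75)) %>%
  group_by(region) %>%
  # share of each row within its region
  mutate(percent = round(count / sum(count) * 100, 1))

# within each region, the percentages sum to 100 (up to rounding)
summarise(toy, total = sum(percent))
```

Because the data stays grouped by region, sum(count) is computed per region, not over the whole dataframe.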
Now the dataset is ready for plotting, so let’s make a faceted plot. Since ggplot2
doesn’t like pie charts (of which a donut chart is a variant), there is no ‘pie’ geom, and we’ll have to get a bit hacky with the code. Pay close attention to the in-code comments.
# plot work data (as a faceted plot)
plot_work <-
work %>%
ggplot(aes(x = "",
y = percent,
fill = type_gender)) +
geom_col(color = "white") + # sectors' separator color
coord_polar(theta = "y") +
geom_text_repel(aes(label = percent),
# put text labels inside corresponding sectors:
position = position_stack(vjust = .5),
force = .005, # repelling force
size = 2.5) +
geom_label_repel(data = distinct(select(work, c("region",
"type",
"percent_type"))),
aes(x = 0, # turn pie chart into donut chart
y = percent_type,
label = percent_type,
fill = type),
size = 2.5,
fontface = "bold",
force = .007, # repelling force
show.legend = FALSE) +
scale_fill_manual(name = "Work situation",
labels = c("full time" = "all full-time",
"part time" = "all part-time"),
values = c("full time male" = "olivedrab4",
"full time female" = "olivedrab1",
"part time male" = "tan4",
"part time female" = "tan1",
"full time" = "green3",
"part time" = "orange3")) +
facet_wrap(~ region) +
guides(fill = guide_legend(nrow = 3)) +
theme_void() +
theme(plot.title = element_text(size = 10,
face = "bold",
margin = margin(t = 10, b = 20),
hjust = .5),
strip.text = element_text(size = 8, face = "bold"),
plot.caption = element_text(size = 8,
hjust = 0,
margin = margin(t = 20, b = 10)),
legend.title = element_text(size = 8, face = "bold"),
legend.text = element_text(size = 8),
# change size of symbols (colored squares) in legend:
legend.key.size = unit(1, "lines"),
legend.position = "bottom") +
labs(title = "Percentage of Workers, by Work Situation & Gender, 2016",
caption = "Note: Percentages may not add up to 100% due to values rounding.\nData source: Statistics Canada 2016 Census.")
There are a number of things in the plot’s code that I’d like to draw your attention to. First, a ggplot2
pie chart is a stacked bar chart (geom_col
) made in the polar coordinate system: coord_polar(theta = "y")
. For geom_col()
, position = "stack"
is the default, so it is not specified in the code. Note also that geom_col()
needs the x
aesthetic, but a pie chart doesn’t have an x
coordinate. So I used x = “”
to trick geom_col()
into thinking it has the x
aesthetic, otherwise it would have thrown an error: "geom_col requires the following missing aesthetics: x"
.
But how do you turn a pie chart into a donut chart? To do this, I set x = 0
inside the ggrepel::geom_label_repel()
aes()
call. Try passing different values to x to see how it works: for example, x = 1
turns the plot into a standard pie chart, while x = -1
turns a donut into a narrow ring.
To prevent labels from overlapping, I used ggrepel::geom_text_repel()
and ggrepel::geom_label_repel()
to add text labels to our plot instead of ggplot2::geom_text()
and ggplot2::geom_label()
. And position = position_stack(vjust = .5)
inside geom_text_repel()
puts text labels in the middle of their respective sectors of the donut plot.
The data = distinct(select(work, c("region", "type", "percent_type")))
argument to geom_label_repel()
prevents the duplication of labels containing total numbers for full-time and part-time workers.
The scale_fill_manual()
is used to manually assign colors and names to our plot’s legend items, and guides(fill = guide_legend(nrow = 3))
arranges the legend items into three rows.
Finally, facet_wrap(~ region)
creates a faceted plot, by region.
And just as we did with the previous plot, let’s save our plot to the ‘output’ folder and print it to screen:
ggsave("output/plot_work.png",
plot_work,
width = 11, height = 8.5, units = "in")
print(plot_work)
Update: the COVID-19 Canada Data Explorer was endorsed by the Macdonald-Laurier Institute – one of Canada’s leading public policy think tanks!
Update 2: My public policy brief Designing COVID-19 Data Tools, co-authored with Ken Coates and Carin Holroyd, was published by the Johnson Shoyama Graduate School of Public Policy, University of Saskatchewan. The brief uses COVID-19 Canada Data Explorer as an example, and makes a comparative overview of COVID-19 data analysis and visualization tools available from Canada’s federal and provincial governments.
This is a more or less final version, so here is the source code for the app. And here is the code used to retrieve and pre-process geospatial data.
***
In the times of the pandemic, the data community can help in many ways, including by developing instruments to track and break down the data on the spread of the dreaded coronavirus disease. The COVID-19 Canada Data Explorer app was built with R, including Shiny
, Leaflet
, and plotly
, to process the official dataset available from the Government of Canada. They do have their own data visualization tool, but it is very basic. You can do so much more with the available data!
The data is downloaded from Canada.ca at 6-hour intervals to minimize the load on the repository – there is likely a high demand due to the pandemic. This means that there may be a delay of up to six hours from the time Canada.ca update their data to the moment it is updated on my server.
I hope that my app will help public health professionals, policymakers, and really anyone to stay informed about the course of SARS-CoV-2 epidemic in Canada.
Now that we are ready to start working with Canadian Census data, let’s first briefly address the question of why you may need it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered only every five years and addresses a limited set of questions.
The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country’s largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you’ll have to use the Census.
To illustrate the use of the cancensus
package, let’s do a small research project. First, in this post we’ll retrieve the following key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada:
The cities (metropolitan areas) that we are going to look at are Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We’ll also get these data for Canada as a whole, both for comparison and to illustrate the retrieval of data at different geographic levels.
Next, in Part 6 of the “Working with Statistics Canada Data in R” series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks.
Keep in mind that cancensus
also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp
and sf
formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in this series.
So, let’s get started by loading the required packages:
cancensus
retrieves census data with the get_census()
function. get_census()
can take a number of arguments, the most important of which are dataset
, regions
, and vectors
, which have no defaults. Thus, in order to retrieve census data, you’ll first need to figure out which dataset, regions, and vectors you need.
Let’s see which census datasets are available through the CensusMapper API:
Currently, datasets earlier than 1996 are not available, so if you need to work with pre-1996 census data, you won’t be able to retrieve it with cancensus
.
Next, let’s find the regions that we’ll be getting the data for. To search for census regions, use the search_census_regions()
function.
Let’s take a look at what region search returns for Toronto. Note that cancensus
functions return their output as dataframes, making it easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)]
to see all of it.
# all census levels
search_census_regions(searchterm = "Toronto",
dataset = "CA16")[c(1:5, 8)]
#> # A tibble: 3 x 6
#> region name level pop municipal_status PR_UID
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 35535 Toronto CMA 5928040 B 35
#> 2 3520 Toronto CD 2731571 CDR 35
#> 3 3520005 Toronto CSD 2731571 C 35
You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column level
for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the PR_UID
column that contains numeric codes for Canada’s provinces and territories. These codes can help you distinguish between different census regions that have the same or similar names but are located in different provinces. For an example, run the code above replacing “Toronto” with “Windsor”.
Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level
argument, which can take the following values: ‘C’ for Canada (national level), ‘PR’ for province, ‘CMA’ for census metropolitan area, ‘CD’ for census division, ‘CSD’ for census sub-division, or NA:
# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")
Let’s now list census regions that may be relevant for our project:
# explore available census regions
names <- c("Canada", "Calgary", "Halifax",
"Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))[c(1:5, 8)]
#> # A tibble: 19 x 6
#> region name level pop municipal_status PR_UID
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 01 Canada C 35151728 NA NA
#> 2 48825 Calgary CMA 1392609 B 48
#> 3 4806016 Calgary CSD 1239220 CY 48
#> 4 12205 Halifax CMA 403390 B 12
#> 5 1209 Halifax CD 403390 CTY 12
#> 6 1209034 Halifax CSD 403131 RGM 12
#> 7 2432023 Sainte-Sophie-d'Halifax CSD 612 MÉ 24
#> 8 35535 Toronto CMA 5928040 B 35
#> 9 3520 Toronto CD 2731571 CDR 35
#> 10 3520005 Toronto CSD 2731571 C 35
#> 11 59933 Vancouver CMA 2463431 B 59
#> 12 5915 Greater Vancouver CD 2463431 RD 59
#> 13 5915022 Vancouver CSD 631486 CY 59
#> 14 5915046 North Vancouver CSD 85935 DM 59
#> 15 5915051 North Vancouver CSD 52898 CY 59
#> 16 5915055 West Vancouver CSD 42473 DM 59
#> 17 5915020 Greater Vancouver A CSD 16133 RDA 59
#> 18 6001009 Whitehorse CSD 25085 CY 60
#> 19 6001060 Whitehorse, Unorganized CSD 326 NO 60
The purrr::map_df()
function applies search_census_regions()
iteratively to each element of the names
vector and returns output as a single dataframe. Note also the ~ .
syntax. Think of it as the tilde taking each element of names
and passing it as an argument to a place indicated by the dot in the search_census_regions()
function. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr
is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.
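If the tilde-dot syntax is new to you, here is a minimal illustration on toy data (no API calls involved):

```r
library(purrr)

# ~ .^2 is shorthand for function(x) x^2: the dot stands for
# the current element of the input vector
map_dbl(c(1, 2, 3), ~ .^2)
#> [1] 1 4 9

# map_df() works the same way, but row-binds each result
# into a single dataframe
map_df(c("a", "b"), ~ data.frame(letter = ., n = nchar(.)))
```

In the census code above, the dot stands for each city name in turn, passed to search_census_regions().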
Since there are multiple entries for each search term, we’ll need to choose the results for census metropolitan areas, or in case of Whitehorse, for census sub-division, since Whitehorse is too small to be considered a metropolitan area:
# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>%
filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) &
grepl("CMA", level) |
grepl("Canada|Whitehorse$", name)) %>%
as_census_region_list()
Pay attention to how the logical operators are used to filter the output by several conditions at once; also note the use of the $
regex meta-character to choose from the names
column the entry ending with ‘Whitehorse’ (to filter out ‘Whitehorse, Unorganized’).
Finally, as_census_region_list()
converts list_census_regions()
output to a data object of type list that can be passed to the get_census()
function as its regions argument.
Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census()
function.
cancensus
has two functions that allow you to search through census data variables: list_census_vectors()
and search_census_vectors()
.
list_census_vectors()
returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:
# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))
#> tibble [6,623 × 7] (S3: tbl_df/tbl/data.frame)
#> $ vector : chr [1:6623] "v_CA16_401" "v_CA16_402" "v_CA16_403" "v_CA16_404" ...
#> $ type : Factor w/ 3 levels "Female","Male",..: 3 3 3 3 3 3 3 3 2 1 ...
#> $ label : chr [1:6623] "Population, 2016" "Population, 2011" "Population percentage change, 2011 to 2016" "Total private dwellings" ...
#> $ units : Factor w/ 6 levels "Number","Percentage ratio (0.0-1.0)",..: 1 1 1 1 1 4 1 1 1 1 ...
#> $ parent_vector: chr [1:6623] NA NA NA NA ...
#> $ aggregation : chr [1:6623] "Additive" "Additive" "Average of v_CA16_402" "Additive" ...
#> $ details : chr [1:6623] "CA 2016 Census; Population and Dwellings; Population, 2016" "CA 2016 Census; Population and Dwellings; Population, 2011" "CA 2016 Census; Population and Dwellings; Population percentage change, 2011 to 2016" "CA 2016 Census; Population and Dwellings; Total private dwellings" ...
#> - attr(*, "last_updated")= POSIXct[1:1], format: "2020-12-21 23:27:47"
#> - attr(*, "dataset")= chr "CA16"
# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))
#> [1] 6623
As you can see, there are 6623 variables in the 2016 census dataset (as of this writing), so list_census_vectors()
won’t be the most convenient function to find a specific vector. Note however that there are situations (such as when you need to select a lot of vectors at once), in which list_census_vectors()
would be appropriate.
Usually it is more convenient to use search_census_vectors()
to search for vectors. Just pass the text string of what you are looking for as the searchterm
argument. You don’t have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term.
Let’s now find census data vectors for labor force involvement rates:
# get census data vectors for labor force involvement rates
lf_vectors <-
search_census_vectors(searchterm = "employment rate",
dataset = "CA16") %>%
union(search_census_vectors("participation rate", "CA16")) %>%
filter(type == "Total") %>%
pull(vector)
Let’s take a look at what this code does. Since searchterm
doesn’t have to be a precise match, the “employment rate” search term retrieves unemployment rate vectors too. In the next line, union()
merges dataframes returned by search_census_vectors()
into a single dataframe. Note that in this case union()
could be substituted with bind_rows()
. I recommend using union()
in order to avoid data duplication. Next, we choose only the “Total” numbers, since we are not going to plot labor force indicators by gender. Finally, the pull()
command extracts a single vector from the dataframe, just like the $
subsetting operator: we need lf_vectors
to be a data object of type vector in order to pass it to the vectors
argument of the get_census()
function.
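To see what pull() does, compare it with the $ operator on a toy tibble (the vector numbers below are made up for illustration):

```r
library(dplyr)

df <- tibble(vector = c("v_CA16_0001", "v_CA16_0002"),
             type   = c("Total", "Total"))

# pull() returns a plain character vector, not a one-column dataframe
pull(df, vector)
#> [1] "v_CA16_0001" "v_CA16_0002"

identical(pull(df, vector), df$vector)
#> [1] TRUE
```

The advantage of pull() over $ is simply that it works inside a pipe.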
There is another way to figure out search terms to put inside the search_census_vectors()
function: Statistics Canada’s online Census Profile tool. It can be used to quickly explore census data, as well as to figure out variables’ names (search terms) and their hierarchical structure.
For example, let’s look at the census labor data for Calgary metropolitan area. Scrolling down, you will quickly find the numbers and text labels for full-time and part-time workers:
Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data:
# get census data vectors for full- and part-time work
# get vectors and labels
work_vectors_labels <-
search_census_vectors("full year, full time", "CA16") %>%
union(search_census_vectors("part year and/or part time", "CA16")) %>%
filter(type != "Total") %>%
select(1:3) %>%
mutate(label = str_remove(label, ".*, |.*and/or ")) %>%
mutate(type = fct_drop(type)) %>%
setNames(c("vector", "gender", "type"))
# extract vectors
work_vectors <- work_vectors_labels$vector
Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no “Total” values here), we are creating a dataframe that assigns respective labels to each vector number. This work_vectors_labels
dataframe will supply categorical labels to be attached to the data retrieved with get_census()
.
Also, note these three lines:
mutate(label = str_remove(label, ".*, |.*and/or ")) %>%
mutate(type = fct_drop(type)) %>%
setNames(c("vector", "gender", "type"))
The first mutate()
call removes all text up to and including ,
and and/or
(spaces included) from the label
column. The second drops unused factor level “Total” – it is a good practice to make sure there are no unused factor levels if you are going to use ggplot2
to plot your data. Finally, setNames()
renames variables for convenience.
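Here is how that str_remove() pattern behaves on labels like the ones in the census data (the strings below are simplified approximations of the real labels):

```r
library(stringr)

labels <- c("Worked full year, full time",
            "Worked part year and/or part time")

# ".*, |.*and/or " removes everything up to and including
# either ", " or "and/or " (spaces included)
str_remove(labels, ".*, |.*and/or ")
#> [1] "full time" "part time"
```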
Finally, let’s retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I’d like to draw your attention to the fact that some of the census data is hierarchical, which means that some variables (census vectors) are included into parent and/or include child variables. It is very important to choose vectors at proper hierarchical levels so that you do not double-count or omit your data.
Education data is a good example of hierarchical data. You can explore data hierarchy using parent_census_vectors()
and child_census_vectors()
functions. However, you may find exploring the hierarchy visually to be more convenient:
So, let’s now retrieve and label the education data vectors:
# get vectors and labels
ed_vectors_labels <-
search_census_vectors("certificate", "CA16") %>%
union(search_census_vectors("degree", "CA16")) %>%
union(search_census_vectors("doctorate", "CA16")) %>%
filter(type != "Total") %>%
filter(grepl("25 to 64 years", details)) %>%
slice(-1,-2,-7,-8,-11:-14,-19,-20,-23:-28) %>%
select(1:3) %>%
mutate(label =
str_remove_all(label,
" cert.*diploma| dipl.*cate|, CEGEP| level")) %>%
mutate(label =
str_replace_all(label,
c("No.*" = "None",
"Secondary.*" = "High school or equivalent",
"other non-university" = "equivalent",
"University above" = "Cert. or dipl. above",
"medicine.*" = "health**",
".*doctorate$" = "Doctorate*"))) %>%
mutate(type = fct_drop(type)) %>%
setNames(c("vector", "gender", "level"))
# extract vectors
ed_vectors <- ed_vectors_labels$vector
Note the slice()
function that allows you to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice()
to drop the hierarchical levels from the data that are either too general or too granular. Note also that I had to edit text strings in the data. Finally, I added asterisks after “Doctorate” and “health”. These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on.
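slice() row selection in a nutshell, on toy data:

```r
library(dplyr)

df <- tibble(x = 1:5)

slice(df, 1, 3)       # keep rows 1 and 3
slice(df, -1, -3)     # drop rows 1 and 3, keeping rows 2, 4, and 5
```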
Now that we have figured out our dataset, regions, and data vectors (and labeled the vectors, too), we are finally ready to retrieve the data.
To retrieve census data, feed the dataset, regions, and data vectors into get_census()
as its respective arguments. Note that get_census()
has the use_cache
argument (set to TRUE
by default), which tells get_census()
to retrieve data from cache if available. If there is no cached data, the function will query CensusMapper API for the data and will save it in the cache, while use_cache = FALSE
will force get_census()
to query the API and update the cache.
# get census data for labor force involvement rates
# feed regions and vectors into get_census()
labor <-
get_census(dataset = "CA16",
regions = regions,
vectors = lf_vectors) %>%
select(-c(1, 2, 4:7)) %>%
setNames(c("region", "employment rate",
"unemployment rate",
"participation rate")) %>%
mutate(region = str_remove(region, " (.*)")) %>%
pivot_longer("employment rate":"participation rate",
names_to = "indicator",
values_to = "rate") %>%
mutate_if(is.character, as_factor)
The select()
call drops columns with irrelevant data. setNames()
renames columns to remove vector numbers from variable names – we don’t need vector numbers in variable names because variable names will be converted to values in the indicator
column. str_remove()
inside the mutate()
call drops municipal status codes ‘(B)’ and ‘(CY)’ from region names. Finally, the mutate_if()
line converts characters to factors for subsequent plotting.
An important function here is tidyr::pivot_longer()
. It converts the dataframe from wide to long format. It takes three columns: employment rate
, unemployment rate
, and participation rate
, and passes their names as values of the indicator
variable, while their numeric values are passed to the rate
variable. The reason for conversion is that we are going to plot the data for all three labor force indicators in the same graphic, which makes it necessary to store the indicators as a single factor variable.
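A minimal sketch of this wide-to-long conversion, with made-up numbers (not actual census values):

```r
library(tidyr)

wide <- tibble::tibble(region               = "Calgary",
                       `employment rate`    = 70.0,
                       `unemployment rate`  = 9.0,
                       `participation rate` = 77.0)

# one row per region-indicator pair
long <- pivot_longer(wide,
                     `employment rate`:`participation rate`,
                     names_to  = "indicator",
                     values_to = "rate")
long
```

The single wide row becomes three long rows, one per indicator, which is exactly the shape ggplot2 needs to map the indicator to an aesthetic.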
Next, let’s retrieve census data about the percent of full time vs part time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender:
# get census data for full-time and part-time work
work <-
get_census(dataset = "CA16",
regions = regions,
vectors = work_vectors) %>%
select(-c(1, 2, 4:7)) %>%
rename(region = "Region Name") %>%
pivot_longer(2:5, names_to = "vector",
values_to = "count") %>%
mutate(region = str_remove(region, " (.*)")) %>%
mutate(vector = str_remove(vector, ":.*")) %>%
left_join(work_vectors_labels, by = "vector") %>%
mutate(gender = str_to_lower(gender)) %>%
mutate_if(is.character, as_factor)
# get census data for education levels
education <-
get_census(dataset = "CA16",
regions = regions,
vectors = ed_vectors) %>%
select(-c(1, 2, 4:7)) %>%
rename(region = "Region Name") %>%
pivot_longer(2:21, names_to = "vector",
values_to = "count") %>%
mutate(region = str_remove(region, " (.*)")) %>%
mutate(vector = str_remove(vector, ":.*")) %>%
left_join(ed_vectors_labels, by = "vector") %>%
mutate_if(is.character, as_factor)
Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join()
function that joins labels to the census data.
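A toy illustration of what left_join() does here (the vector codes, counts, and labels are invented):

```r
library(dplyr)

counts <- tibble(vector = c("v1", "v2"),
                 count  = c(100, 40))
labels <- tibble(vector = c("v1", "v2"),
                 gender = c("Male", "Female"),
                 type   = c("full time", "part time"))

# every row of 'counts' gets its matching label columns attached,
# matched on the shared 'vector' column
left_join(counts, labels, by = "vector")
```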
We now have the data and are ready to visualize it, which will be done in the next part of this series.
For those of you who are outside of Canada, Canada’s geographic regions and their largest metropolitan areas are:
These regions should not be confused with 10 provinces and 3 territories, which are Canada’s sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada’s geographic regions and territorial structure here (pages 44 to 51).
For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada’s Guide to the Labour Force Survey.
You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts.
In the Introduction to the “Working with Statistics Canada Data in R” series, I discussed the three main types of data available from Statistics Canada. It is now time to move on to the second of those data types – the Canadian National Census data.
cancensus
is a specialized R package that allows you to retrieve Statistics Canada census data and geography. It follows the tidy approach to data processing. The package’s authors are Jens von Bergmann, Dmitry Shkolnik, and Aaron Jacobs. I am not associated with the authors; I just use this great package regularly in my work. The package’s GitHub repository can be found here. There is also a tutorial by the package’s authors, which I recommend taking a look at before proceeding.
Further in this series, I will provide an in-depth real-life example of working with Canadian Census data using the cancensus
package. But first, let’s install cancensus
:
install.packages("cancensus")
library(cancensus)
cancensus
relies on queries to the CensusMapper API, which requires a CensusMapper API key. The key can be obtained for free as per the package authors’ instructions. Once you have the key, you can add it to your R environment:
options(cancensus.api_key = "CensusMapper_your_api_key")
Note that although the authors warn that API requests are limited in volume, I have significantly exceeded my API quota on some occasions and still had no issues retrieving data.
That said, depending on how much data you need, you can draw down your quota very quickly, and here’s where local cache comes to the rescue. cancensus
caches data every time you retrieve it, but by default the cache is not persistent between sessions. To make it persistent, as well as to remove the need to enter your API key every time you use cancensus
, you can add both the API key and the cache path to your .Rprofile
file.
If you’d like to learn in-depth what .Rprofile
is and how to edit it, consider taking a look at sections 2.4.2 to 2.4.5 of “Efficient R Programming” by Colin Gillespie and Robin Lovelace. For a quick and simple edit, just keep reading.
First, find your R home directory with R.home()
. Most likely, it will be in /usr/lib/R. If that is where your R home is, in the Linux Terminal (not in R), run:
sudo nano /usr/lib/R/library/base/R/.Rprofile # edit path if needed
On some systems, .Rprofile may not be hidden, so if the above command doesn’t open the .Rprofile
file, try removing the dot before ‘Rprofile’:
sudo nano /usr/lib/R/library/base/R/Rprofile
( ! ) Note that this will edit the system .Rprofile
file, which will always run on startup and will apply to all your R projects. The file itself will warn you that “it is a bad idea to use this file as a template for personal startup files”. You can safely ignore this warning as long as the only edit you are making is the one shown below, i.e. adding cancensus
cache path and API key.
In the “options” section, add these two lines (remember to paste in your CensusMapper API key and a path to the directory where you’d like to keep your cancensus
cache):
options(cancensus.api_key = "CensusMapper_your_api_key")
options(cancensus.cache_path = "/home/your_username/path_to_your_R_directory/.cancensus_cache")
Then hit Ctrl+X
, choose Y
for “yes”, and press Enter
to save changes. When you first retrieve data with cancensus
, R will create .cancensus_cache
directory for you.
Editing .Rprofile
on Windows is a bit tricky. The best thing for you would be not to touch the Windows system .Rprofile
, or else risk weird errors and crashes (I was not able to figure out where they come from).
Instead, set up a project-specific .Rprofile
. The downside is that you may need to set it separately for every project in which you are going to use cancensus
. The upside is that the contents of the .Rprofile
file should be exactly the same every time. In R or Rstudio (not in the command line), run:
file.edit(".Rprofile")
Or use the usethis::edit_r_profile()
function from the usethis
package.
Then, add these two lines and save the file:
options(cancensus.api_key = "CensusMapper_your_api_key")
options(cancensus.cache_path = "C:\\Users\\Home\\Documents\\R\\cancensus_cache")
Note that when used inside R, backslash \
symbols in file paths in Windows may need to be escaped with another \
symbol. If this file path doesn’t work, try replacing duplicate \\
with a single \
.
At this point you should be ready to start retrieving data with cancensus
, which will be addressed in detail in my next post.
In the previous part of this series, we retrieved CANSIM data on the weekly wages of Aboriginal and Non-Aboriginal Canadians aged 25 years and older living in Saskatchewan, as well as the CPI data for the same province. We then used the CPI data to adjust the wages for inflation, and saved the results as the wages_0370
dataset.
To get started, let’s take a quick look at the dataset, what types of variables it contains, which should be considered categorical, and what unique values categorical variables have:
# explore wages_0370 before plotting
glimpse(wages_0370)
#> Rows: 39
#> Columns: 6
#> $ year <chr> "2007", "2007", "2007", "2008", "2008", "2008", "2009", "2009", "2009", "…
#> $ group <chr> "First Nations", "Métis", "Non-Aboriginal population", "First Nations", "…
#> $ current_dollars <dbl> 707.93, 761.75, 797.31, 675.95, 812.45, 855.09, 793.98, 807.23, 892.58, 8…
#> $ cpi <dbl> 112.2, 112.2, 112.2, 115.9, 115.9, 115.9, 117.1, 117.1, 117.1, 118.7, 118…
#> $ infrate <dbl> 0.00000000, 0.00000000, 0.00000000, 0.03297683, 0.03297683, 0.03297683, 0…
#> $ dollars_2007 <dbl> 707.93, 761.75, 797.31, 654.37, 786.51, 827.79, 760.76, 773.45, 855.23, 8…
map(wages_0370, class)
#> $year
#> [1] "character"
#>
#> $group
#> [1] "character"
#>
#> $current_dollars
#> [1] "numeric"
#>
#> $cpi
#> [1] "numeric"
#>
#> $infrate
#> [1] "numeric"
#>
#> $dollars_2007
#> [1] "numeric"
map(wages_0370[c(1, 2)], unique)
#> $year
#> [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018" "2019"
#>
#> $group
#> [1] "First Nations" "Métis" "Non-Aboriginal population"
The first two variables – year
and group
– are of the type “character”, and the rest are numeric of the type “double” (“double” simply means they are not integers, i.e. they can have decimals).
Also, we can see that wages_0370
dataset is already in the tidy format, which is an important prerequisite for plotting with ggplot2
package. Since ggplot2
is included into tidyverse
, there is no need to install it separately.
At this point, our data is almost ready to be plotted, but we need to make one final change. Looking at the unique values, we can see that the first two variables (year
and group
) should be numeric (integer) and categorical respectively, while the rest are continuous (as they should be).
In R, categorical variables are referred to as factors. It is important to expressly tell R which variables are categorical, because mapping ggplot2
aesthetics – things that go inside aes()
– to a factor variable makes ggplot2
use a discrete color scale (distinctly different colors) for different categories (different factor levels in R terms). Otherwise, values would be plotted to a gradient, i.e. different hues of the same color. There are several other reasons to make sure you expressly identify categorical variables as factors if you are planning to visualize your data. I understand that this might be a bit too technical, so if you are interested, you can find more here and here. For now, just remember to convert your categorical variables to factors if you are going to plot your data. Ideally, do it always – it is a good practice to follow.
( ! ) It is a good practice to always convert categorical variables to factors.
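To see why this matters, here is a minimal sketch with toy data (the variable names are made up for illustration): mapping color to a numeric variable produces a continuous gradient, while mapping it to a factor produces distinct colors, one per level.

```r
library(ggplot2)

toy <- data.frame(x = 1:6, y = c(2, 4, 3, 5, 4, 6), grp = rep(1:3, 2))

# numeric grouping variable -> continuous color gradient
ggplot(toy, aes(x, y, color = grp)) + geom_point(size = 3)

# the same variable as a factor -> discrete colors, one per level
ggplot(toy, aes(x, y, color = factor(grp))) + geom_point(size = 3)
```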
So, let’s do it: convert year
to an integer, and group
to a factor. Before doing so, let’s remove the word “population” from the “Non-Aboriginal population” category, so that our plot’s legend takes less space inside the plot. We can also replace the accented “é” with an ordinary “e” to make typing in our IDE easier. Note that the order is important: first we edit the string values of a “character” class variable, and only then convert it to a factor. Otherwise, our factor will have missing levels.
( ! ) Converting a categorical variable to a factor should be the last step in cleaning your dataset.
wages_0370 <- wages_0370 %>%
mutate(group = str_replace_all(group, c(" population" = "", "é" = "e"))) %>%
mutate_at("year", as.integer) %>%
mutate_if(is.character, as_factor)
Note: if you only need to remove string(s), use str_remove or str_remove_all:
mutate(group = str_remove(group, " population"))
Finally, we are ready to plot the data with ggplot2
:
plot_wages_0370 <-
wages_0370 %>%
drop_na() %>%
ggplot(aes(x = year, y = dollars_2007,
color = group)) +
geom_point(size = 2) +
geom_line(size = .7) +
geom_label(aes(label = round(dollars_2007)),
# alt: use geom_label_repel() # requires ggrepel
fontface = "bold",
label.size = .4, # label border thickness
size = 2, # label size
# force = .005, # repelling force: requires ggrepel
show.legend = FALSE) +
coord_cartesian(ylim = c(650, 1000)) + # best practice to set scale limits
scale_x_continuous(breaks = 2007:2018) +
scale_y_continuous(name = "2007 dollars",
breaks = seq(650, 1000, by = 50)) +
scale_color_manual(values = c("First Nations" = "tan3",
"Non-Aboriginal" = "royalblue",
"Metis" = "forestgreen")) +
theme_bw() +
theme(plot.title = element_text(size = 10,
face = "bold",
hjust = .5,
margin = margin(b = 10)),
plot.caption = element_text(size = 8),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85"),
axis.text = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9, face = "bold",
margin = margin(r = 10)),
legend.title = element_blank(),
legend.position = "bottom",
legend.text = element_text(size = 9, face = "bold")) +
labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")
print(plot_wages_0370)
Let’s now look at what this code does. We start by feeding our data object wages_0370
into the ggplot()
function using a pipe.
( ! ) Note that ggplot2
’s internal syntax differs from the tidyverse
pipe-based syntax: it uses +
instead of %>%
to join code into blocks.
Next, inside the ggplot(aes())
call we assign aesthetics common to all layers, and then proceed to choosing the geoms we need. If needed, we can assign additional aesthetics, or override the common ones, inside individual geoms, as we did when we told geom_label()
to use dollars_2007
variable values (rounded to the dollar) as labels. If you’d like to find out more about what layers are, and about the ggplot2
grammar of graphics, I recommend this article by Hadley Wickham.
The plot type in each layer is determined by the geom_*()
functions. Here are the geoms:
geom_point(size = 2) +
geom_line(size = .7) +
geom_label(aes(label = round(dollars_2007)),
# alt: use geom_label_repel() # requires ggrepel
fontface = "bold",
label.size = .4, # label border thickness
size = 2.5, # label size
# force = .005, # repelling force: requires ggrepel
show.legend = FALSE)
Choosing a plot type is largely a judgement call, but you should always choose the type of graphic that best suits the data you have. In this case, our goal is to reveal the dynamics of wages in Saskatchewan over time, hence our choice of geom_line()
. Note that the lines in our graphic are a visual aid only – they make it easier for the eye to follow the trend. They are not substantively meaningful like they would be, for example, in a regression plot. geom_point()
is also there primarily for visual purposes – to make the plot’s legend more visible. Note that unlike the lines, the points in this plot are substantively meaningful, i.e. they are exactly where our data is (but are covered by labels). If you don’t like the labels in the graphic, you can use points instead.
Finally, geom_label()
plots our substantive data. Note that I am using the show.legend = FALSE
argument – this is simply because I don’t like the look of geom_label()
legend symbols, and prefer a combined line+point symbol instead. If you prefer label symbols in the plot’s legend, remove the show.legend = FALSE
argument from the geom_label()
call, and add it to geom_line()
and geom_point()
.
You may have noticed some commented lines in the ggplot()
call. You may also have noticed that some labels in our graphic overlap slightly. In this case the overlap is minute and can be ignored. But what if there are many overlapping data points, enough to affect readability?
Fortunately, there is a package to solve this problem for the graphics that use text labels: ggrepel
. It has *_repel()
versions of ggplot2::geom_label()
and ggplot2::geom_text()
functions, which repel the labels away from each other and away from data points.
install.packages("ggrepel")
library("ggrepel")
ggrepel
functions can take the same arguments as corresponding ggplot2
functions, and also take the force
argument, which defines the repelling force between overlapping labels. I recommend setting it to a small value, as the default of 1 seems way too strong.
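For instance, the commented-out lines in our plot code can be activated by swapping geom_label() for geom_label_repel(). A sketch of the replacement call (to be dropped into the ggplot() block above in place of geom_label()):

```r
library(ggrepel)

# drop-in replacement for the geom_label() call in the plot code above
geom_label_repel(aes(label = round(dollars_2007)),
                 fontface = "bold",
                 label.size = .4,  # label border thickness
                 size = 2,         # label text size
                 force = .005,     # repelling force; the default of 1 is too strong here
                 show.legend = FALSE)
```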
Here is what our graphic looks like now. Note that the nearby labels no longer overlap:
This is where plot axes and scales are defined:
coord_cartesian(ylim = c(650, 1000)) + # best practice to set scale limits
scale_x_continuous(breaks = 2007:2018) +
scale_y_continuous(name = "2007 dollars",
breaks = seq(650, 1000, by = 50)) +
scale_color_manual(values = c("First Nations" = "tan3",
"Non-Aboriginal" = "royalblue",
"Metis" = "forestgreen"))
coord_cartesian()
is the function I’d like to draw your attention to, as it is the best way to zoom the plot, i.e. to get rid of unnecessary empty space. Since we don’t have any values less than 650 or more than 950 (approximately), starting our Y scale at 0 would result in a less readable plot, where most space would be empty, and the space where we actually have data would be crowded. If you are interested in why coord_cartesian()
is the best way to set axis limits, there is an in-depth explanation.
( ! ) It is a good practice to use coord_cartesian()
to change axis limits.
Next, we edit our plot theme:
theme_bw() +
theme(plot.title = element_text(size = 10,
face = "bold",
hjust = .5,
margin = margin(b = 10)),
plot.caption = element_text(size = 8),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85"),
axis.text = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9, face = "bold",
margin = margin(r = 10)),
legend.title = element_blank(),
legend.position = "bottom",
legend.text = element_text(size = 9, face = "bold"))
First I selected a simple black-and-white theme, theme_bw(), and then overrode some of the theme’s default settings in order to improve the plot’s readability and overall appearance. Which theme and settings to use is up to you – just make sure that whatever you do makes the plot easier to read and comprehend at a glance. Here you can find out more about editing the plot theme.
Finally, we enter the plot title and plot captions. Captions are used to provide information about the sources of our data. Note the use of \n (the new-line symbol) to break strings into multiple lines:
labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")
The last step is to save the plot so that we can use it externally: insert into reports and other publications, publish online, etc.
# save plot
ggsave("plot_wages_0370.svg", plot_wages_0370)
ggsave()
takes various arguments, but only one is mandatory: file name as a string. The second argument plot
defaults to the last plot displayed, but it is advisable to name the plot expressly to make sure the right one gets saved. You can find out more about how ggsave()
works here.
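For instance, to name the plot expressly and control the output dimensions (the dimensions here are arbitrary – pick whatever suits your publication):

```r
# save the named plot with explicit dimensions
ggsave("plot_wages_0370.svg",
       plot = plot_wages_0370,
       width = 9, height = 6, units = "in")
```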
My favorite format for saving graphics is SVG, which stands for Scalable Vector Graphics – an extremely lightweight vector format that keeps the graphic pixel-perfect at any resolution. Note, however, that an SVG is not really a raster image like a JPEG or PNG, but a chunk of XML code, which entails certain security implications when using SVG files online.
This was the last of the three articles about working with CANSIM data. In the next article in the “Working with Statistics Canada Data in R” series, I’ll move on to working with the national census data.
Most CANSIM data can be accessed in two formats: as data tables and, at a much more granular level, as individual data vectors. This post is structured accordingly:
CANSIM data is stored in tables containing data on a specific subject. Individual entries (rows) and groups of entries in these tables are usually assigned vector codes. Thus, unless we already know what table or vector we need, finding the correct table should be our first step.
As always, start with loading packages:
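In this series the data is retrieved with the cansim package and cleaned with the tidyverse, so a minimal setup looks like this:

```r
# install once, if you haven't already
# install.packages("cansim")
# install.packages("tidyverse")

library(cansim)    # retrieves CANSIM (StatCan) tables and vectors
library(tidyverse) # dplyr, stringr, readr, ggplot2, etc.
```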
Now we can start retrieving CANSIM data. How we do this depends on whether we already know the table or vector numbers. If we do, things are simple: just use get_cansim()
to retrieve data tables, or get_cansim_vector()
to retrieve vectors.
But usually we don’t. In this case one option is to use StatCan’s online search tool. Eventually you will find what you’ve been looking for, although you might also miss a few things – CANSIM is not that easy to search and work with manually.
A much better option is to let R do the tedious work for you with a few lines of code.
In this example (based on the code I wrote for a research project), let’s look for CANSIM tables that refer to Aboriginal (Indigenous) Canadians anywhere in the tables’ titles, descriptions, keywords, notes, etc.:
# create an index to subset list_cansim_tables() output
index <- list_cansim_tables() %>%
Map(grepl, "(?i)aboriginal|(?i)indigenous", .) %>%
Reduce("|", .) %>%
which()
# list all tables with Aboriginal data, drop redundant cols
tables <- list_cansim_tables()[index, ] %>%
select(c("title", "keywords", "notes", "subject",
"date_published", "time_period_coverage_start",
"time_period_coverage_end", "url_en",
"cansim_table_number"))
Let’s look in detail at what this code does. First, we call the list_cansim_tables()
function, which returns a tibble dataframe, where each row provides a description of one CANSIM table. To get a better idea of list_cansim_tables()
output, run:
glimpse(list_cansim_tables())
Then we search through the dataframe for Regex patterns matching our keywords. Note the (?i)
flag – it tells Regex to ignore case when searching for patterns; alternatively, you can pass ignore.case = TRUE
argument to grepl()
. The Map()
function allows us to search for patterns in every column of the dataframe returned by list_cansim_tables()
. This step returns a very long list of logical values.
Our next step is to Reduce()
the list to a logical vector, where each value is either FALSE
if there was not a single search term match per CANSIM table description (i.e. per row of list_cansim_tables output), or TRUE
if there were one or more matches. The which()
function gives us the numeric indices of TRUE
elements in the vector.
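Here is the same Map() / Reduce() / which() logic applied to a toy two-column dataframe (with made-up values), to make each step concrete:

```r
toy <- data.frame(title = c("Aboriginal wages", "Wheat exports", "Consumer prices"),
                  notes = c("", "indigenous employment", ""))

# one logical vector per column: does each row match the pattern?
matches <- Map(grepl, "(?i)aboriginal|(?i)indigenous", toy)

# collapse the column-wise results into one vector:
# TRUE if any column matched for that row
any_match <- Reduce("|", matches)  # TRUE TRUE FALSE

# row indices of the matching rows
which(any_match)  # 1 2
```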
Finally, we subset the list_cansim_tables()
output by index. Since there are many redundant columns in the resulting dataframe, we select only the ones that contain potentially useful information.
There is also a simpler approach, which immediately returns a dataframe of tables:
tables <- list_cansim_tables() %>%
filter(grepl("(?i)aboriginal|(?i)indigenous", title)) %>%
select(c("title", "keywords", "notes", "subject",
"date_published", "time_period_coverage_start",
"time_period_coverage_end", "url_en",
"cansim_table_number"))
However, keep in mind that this code searches only through the column or columns of the list_cansim_tables()
output that were specified inside the grepl()
call (in this case, the title column). This results in fewer tables listed in tables
: 60 vs. the 73 you get if you search by index (as of the time of writing). Often simple filtering is sufficient, but if you want to be extra certain you haven’t missed anything, search by index as shown above.
( ! ) Note that some CANSIM data tables do not get updated after the initial release, so always check the date_published
and time_period_coverage_*
attributes of the tables you are working with.
Finally, it could be a good idea to externally save the dataframe with the descriptions of CANSIM data tables in order to be able to view it as a spreadsheet. Before saving, let’s re-arrange the columns in a more logical order for viewers’ convenience, and sort the dataframe by CANSIM table number.
# re-arrange columns for viewing convenience
tables <- tables[c("cansim_table_number", "title", "subject",
"keywords", "notes", "date_published",
"time_period_coverage_start",
"time_period_coverage_end", "url_en")] %>%
arrange(cansim_table_number)
# and save externally
write_delim(tables, "selected_data_tables.txt", delim = "|")
( ! ) Note that I am using write_delim()
function instead of a standard write.csv()
or tidyverse::write_csv()
, with |
as a delimiter. I am doing this because there are many commas inside strings in CANSIM data, and saving as a comma-separated file would cause incorrect breaking down into columns.
Now, finding the data tables can be as simple as looking through the tables dataframe
or through the selected_data_tables.txt
file.
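To read the pipe-delimited file back into R later, pass the same delimiter to readr’s read_delim():

```r
library(readr)

# read the file back using "|" as the delimiter
tables <- read_delim("selected_data_tables.txt", delim = "|")
```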
In order for the examples here to feel relevant and practical, let’s suppose we were asked to compare and visualize the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in a specific province (let’s say, Saskatchewan), adjusted for inflation.
Since we have already searched for all CANSIM tables that contain data about Aboriginal Canadians, we can easily figure out that we need CANSIM table #14-10-0370. Let’s retrieve it:
wages_0370 <- get_cansim("14-10-0370")
Note that a few CANSIM tables are too large to be downloaded and processed in R as a single dataset. However, below I’ll show you a simple way to work with them.
CANSIM tables have a lot of redundant data, so let’s quickly examine the dataset to decide which variables can be safely dropped in order to make working with the dataset more convenient:
names(wages_0370)
glimpse(wages_0370)
( ! ) Before we proceed further, take a look at the VECTOR
variable – this is how we can find out individual vector codes for specific CANSIM data. More on that below.
Let’s now subset the data by province, drop redundant variables, and rename the remaining in a way that makes them easier to process in the R language. Generally, I suggest following The tidyverse Style Guide by Hadley Wickham. For instance, variable names should use only lowercase letters, numbers, and underscores instead of spaces:
wages_0370 <- wages_0370 %>%
filter(GEO == "Saskatchewan") %>%
select(-c(2, 3, 7:12, 14:24)) %>%
setNames(c("year", "group", "type", "age", "current_dollars"))
Next, let’s explore the dataset again. Specifically, let’s see what unique values categorical variables year
, group
, type
, age
have. No need to do this with current_dollars
, as all or most of its values would inevitably be unique due to it being a continuous variable.
map(wages_0370[1:4], unique)
The output looks as follows:
#> $year
#> [1] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"
#>
#> $group
#> [1] "Total population" "Aboriginal population" "First Nations" "Métis" "Non-Aboriginal population"
#>
#> $type
#> [1] "Total employees" "Average hourly wage rate" "Average weekly wage rate" "Average usual weekly hours"
#>
#> $age
#> [1] "15 years and over" "15 to 64 years" "15 to 24 years" "25 years and over" "25 to 54 years"
Now we can decide how to further subset the data.
We obviously need data for as many years as we can get, so we keep all the years from the year
variable.
For the group
variable, we need Aboriginal and Non-Aboriginal data, but the “Aboriginal” category has two sub-categories: “First Nations” and “Métis”. It is our judgment call which to go with. Let’s say we want our data to be more granular and choose “First Nations”, “Métis”, and “Non-Aboriginal population”.
As far as the type
variable is concerned, things are simple: we are only interested in the weekly wages, i.e. “Average weekly wage rate”. Note that we are using the data on average wages because for some reason CANSIM doesn’t provide the data on median wages for Aboriginal Canadians. Using average wages is not a commonly accepted way of analyzing wages, as it allows a small number of people with very high-paying jobs to distort the data, making wages look higher than they actually are. This happens because the mean is highly sensitive to large outliers, while the median is not. But well, one can only work with the data one has.
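A quick illustration of that sensitivity with made-up wage values: one very high earner pulls the mean far above what most people earn, while the median barely moves.

```r
wages <- c(800, 820, 850, 900, 5000)  # one very high earner among typical wages

mean(wages)    # 1674 -- distorted upwards by the outlier
median(wages)  # 850  -- still representative of the typical wage
```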
Finally, we need only one age group: “25 years and over”.
Having made these decisions, we can subset the data. We also drop two categorical variables (type
and age
) we no longer need, as both these variables would now have only one level each:
wages_0370 <- wages_0370 %>%
filter(grepl("25 years", age) &
grepl("First Nations|Métis|Non-Aboriginal", group) &
grepl("weekly wage", type)) %>%
select(-c("type", "age"))
What I just did step by step for demonstration purposes can be done in a single block of code to minimize typing and remove unnecessary repetition. The “pipe” operator %>%
makes this easy. In RStudio, you can use the Shift+Ctrl+M
shortcut to insert %>%
:
wages_0370 <- get_cansim("14-10-0370") %>%
filter(GEO == "Saskatchewan") %>%
select(-c(2, 3, 7:12, 14:24)) %>%
setNames(c("year", "group", "type",
"age", "current_dollars")) %>%
filter(grepl("25 years", age) &
grepl("First Nations|Métis|Non-Aboriginal", group) &
grepl("weekly wage", type)) %>%
select(-c("type", "age"))
Now that we have our weekly wages data, let’s adjust the wages for inflation, otherwise the data simply won’t be meaningful. In order to be able to do this, we need to get the information about the annual changes in the Consumer Price Index (CPI) in Canada, since the annual change in CPI is used as a measure of inflation. Let’s take a look at what CANSIM has on the subject:
# list tables with CPI data, exclude the US
cpi_tables <- list_cansim_tables() %>%
filter(grepl("Consumer Price Index", title) &
!grepl("United States", title))
Even when we search using filter()
instead of indexing, and remove the U.S. data from search results, we still get a list of 20 CANSIM tables with multiple vectors in each. How do we choose the correct data vector? There are two main ways we can approach this.
First, we can use other sources to find out exactly which vectors to use. For example, we can take a look at how the Bank of Canada calculates inflation. According to the Bank of Canada’s “Inflation Calculator” web page, they use CANSIM vector v41690973 (Monthly CPI Indexes for Canada) to calculate inflation rates. So we can go ahead and retrieve this vector:
# retrieve vector data
cpi_monthly <- get_cansim_vector("v41690973",
start_time = "2007-01-01",
end_time = "2018-12-31")
Since the data in the wages_0370
dataset covers the period from 2007 till 2018, we retrieve CPI data for the same period. The function takes two main arguments: vectors
– the list of vector numbers (as strings), and start_time
– starting date as a string in YYYY-MM-DD format. Since we don’t need data past 2018, we also add an optional end_time
argument. Let’s take a look at the result of our get_cansim_vector()
call:
glimpse(cpi_monthly)
The resulting dataset contains monthly CPI indexes (take a look at the REF_DATE
variable). However, our wages_0370
dataset only has the annual data on wages. What shall we do?
Well, one option could be to calculate annual CPI averages ourselves:
# calculate mean annual CPI values
cpi_annual <- cpi_monthly %>%
mutate(year = str_remove(REF_DATE, "-.*-01")) %>%
group_by(year) %>%
transmute(cpi = round(mean(VALUE), 2)) %>%
unique()
Alternatively, we could look through CANSIM tables to find annual CPI values that have already been calculated by Statistics Canada.
Thus, the second way to find which vectors to use is to look through the relevant CANSIM tables. This might be more labor-intensive, but can lead to more precise results. It is also the way to go if we can’t find vector numbers from other sources.
Let’s look at cpi_tables
. Table # 18-10-0005 has “Consumer Price Index, annual average” in its title, so this is probably what we need.
# get CANSIM table with annual CPI values
cpi_annual_table <- get_cansim("18-10-0005")
Let’s now explore the data:
# explore the data
map(cpi_annual_table[1:4], unique)
# unique(cpi_annual_table$VECTOR)
Turns out, the data is much more detailed than in the vector v41690973. Remember that in wages_0370
we selected the data for a specific province (Saskatchewan)? Well, table # 18-10-0005 has CPI breakdown by province and even by specific groups of products. This is just what we need! However, if you run unique(cpi_annual_table$VECTOR)
, you’ll see that table # 18-10-0005 includes data from over 2000 different vectors – it is a large dataset. So, how do we choose the right vector? By narrowing down the search:
# find out vector number from CANSIM table
cpi_annual_table %>%
rename(product = "Products and product groups") %>%
filter(GEO == "Saskatchewan" &
product == "All-items") %>%
select(VECTOR) %>%
unique() %>%
print()
This gives us CANSIM vector number for the “all items” group CPI for the province of Saskatchewan: v41694489.
Some CANSIM tables are too large to be retrieved using the cansim
package. For example, running get_cansim("12-10-0136")
will result in a long wait followed by a “Problem downloading data, multiple timeouts” error. cansim
will also advise you to “check your network connection”, but the network connection is not the problem – the size of the dataset is.
CANSIM table 12-10-0136 is very large: 16.2 GB. By default, R loads the full dataset into RAM, which can make things painfully slow when dealing with huge datasets. There are solutions for datasets <10 GB in size, but anything larger than 10 GB requires either distributed computing, or retrieving data chunk-by-chunk. In practice, in R you are likely to start experiencing difficulties and slowdowns if your datasets exceed 1 GB.
Suppose we need to know how much wheat (in dollars) Canada exports to all other countries. CANSIM table 12-10-0136 “Canadian international merchandise trade by industry for all countries” has this data. But how do we get the data from this table if we can’t directly read it into R, and even if we manually download and unzip the dataset, R won’t be able to handle 16.2 GB of data?
This is where CANSIM data vectors come to the rescue. We need to get only one vector instead of the whole enormous table. To do that, we need to know the vector’s number, but we can’t look for it inside the table because the table is too large.
The solution is to find the correct vector using Statistics Canada Data search tool. Start with entering the table number in the “Keyword(s)” field. Obviously, you can search by keywords too, but searching by table number is more precise:
Then click on the name of the table: “Canadian international merchandise trade by industry for all countries”. After the table preview opens, click “Add/Remove data”:
The “Customize table (Add/Remove data)” menu will open. It has the following tabs: “Geography”, “Trade”, “Trading Partners”, “North American Industry Classification System (NAICS)”, “Reference Period”, and “Customize Layout”. Note that the selection of tabs depends on the data in the table.
Now, do the following:
In the “Geography” tab, do nothing – just make sure “Canada” is checked.
If you followed all the steps, here’s what your output should look like:
The vector number for Canada’s wheat exports to all other countries is v1063958702.
Did you notice that the output on screen has data only for a few months in 2019? This is just a sample of what our vector holds. If you click “Reference period”, you’ll see that table 12-10-0136 has data for the period from January 2002 to October 2019:
Now we can retrieve the data we’ve been looking for:
# get and clean wheat exports data
wheat_exports <-
get_cansim_vector("v1063958702", "2002-01-01", "2019-10-31") %>%
mutate(ref_date = lubridate::ymd(REF_DATE)) %>%
rename(dollars = VALUE) %>%
select(-c(1, 3:9))
# check object size
object.size(wheat_exports)
The resulting wheat_exports
object is only 4640 bytes in size: about 3,500,000 times smaller than the table it came from!
Note that I used lubridate::ymd()
function inside the mutate()
call. This is not strictly required, but wheat_exports
contains a time series, so it makes sense to convert the reference date column to an object of class “Date”. Since lubridate
is not loaded with tidyverse
(it is part of the tidyverse
ecosystem, but only core components are loaded by default), I had to call the ymd()
function with lubridate::
.
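For example, ymd() parses a “YYYY-MM-DD” string into a proper Date object, which then supports date arithmetic:

```r
library(lubridate)

d <- ymd("2002-01-01")
class(d)  # "Date"
d + 31    # "2002-02-01" -- adding days, not just numbers
```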
Finally, note that StatCan Data has another search tool that allows you to search by vector numbers. Unlike the data tables search tool, the vector search tool is very simple, so I’ll not be covering it in detail. You can find it here.
Let’s now get provincial annual CPI data and clean it up a bit, removing all the redundant stuff and changing the VALUE
variable name to something in line with The tidyverse Style Guide:
# get mean annual CPI for Saskatchewan, clean up data
cpi_sk <-
get_cansim_vector("v41694489", "2007-01-01", "2018-12-31") %>%
mutate(year = str_remove(REF_DATE, "-01-01")) %>%
rename(cpi = VALUE) %>%
select(-c(1, 3:9))
As usual, you can directly feed the output of one code block into another with the %>%
(“pipe”) operator:
# feed the vector number directly into get_cansim_vector()
cpi_sk <- cpi_annual_table %>%
rename(product = "Products and product groups") %>%
filter(GEO == "Saskatchewan" &
product == "All-items") %>%
select(VECTOR) %>%
unique() %>%
as.character() %>%
get_cansim_vector(., "2007-01-01", "2018-12-31") %>%
mutate(year = str_remove(REF_DATE, "-01-01")) %>%
rename(cpi = VALUE) %>%
select(-c(1, 3:9))
By this point you may be wondering if there is an inverse operation, i.e. whether we can find a CANSIM table number when all we know is a vector number. We sure can! This is what the get_cansim_vector_info()
function is for. And we can retrieve the table, too:
# find out CANSIM table number if you know a vector number
get_cansim_vector_info("v41690973")$table
# get table if you know a vector number
cpi_table <- get_cansim_vector_info("v41690973")$table %>%
get_cansim()
Now that we have both weekly wages data (wages_0370
) and CPI data (cpi_sk
), we can calculate the inflation rate and adjust the wages for inflation. The formula for calculating the inflation rate for the period from the base year to year X is: (CPI in year X – CPI in base year) / CPI in base year
.
If we wanted the inflation rate to be expressed as a percentage, we would have multiplied the result by 100, but here it is more convenient to have inflation expressed as a proportion:
# calculate the inflation rate
cpi_sk$infrate <- (cpi_sk$cpi - cpi_sk$cpi[1]) / cpi_sk$cpi[1]
Now join the resulting dataset to wages_0370
with dplyr::left_join()
. Then use the inflation rates data to adjust wages for inflation with the following formula: base year $ = current year $ / (1 + inflation rate)
:
# join inflation rate data to wages_0370; adjust wages for inflation
wages_0370 <- wages_0370 %>%
left_join(cpi_sk, by = "year") %>%
mutate(dollars_2007 = round(current_dollars / (1 + infrate), 2))
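As a quick sanity check of both formulas with made-up numbers (not real CPI values): a CPI that rises from 100 to 110 means an inflation rate of 0.1, and deflating a current-year wage by that rate recovers base-year dollars.

```r
cpi_base <- 100
cpi_x    <- 110

# inflation rate from base year to year X, as a proportion
infrate <- (cpi_x - cpi_base) / cpi_base  # 0.1

# adjust a current-year wage back to base-year dollars
current_dollars <- 550
dollars_base    <- current_dollars / (1 + infrate)  # 500
```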
We are now ready to move on to the next article in this series and plot the resulting data.