Update 3: This is a more or less final version, so here is the source code for the app. And here is the code used to retrieve and pre-process geospatial data.
* * *
In the times of the pandemic, the data community can help in many ways, including by developing instruments to track and break down the data on the spread of the dreaded coronavirus disease. The COVID-19 Canada Data Explorer app was built with R, including Shiny and Leaflet, to process the official dataset available from the Government of Canada. They do have their own data visualization tool, but it is very basic. You can do so much more with the available data!
I hope that my app will help public health professionals, policymakers, and really anyone to stay informed about the course of SARS-CoV-2 epidemic in Canada.
Some Things to Keep in Mind (Technical Stuff Here)
The app runs on a free version of Shiny Server, which doesn’t allow to serve Shiny applications using SSL/TLS encryption. Hence HTTP connection instead of HTTPS – your browser may say that “connection is not secure”. I am aware that there may be a way to force connection through HTTPS using reverse proxy as described here. However, these instructions are well over 4 years old, and the required library libapache2-mod-proxy-html is no longer available; no replacement for Ubuntu 18.04 either. If you know how to do this on Apache (not Nginx), I’d very much appreciate your advice (please use this contact form to reach me).
The data is downloaded from Canada.ca at 6-hour intervals to minimize the load on the data repository – there is likely a high demand due to the pandemic. This means that there may be a delay of up to six hours from the time Canada.ca update their data to the moment it is updated on my server.
Now that we are ready to start working with Canadian Census data, let’s first briefly address the question why you may need to use it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered every five years in respect of a limited number of questions.
The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country’s largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you’ll have to use the Census.
To illustrate the use of cancensus package, let’s do a small research project. First, in this post we’ll retrieve the following key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada:
Labor force participation rate, employment rate, and unemployment rate.
Percent of workers by work situation: full time vs part time, by gender.
Education levels of people aged 25 to 64, by gender.
The cities (metropolitan areas) that we are going to look at, are: Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We’ll also get these data for Canada as a whole for comparison and to illustrate the retrieval of data at different geographic levels
Next, in the upcoming Part 6 of the “Working with Statistics Canada Data in R” series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks.
Keep in mind that cancensus also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp and sf formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in these series.
So, let’s get started by loading the required packages:
Searching for Data
cancensus retrieves census data with the get_census function. get_census can take a number of arguments, the most important of which are dataset, regions, and vectors, which have no defaults. Thus, in order to be able to retrieve census data, you’ll first need to figure out:
Currently, datasets earlier than 2001 are not available, so if you need to work with the 20th century census data, you won’t be able to retrieve it with cancensus.
Find Census Regions
Next, let’s find the regions that we’ll be getting the data for. To search for census regions, use the search_census_regions function.
Let’s take a look at what region search returns for Toronto. Note that cancensus functions return their output as dataframes, making it is easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)] to see all of it.
# all census levels
search_census_regions(searchterm = "Toronto",
dataset = "CA16")[c(1:5, 8)]
A tibble: 3 x 6
# region name level pop municipal_status PR_UID
1 35535 Toronto CMA 5928040 B 35
2 3520 Toronto CD 2731571 CDR 35
3 3520005 Toronto CSD 2731571 C 35
You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column ‘level’ for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the ‘PR_UID’ column that contains numeric codes for Canada’s provinces and territories. These codes can help you distinguish between different census regions that have same or similar names but are located in different provinces. For an example, run the code above replacing “Toronto” with “Windsor”.
Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level argument, which can take the following values: ‘C’ for Canada (national level), ‘PR’ for province, ‘CMA’ for census metropolitan area, ‘CD’ for census division, ‘CSD’ for census sub-division, or NA:
# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")
Let’s now list census regions that may be relevant for our project:
# explore available census regions
names <- c("Canada", "Calgary", "Halifax",
"Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))
purrr::map_df function applies search_census_regions iteratively to each element of the names vector and returns output as a single dataframe. Note also the ~ . syntax. Think of it as the tilde taking each element of names and passing it as an argument to a place indicated by the dot in the search_census_regions function. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.
Since there are multiple entries for each search term, we’ll need to choose the results for census metropolitan areas, or in case of Whitehorse, for census sub-division, since Whitehorse is too small to be considered a metropolitan area:
# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>%
filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) &
grepl("CMA", level) |
grepl("Canada|Whitehorse$", name)) %>%
Pay attention to the use of logical operators to filter the output by several conditions at once; also note using $ regex meta-character to choose from the ‘names’ column the entry ending with ‘Whitehorse’ (to filter out ‘Whitehorse, Unorganized’.
Finally, as_census_region_list converts list_census_regions output to a data object of type list that can be passed to the get_census function as its regions argument.
Find Census Vectors
Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census function.
cancensus has two functions that allow you to search through census data variables: list_census_vectors and search_census_vectors.
list_census_vectors returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:
# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))
# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))
As you can see, there are 6623 (as of the time of writing this) variables in the 2016 census dataset, so list_census_vectors won’t be the most convenient function to find a specific vector. Note however that there are situations (such as when you need to select a lot of vectors at once), in which list_census_vectors would be appropriate.
Usually it is more convenient to use search_census_vectors to search for vectors. Just pass the text string of what you are looking for as the searchterm argument. You don’t have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term.
Let’s now find census data vectors for labor force involvement rates:
# get census data vectors for labor force involvement rates
search_census_vectors(searchterm = "employment rate",
dataset = "CA16") %>%
union(search_census_vectors("participation rate", "CA16")) %>%
filter(type == "Total") %>%
Let’s take a look at what this code does. Since searchterm doesn’t have to be a precise match, “employment rate” search term retrieves unemployment rate vectors too. In the next line, union merges dataframes returned by search_census_vectors into a single dataframe. Note that in this case union could be substituted with bind_rows. I recommend using union in order to avoid data duplication. Next, we choose only the “Total” numbers, since we are not going to plot labor force indicators by gender. Finally, the pull command extracts a single vector from the dataframe, just like the $ subsetting operator: we need ‘lf_vectors’ to be a data object of type vector in order to pass it to the vectors argument of the get_census function.
There is another way to figure out search terms to put inside the search_census_vectors function: use Statistics Canada online Census Profile tool. It can be used to quickly explore census data as well as to figure out variable names (search terms) and their hierarchical structure.
Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data:
# get census data vectors for full and part time work
# get vectors and labels
search_census_vectors("full year, full time", "CA16") %>%
union(search_census_vectors("part year and/or part time", "CA16")) %>%
filter(type != "Total") %>%
mutate(label = str_remove(label, ".*, |.*and/or ")) %>%
mutate(type = fct_drop(type)) %>%
setNames(c("vector", "gender", "type"))
# extract vectors
work_vectors <- work_vectors_labels$vector
Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no “Total” values here), we create a dataframe that assigns respective labels to each vector number. This work_vectors_labels dataframe will supply categorical labels to be attached to the data retrieved with get_census.
The first mutate call removes all text up to and including ‘, ‘ and ‘and/or ‘ (spaces included) from the ‘label’ column. The second drops unused factor level “Total” – it is a good practice to make sure there are no unused factor levels if you are going to use ggplot2 to plot your data. Finally, setNames renames variables for convenience.
Finally, let’s retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I’d like to draw your attention to the fact that some of the census data is hierarchical, which means that some variables (census vectors) are included into parent and/or include child variables. It is very important to choose vectors at proper hierarchical levels so that you do not double-count or omit your data.
Education data is a good example of hierarchical data. You can explore data hierarchy using parent_census_vectors and child_census_vectors functions as described here. However, you may find exploring the hierarchy visually to be more convenient:
So, let’s now retrieve and label the education data vectors:
Note the slice function that allows to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice to drop the hierarchical levels from the data that are either too general or too granular. Note also that I had to edit text strings in the data. Finally, I added asterisks after “Doctorate” and “health”. These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on.
Now that we have figured out our dataset, regions, and data vectors (and labelled the vectors, too), we are finally ready to retrieve the data.
Retrieve Census Data
To retrieve census data, feed the dataset, regions, and data vectors into get_census as its’ respective arguments. Note also that get_census has use_cache argument (set to TRUE by default), which tells get_census to retrieve data from cache if available. If there is no cached data, the function will query CensusMapper API for the data and will save it in cache, while use_cache = FALSE will force get_census to query the API and update the cache.
# get census data for labor force involvement rates
# feed regions and vectors into get_census()
get_census(dataset = "CA16",
regions = regions,
vectors = lf_vectors) %>%
select(-c(1, 2, 4:7)) %>%
setNames(c("region", "employment rate",
"participation rate")) %>%
mutate(region = str_remove(region, " (.*)")) %>%
pivot_longer("employment rate":"participation rate",
names_to = "indicator",
values_to = "rate") %>%
The select call drops columns with irrelevant data, setNames renames columns to remove vector numbers from variable names, which will be then converted to values in the ‘indicator’ column; str_remove inside the mutate call drops municipal status codes ‘(B)’ and ‘(CY)’ from region names; finally, mutate_if converts characters to factors for subsequent plotting.
An important function here is tidyr::pivot_longer. It converts the dataframe from wide to long format. It takes three columns: ‘employment rate’, ‘unemployment rate’, and ‘participation rate’, and converts their names into values of the ‘indicator’ variable, while their numeric values are passed to the ‘rate’ variable. The reason for conversion is that we are going to plot the data for all three labor force indicators in the same graphic, which makes it necessary to store the indicators as a single factor variable.
Next, let’s retrieve census data about the percent of full time vs part time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender:
# get census data for full time and part time work
get_census(dataset = "CA16",
regions = regions,
vectors = work_vectors) %>%
select(-c(1, 2, 4:7)) %>%
rename(region = "Region Name") %>%
pivot_longer(2:5, names_to = "vector",
values_to = "count") %>%
mutate(region = str_remove(region, " (.*)")) %>%
mutate(vector = str_remove(vector, ":.*")) %>%
left_join(work_vectors_labels, by = "vector") %>%
mutate(gender = str_to_lower(gender)) %>%
# get census data for education levels
get_census(dataset = "CA16",
regions = regions,
vectors = ed_vectors) %>%
select(-c(1, 2, 4:7)) %>%
rename(region = "Region Name") %>%
pivot_longer(2:21, names_to = "vector",
values_to = "count") %>%
mutate(region = str_remove(region, " (.*)")) %>%
mutate(vector = str_remove(vector, ":.*")) %>%
left_join(ed_vectors_labels, by = "vector") %>%
Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join function that joins labels to the census data.
We now have the data and are ready to visualize it, which will be done in the next part of this series.
Annex: Notes and Definitions
For those of you who are outside of Canada, Canada’s geographic regions and their largest metropolitan areas are:
The Atlantic Provinces – Halifax
Central Canada – Toronto
The Prairie Provinces – Calgary
The West Coast – Vancouver
The Northern Territories – Whitehorse
These regions should not be confused with 10 provinces and 3 territories, which are Canada’s sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada’s geographic regions and territorial structure here (pages 44 to 51).
For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada’s Guide to the Labour Force Survey.
You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts.
In the Introduction to the “Working with Statistics Canada Data in R” series, I discussed the three main types of data available from Statistics Canada. It is now time to move on to the second of those data types – the Canadian National Census data.
cancensus is a specialized R package that allows you to retrieve Statistics Canada census data and geography. It follows the tidy approach to data processing. The package’s authors are Jens von Bergmann, Dmitry Shkolnik, and Aaron Jacobs. I am not associated with the authors, I just use this great package in my work on a regular basis. The package’s GitHub code repository can be found here. There is also a tutorial by the package’s authors, which I recommend taking a look at before proceeding.
Further in this series, I will provide an in-depth real-life example of working with Canadian Census data using the cancensus package. But first, let’s install cancensus:
Setting up cancensus: Adding API Key and Cache Path to .Rprofile
cancensus relies on queries to the CensusMapper API, which requires a CensusMapper API key. The key can be obtained for free as per the package’s authors instructions. Once you have the key, you can add it to your R environment:
Note that although the authors are warning that API requests are limited in volume, I have significantly exceeded my API quota on some occasions, and still had no issues with retrieving data.
That said, depending on how much data you need, you can draw down your quota very quickly, and here’s where local cache comes to the rescue. cancensus caches data every time you retrieve it, but by default the cache is not persistent between sessions. To make it persistent, as well as to remove the need to enter your API key every time you use cancensus, you can add both the API key and the cache path to your .Rprofile file.
If you’d like to learn in-depth what .Rprofile is, what it is for, and how to edit it, consider taking a look at sections 2.4.2 to 2.4.5 in “Efficient R Programming” by Colin Gillespie and Robin Lovelace. For a quick and simple edit, just keep reading.
Editing .Rprofile on Linux
First, find your R home directory with R.home(). Most likely, it will be in /usr/lib/R. If that is where your R home is, in the Linux Terminal (not in R), run:
sudo nano /usr/lib/R/library/base/R/.Rprofile # edit path if needed
On some systems, .Rprofile may not be hidden, so if the above command doesn’t open the .Rprofile file, try removing the dot before ‘Rprofile’:
sudo nano /usr/lib/R/library/base/R/Rprofile
(!) Note that this will edit the system .Rprofile file, which will always run on startup and will apply to all your R projects. The file itself will warn you that “it is a bad idea to use this file as a template for personal startup files”. You can safely ignore this warning as long as the only edit you are making is the one shown below, i.e. adding cancensus cache path and API key.
Then hit Ctrl+X, choose Y for “yes”, and press Enter to save changes. When you first retrieve data with cancensus, R will create .cancensus_cache directory for you.
Editing .Rprofile on Windows
Editing .Rprofile on Windows is a bit tricky. The best thing for you would be not to touch Windows system .Rprofile, or else risk weird errors and crashes (honestly, I was not able to figure out where they come from).
Instead, set up a project-specific .Rprofile. The downside is that you may need to set it separately for every project in which you are going to use cancensus. The upside is that the contents of the .Rprofile file should be exactly the same every time. In R or Rstudio (not in the command line), run:
Then, add these two lines to the “options” section, and save the file:
In the previous part of this series, we have retrieved CANSIM data on the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in Saskatchewan, and the CPI data for the same province. We then used the CPI data to adjust the wages for inflation, and saved the results as wages_0370 dataset. To get started, let’s take a quick look at the dataset, what types of variables it contains, which should be considered categorical, and what unique values categorical variables have:
# explore wages_0370 before plotting
The first two variables – year and group – are of the type “character”, and the rest are numeric of the type “double” (“double” simply means they are not integers, i.e. can have decimals).
Also, we can see that wages_0370 dataset is already in the tidy format, which is an important prerequisite for plotting with ggplot2 package. Since ggplot2 is included into tidyverse, there is no need to install it separately.
Preparing Data for Plotting
At this point, our data is almost ready to be plotted, but we need to make one final change. Looking at the unique values, we can see that the first two variables (year and group) should be numeric (integer) and categorical respectively, while the rest are continuous (as they should be).
In R, categorical variables are referred to as factors. It is important to expressly tell R which variables are categorical, because mapping ggplot aesthetics – things that go inside aes() – to a factor variable makes ggplot2 use a discrete colour scale (distinctly different colors) for different categories (different factor levels in R terms). Otherwise, values would be plotted to a gradient, i.e. different hues of the same color. There are several other reasons to make sure you expressly identify categorical variables as factors if you are planning to visualize your data. I understand that this might be a bit too technical, so if you are interested, you can find more here and here. For now, just remember to convert your categorical variables to factors if you are going to plot your data. Ideally, do it always – it is a good practice to follow.
( ! ) It is a good practice to always convert categorical variables to factors.
So, let’s do it: convert year to an integer, and group to a factor. Before doing so, let’s remove the word “population” from “Non-Aboriginal population” category, so that our plot’s legend takes less space inside the plot. We can also replace accented “é” with ordinary “e” to make typing in our IDE easier. Note that the order is important: first we edit the string values of a “character” class variable, and only then convert it to a factor. Otherwise, our factor will have missing levels.
( ! ) Converting a categorical variable to a factor should be the last step in cleaning your dataset.
Let’s now look at what this code does. We start with feeding our data object wages_0370 into ggplot function using pipe.
( ! ) Note that ggplot2 internal syntax differs from tidyverse pipe-based syntax, and uses + instead of %>% to join code into blocks.
Next, inside ggplot(aes()) call we assign common aesthetics for all layers, and then proceed to choosing the geoms we need. If needed, we can assign additional/override common aesthetics inside individual geoms, like we did when we told geom_label to use dollars_2007 variable values (rounded to a dollar) as labels. If you’d like to find out more about what layers are, and about ggplot2 grammar of graphics, I recommend this article by Hadley Wickham.
Choosing Plot Type
Plot type in each layer is determined by geom_* functions. This is where geoms are:
Choosing plot type is largely a judgement call, but you should always make sure to choose the type of graphic that would best suite the data you have. In this case, our goal is to reveal the dynamics of wages in Saskatchewan over time, hence our choice of geom_line. Note that the lines in our graphic are for visual aid only – to make it easier for an eye to follow the trend. They are not substantively meaningful like they would be, for example, in a regression plot. geom_point is also there primarily for visual purposes – to make the plot’s legend more visible. Note that unlike the lines, the points in this plot are substantively meaningful, i.e. they are exactly where our data is (but are covered by labels). If you don’t like the labels in the graphic, you can use points instead.
Finally, geom_label plots our substantive data. Note that I am using show.legend = FALSE argument – this is simply because I don’t like the look of geom_label legend symbols, and prefer a combined line+point symbol instead. If you prefer geom_label symbols in the plot’s legend, remove show.legend = FALSE argument from geom_label call, and add it to geom_line and geom_point.
Preventing Overlaps with ggrepel
You have noticed some commented lines in the ggplot call. You may also have noticed that some labels in our graphic overlap slightly. In this case the overlap is minute and can be ignored. But what if there are a lot of overlapping data points, enough to affect readability?
Fortunately, there is a package to solve this problem for the graphics that use text labels: ggrepel. It has *_repel versions of ggplot2::geom_label and ggplot2::geom_text functions, which repel the labels away from each other and away from the data points.
ggrepel functions can take the same arguments as corresponding ggplot2 functions, and also take the force argument that defines repelling force between overlapping text labels. I recommend setting it to a small value, as the default 1 seems way too strong.
Here is what our graphic looks like now. Note that the nearby labels no longer overlap:
coord_cartesian is the function I’d like to draw your attention to, as it is the best way to zoom the plot, i.e. to get rid of unnecessary empty space. Since we don’t have any values less than 650 or more than 950 (approximately), starting our Y scale at 0 would result in a less readable plot, where most space would be empty, and the space where we actually have data would be crowded. If you are interested in why coord_cartesian is the best way to set axis limits, there is an in-depth explanation.
( ! ) It is a good practice to use coord_cartesian to change axis limits.
First I selected a simple black-and-white theme theme_bw, and then overrode some of the theme’s default settings in order to improve the plot’s readability and overall appearance. Which theme and settings to use is up to you, just make sure that whatever you do makes the plot easier to read and comprehend at a glance. Here you can find out more about editing plot theme.
Finally, we enter the plot title and plot captions. Captions are used to provide information about the sources of our data. Note the use of \n (new line symbol) to break strings into multiple lines:
labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")
Saving Your Plot
The last step is to save the plot so that we can use it externally: insert into reports and other publications, publish online, etc.
# save plot
ggsave takes various arguments, but only one is mandatory: file name as a string. The second argument plot defaults to the last plot displayed, but it is advisable to name the plot expressly to make sure the right one gets saved. You can find out more about how ggsave works here.
My favourite format to save graphics is SVG, which is being used in the code example above. SVG stands for Scalable Vector Graphics – an extremely lightweight vectorized format that ensures the graphic stays pixel-perfect at any resolution. Note that SVG is not really a pixel image like JPEG or PNG, but a bunch of XML code, which entails certain security implications when using SVG files online.
ggsave is dependent on the settings (e.g. aspect ratio, size, etc.) of your system’s graphics device, which can be inconsistent between sessions. If you notice inconsistencies in how the saved graphics look, consider using the export package – a specialized package for saving graphics and statistical output made in R. Its functions offer more customizable saving options, more formats (file types), and with the right settings ensure consistent output between sessions.
This was the last of the three articles about working with CANSIM data. In the next article in the “Working with Statistics Canada Data in R” series, I’ll move on to working with the national census data.
CANSIM data is stored in tables containing data on a specific subject. Individual entries (rows) and groups of entries in these tables are usually assigned vector codes. Thus, unless we already know what table or vector we need, finding the correct table should be our first step.
As always, start with loading packages:
Now we can start retrieving CANSIM data. How we do this, depends on whether we already know the table or vector numbers. If we do, things are simple: just use get_cansim to retrieve data tables, or get_cansim_vector to retrieve vectors.
But usually we don’t. One option is to use StatCan’s online search tool. Eventually you will find what you’ve been looking for, although you might also miss a few things – CANSIM is not that easy to search and work with manually.
A much better option is to let R do the tedious work for you with a few lines of code.
Searching by Index
In this example (based on the code I wrote for an actual research project), let’s look for CANSIM tables that refer to Aboriginal (Indigenous) Canadians anywhere in the tables’ titles, descriptions, keywords, notes, etc.:
# create an index to subset list_cansim_tables() output
index <- list_cansim_tables() %>%
Map(grepl, "(?i)aboriginal|(?i)indigenous", .) %>%
Reduce("|", .) %>%
# list all tables with Aboriginal data, drop redundant cols
tables <- list_cansim_tables()[index, ] %>%
select(c("title", "keywords", "notes", "subject",
Let’s look in detail at what this code does. First, we call the list_cansim_tables function, which returns a tibble dataframe, where each row provides a description of one CANSIM table. To get a better idea of list_cansim_tables output, run:
Then we search through the dataframe for Regex patterns matching our keywords. Note the (?i) flag – it tells Regex to ignore case when searching for patterns; alternatively, you can pass ignore.case = TRUE argument to grepl. The Map function allows to search for patterns in every column of the dataframe returned by list_cansim_tables. This step returns a very long list of logical values.
Our next step is to Reduce the list to a logical vector, where each value is either FALSE if there was not a single search term match per CANSIM table description (i.e. per row of list_cansim_tables output), or TRUE if there were one or more matches. The which function gives us the numeric indices of TRUE elements in the vector.
Finally, we subset the list_cansim_tables output by index. Since there are many redundant columns in the resulting dataframe, we select only the ones that contain potentially useful information.
Searching with dplyr::filter
There is also a simpler approach, which immediately returns a dataframe of tables:
However, keep in mind that this code would search only through the column or columns of the list_cansim_tables output, which were specified inside the grepl call (in this case, in the title column). This results in fewer tables listed in tables: 60 vs 73 you get if you search by index (as of the time of writing this). Often simple filtering would be sufficient, but if you want to be extra certain you haven’t missed anything, search by index as shown above.
( ! ) Note that some CANSIM data tables do not get updated after the initial release, so always check the date_published and the time_period_coverage_* attributes of the tables you are working with.
Saving Search Results
Finally, it could be a good idea to externally save the dataframe with the descriptions of CANSIM data tables in order to be able to view it as a spreadsheet. Before saving, let’s re-arrange the columns in a more logical order for viewers’ convenience, and sort the dataframe by CANSIM table number.
# re-arrange columns for viewing convenience
tables <- tables[c("cansim_table_number", "title", "subject",
"keywords", "notes", "date_published",
"time_period_coverage_end", "url_en")] %>%
# and save externally
write_delim(tables, "selected_data_tables.txt", delim = "|")
( ! ) Note that I am using write_delim function instead of a standard write.csv or tidyverse::write_csv, with | as a delimiter. I am doing this because there are many commas inside strings in CANSIM data, and saving as a comma-separated file would cause incorrect breaking down into columns in a spreadsheet software.
Now, finding the data tables can be as simple as looking through the tables dataframe or through the selected_data_tables.txt file.
Retrieving Data Tables
In order for the examples here to feel relevant and practical, let’s suppose we were asked to compare and visualize the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in a specific province (let’s say, Saskatchewan), adjusted for inflation.
Since we have already searched for all CANSIM tables that contain data about Aboriginal Canadians, we can easily figure out that we need CANSIM table #14-10-0370. Let’s retrieve it:
( ! ) Before we proceed further, take a look at the VECTOR variable – this is how we can find out individual vector codes for specific CANSIM data. More on that below.
Let’s now subset the data by province, drop redundant variables, and rename the remaining in a way that makes them easier to process in the R language. Generally, I suggest following The tidyverse Style Guide by Hadley Wickham. For instance, variable names should use only lowercase letters, numbers, and underscores instead of spaces:
Next, let’s explore the dataset again. Specifically, let’s see what unique values categorical variables year, group, type, age have. No need to do this with current_dollars, as all or almost all of its values would inevitably be unique due to it being a continuous variable.
The output looks as follows:
$year  “2007” “2008” “2009” “2010” “2011” “2012” “2013” “2014” “2015” “2016” “2017” “2018” $group  “Total population” “Aboriginal population” “First Nations” “Métis” “Non-Aboriginal population” $type  “Total employees” “Average hourly wage rate” “Average weekly wage rate” “Average usual weekly hours” $age  “15 years and over” “15 to 64 years” “15 to 24 years” “25 years and over” “25 to 54 years”
Now we can decide how to further subset the data.
We obviously need the data for as many years as we can get, so we keep all the years from the year variable.
For the group variable, we need Aboriginal and Non-Aboriginal data, but the “Aboriginal” category has two sub-categories: “First Nations” and “Métis”. It is our judgement call which to go with. Let’s say we want our data to be more granular and choose “First Nations”, “Métis”, and “Non-Aboriginal population”.
As far as the type variable is concerned, things are simple: we are only interested in the weekly wages, i.e. “Average weekly wage rate”. Note that we are using the data on average wages because for some reason CANSIM doesn’t provide the data on median wages for Aboriginal Canadians. Using average wages is not a commonly accepted way of analyzing wages, as it allows a small number of people with very high-paying jobs to distort the data, making wages look higher than they actually are. But well, one can only work with the data one has.
Finally, we need only one age group: “25 years and over”.
Having made these decisions, we can subset the data. We also drop two categorical variables (type and age) we no longer need, as both these variables would now have only one level each:
Of course, what I just did step-by-step for demonstration purposes, can be done with a single block of code to minimize typing and remove unnecessary repetition. The “pipe” operator %>% makes this super-easy. In R-Studio, you can use Shift+Ctrl+M shortcut to insert %>%.
Now that we have our weekly wages data, let’s adjust the wages for inflation, otherwise the data simply won’t be meaningful. In order to be able to do this, we need to get the information about the annual changes in the Consumer Price Index (CPI) in Canada, since the annual change in CPI is used as a measure of inflation. Let’s take a look at what CANSIM has on the subject:
# list tables with CPI data, exclude the US
cpi_tables <- list_cansim_tables() %>%
filter(grepl("Consumer Price Index", title) &
!grepl("United States", title))
Even when we search using filter instead of indexing, and remove the U.S. data from search results, we still get a list of 20 CANSIM tables, with multiple vectors in each. How do we choose the correct data vector? There are two main ways we can approach this.
First, we can use other sources to find out exactly what vectors to use. For example, we can take a look at how the Bank of Canada calculates inflation. According to the Bank of Canada’s “Inflation Calculator” web page, they use CANSIM vector v41690973 (Monthly CPI Indexes for Canada) to calculate inflation rates. So we can go ahead and retrieve this vector:
Since the data in the wages_0370 dataset covers the period from 2007 till 2018, we retrieve CPI data for the same period. The function takes two main arguments: vectors – the list of vector numbers (as strings), and start_time – starting date as a string in YYYY-MM-DD format. Since we don’t need data past 2018, we also add an optional end_time argument, which takes a string in the same format as start_time. Let’s take a look at the result of our get_cansim_vector call:
The resulting dataset contains monthly CPI indexes (take a look at the REF_DATE variable). However, our wages_0370 dataset only has the annual data on wages. What shall we do?
Well, one option could be to calculate annual CPI averages ourselves:
Alternatively, we could look through CANSIM tables to find annual CPI values that have already been calculated by Statistics Canada.
Thus, the second way to find which vectors to use, is by looking through the relevant CANSIM tables. This might be more labor-intensive, but can lead to more precise results. Also, we can do this if we can’t find vector numbers from other sources.
Let’s look at cpi_tables. Table # 18-10-0005 has “Consumer Price Index, annual average” in its title, so this is probably what we need.
# get CANSIM table with annual CPI values
cpi_annual_table <- get_cansim("18-10-0005")
Let’s now explore the data:
# explore the data
Turns out, the data is much more detailed than in the vector v41690973! Remember that in wages_0370 we selected the data for a specific province (Saskatchewan)? Well, table # 18-10-0005 has CPI breakdown by province and even by specific groups of products. This is just what we need! However, if you run unique(cpi_annual_table$VECTOR), you’ll see that table # 18-10-0005 includes data from over 2000 different vectors – it is a large dataset. So, how do we choose the right vector? By narrowing down the search:
# find out vector number from CANSIM table
rename(product = "Products and product groups") %>%
filter(GEO == "Saskatchewan" &
product == "All-items") %>%
This gives us CANSIM vector number for the “all items” group CPI for the province of Saskatchewan: v41694489.
Using StatCan Data Search Tool to Find Vectors
Some CANSIM tables are too large to be retrieved using cansim package. For example, running get_cansim(“12-10-0136”) will result in a long wait followed by “Problem downloading data, multiple timeouts” error. cansim will also advise you to “check your network connection”, but network connection is not the problem, it is the size of the dataset.
CANSIM table 12-10-0136 is very large: 16.2 GB. By default, R loads the full dataset into RAM, which can make things painfully slow when dealing with huge datasets. There are solutions for datasets <10 GB in size, but anything larger than 10 GB requires distributed computing. In practice, in R you are likely to start experiencing difficulties and slowdowns if your datasets exceed 1 GB.
Suppose we need to know how much wheat (in dollars) Canada exports to all other countries. CANSIM table 12-10-0136 “Canadian international merchandise trade by industry for all countries” has this data. But how do we get the data from this table if we can’t directly read it into R, and even if we manually download and unzip the dataset, R won’t be able to handle 16.2 GB of data?
This is where CANSIM data vectors come to the rescue. We need to get only one vector instead of the whole enormous table. To do that, we need the vector number, but we can’t look for the vector inside the table because the table is too large.
The solution is to find the correct vector using Statistics Canada Data search tool. Start with entering the table number in the “Keyword(s)” field. Obviously, you can search by keywords too, but search by table number is more precise:
Then click on the name of the table: “Canadian international merchandise trade by industry for all countries”. After the table preview opens, click “Add/Remove data”:
The “Customize table (Add/Remove data)” menu will open. It has the following tabs: “Geography”, “Trade”, “Trading Partners”, “North American Industry Classification System (NAICS)”, “Reference Period”, and “Customize Layout”. Note that the selection of tabs depends on the data in the table.
Now, do the following:
In the “Geography” tab, do nothing – just make sure “Canada” is checked.
In the “Trade” tab, uncheck “Imports”, since we are looking for the exports data.
In the “Trading Partners”, check “Total of all countries” and uncheck everything else.
Skip “Reference period” for now.
In “Customize Layout”, check “Display Vector Identifier and Coordinate”.
Finally, in the “North American Industry Classification System (NAICS)” tab, uncheck “Total of all industries”, find our product of interest – wheat – and check it. Then click “Apply”.
If you followed all the steps, here’s what your output should look like:
The vector number for Canada’s wheat exports to all other countries is v1063958702.
Did you notice that the output on screen has data only for a few months in 2019? This is just a sample of what our vector has. If you click “Reference period”, you’ll see that the table 12-10-0136 has data for the period from January 2002 to October 2019:
Now we can retrieve the data we’ve been looking for:
# get and clean wheat exports data
get_cansim_vector("v1063958702", "2002-01-01", "2019-10-31") %>%
mutate(ref_date = lubridate::ymd(REF_DATE)) %>%
rename(dollars = VALUE) %>%
# check object size
The resulting wheat_exports data object is only 4640 bytes in size: about 3,500,000 times smaller than the table it came from!
Note that I used lubridate::ymd function inside the mutate() call. This is not strictly required, but wheat_exports contains a time series, so it makes sense to convert the reference date column to an object of class “Date”. Since lubridate is not loaded with tidyverse (it is part of the tidyverse ecosystem, but only core components are loaded by default), I had to call the ymd function with lubridate::.
Finally, note that StatCan Data has another search tool that allows you to search by vector numbers. Unlike the data tables search tool, the vector search tool is very simple, so I’ll not be covering it in detail. You can find it here.
Cleaning Vector Data
Let’s now get provincial annual CPI data and clean it up a bit, removing all the redundant stuff and changing VALUE variable name to something in line with The tidyverse Style Guide:
# get mean annual CPI for Saskatchewan, clean up data
get_cansim_vector("v41694489", "2007-01-01", "2018-12-31") %>%
mutate(year = str_remove(REF_DATE, "-01-01")) %>%
rename(cpi = VALUE) %>%
As usual, you can directly feed the output of one code block into another with the %>% (“pipe”) operator:
By this point you may be wondering if there is an inverse operation, i.e. if we can find CANSIM table number if all we know is a vector number? We sure can! This is what get_cansim_vector_info function is for. And we can retrieve the table, too:
# find out CANSIM table number if you know a vector number
# get table if you know a vector number
cpi_table <- get_cansim_vector_info("v41690973")$table %>%
Joining Data (and Adjusting for Inflation)
Now that we have both weekly wages data (wages_0370) and CPI data (cpi_sk), we can calculate the inflation rate and adjust the wages for inflation. The formula for calculating the inflation rate for the period from the base year to year X is: (CPI in year X – CPI in base year) / CPI in base year
If we wanted the inflation rate to be expressed as a percentage, we would have multiplied the result by 100, but here it is more convenient to have inflation expressed as a proportion:
Now join the resulting dataset to wages_0370 with dplyr::left_join. Then use the inflation rates data to adjust wages for inflation with the following formula: base year $ = current year $ / (1 + inflation rate).
# join inflation rate data to wages_0370; adjust wages for inflation
wages_0370 <- wages_0370 %>%
left_join(cpi_sk, by = "year") %>%
mutate(dollars_2007 = round(current_dollars / (1 + infrate), 2))
Yesterday was a GIS Day! Congratulations to everyone who loves maps, GIS, and working with spatial data! And as you may have guessed from my choice of this blog’s header image, I am one of those people. So, to mark the day, here are some links to great sources of geospatial data and other GIS-related stuff that you may find useful:
GADM: Global Administrative Areas Database. GADM goal is to map the administrative areas of all countries, at all levels of sub-division at high spatial resolution. Very R-friendly: stores data as .rds files formatted for sf and sp packages, in Geopackage (.gpkg) format, as shapefiles (.shp), and as KMZ files.
Canadian GIS: a collection of Canadian GIS and geospatial resources.
CANSIM stands for Canadian Socioeconomic Information Management System. Not long ago it was renamed to “Statistics Canada Data,” but here I’ll be using the repository’s legacy name to avoid confusion with other kinds of data available from Statistics Canada. At the time of writing this, there were over 8,700 data tables in CANSIM, with new tables being added virtually every day. It is the most current source of publicly available socioeconomic data collected by the Government of Canada.
Although CANSIM can be directly accessed online, a much more convenient and reproducible way to access it, is through Statistics Canada API using the excellent cansim package developed by Dmitry Shkolnik and Jens von Bergmann. This package allows to search for relevant data tables quickly and precisely, and to read data directly into R without the intermediate steps of manually downloading and unpacking multiple archives, followed by reading each dataset separately into statistical software.
So why do we have to use tidyverse in addition to cansim (apart from the fact that tidyverse is probably the best data processing software in existence)? Well, data in Statistics Canada CANSIM repository has certain bugs features that one needs to be aware of, and which can best be fixed with the instruments included in tidyverse:
First, it is not always easy to find and retrieve the data manually. After all, you’ll have to search through thousands of data tables looking for the data you need. Online search tool often produces results many pages long, which are not well sorted by relevance (or at least that is my impression).
Second, StatCan data is not in the tidy format, and usually needs to be transformed as such, which is important for convenience and preventing errors, as well as for plotting data.
Third, don’t expect the main principles of organizing data into datasets to be observed. For example, multiple variables can be presented as values in the same column of a dataset, with corresponding values stored in the other column, instead of each variable assigned to its own column with values stored in that column (as it should be). If this sounds confusing, the code snippet below will give you a clear example.
Next, the datasets contain multiple irrelevant variables (columns), that result form how Statistics Canada processes and structures data, and do not have much substantive information.
Finally, CANSIM datasets contain a lot of text, i.e. strings, which often are unnecessarily long and cumbersome, and are sometimes prone to typos. Moreover, numeric variables are often incorrectly stored as class “character”.
If you’d like an example of a dataset exhibiting most of these issues, let’s looks at the responses from unemployed Aboriginal Canadians about why they experience difficulties in finding a job. To reduce the size of the dataset, let’s limit it to one province. Run line-by-line:
jobdif_0014 <- get_cansim("41-10-0014") %>%
filter(GEO == "Saskatchewan")
# Examine the dataset - lots of redundant variables.
# All variables except VALUE are of class "character",
# although some contain numbers, not text.
# Column "Statistics" contains variables’ names instead of values,
# while corresponding values are in a totally different column.
Before You Start
This is an optional step. You can skip it, although if you are planning to use cansim often, you probably shouldn’t.
Before you start working with the package, cansim authors recommend to set up cansim cache to substantially increase the speed of repeated data retrieval from StatCan repositories and to minimize web scraping. cansim cache is not persistent between sessions, so do this either at the start of each session, or set the cache path permanently in your Rprofile file (more on editing Rprofile in Part 4 of these series).
Run (or add to Rprofile):
options(cansim.cache_path = "your cache path")
If your code is going to be executed on different machines, keep in mind that Linux and Windows paths to cache will not be the same:
# Linux path example:
options(cansim.cache_path = "/home/username/Documents/R/.cansim_cache")
# Windows path example:
options(cansim.cache_path = "C:\Users\username\Documents\R\cansim_cache")
This is the Introduction to the series on working with Statistics Canada data in the R language. The goal of the series is to provide some examples (accompanied by detailed in-depth explanations) of working with Statistics Canada data in R. Besides, I’d love to see more economists, policy analysts, and social scientists using R in their work, so I’ll be doing my best to make this easy for people without STEM degrees.
Statistics Canada data is routinely used for economic and policy analysis, as well as for social science research, journalism, and many other applications. It is expected that the reader has some basic R skills.
For the purposes of this series, let’s assume that there are three main types of StatCan data:
Statistics Canada Data, previously known as Canadian Socio-economic Information Management System (CANSIM),
National census data, and
Geographic data provided in a multitude of formats that can be used by GIS software: ArcGIS shapefiles (.shp), Geography Markup Language files (.gml), MapInfo files (.tab), etc.
The “Working with Statistics Canada Data in R” series will follow these data types, and will consist of this Introduction and the following parts: parts 1, 2, and 3 about working with CANSIM data; parts 4, 5, and 6 about Canadian Census data; and parts 7 and 8 about working with StatCan geospatial data.
This is not an official classification of data types available from Statistics Canada. The classification into CANSIM, census, and geographic data is for convenience only, and is loosely based on the key tools used for StatCan data retrieval and processing in R.
The Tools You Need
To be more specific, cansim is the package designed to retrieve CANSIM data, and cancensus is the package to get census data. Further data processing will be done with the tidyverse meta-package (a collection of packages that is itself a package) which is some of the most powerful data manipulation software currently available. GIS data is a more complex matter, but at the very minimum you will need sf, tmap, and units packages. Obviously, just as the R language, all these are completely free and open source. I am not in any way associated with the authors of any of the above packages, I just use them a lot in my work.
Note that although CANSIM has been recently renamed to Statistics Canada Data, I will be using the historic name CANSIM throughout this series in order to distinguish the data obtained from Statistics Canada Data proper from other kinds of StatCan data, i.e. census and geographic data (see how confusing this can get?).
Finally, here’s the code that installs the minimum suite of packages required to run the examples from this series. Note that you might be unable to install sf and units right now, since they have system requirements such as certain libraries being installed, which don’t usually come available “out of the box”. More on sf and units installation in the upcoming “Working with Statistics Canada Geospatial Data” post.