CANSIM stands for Canadian Socioeconomic Information Management System. Not long ago it was renamed to “Statistics Canada Data,” but here I’ll be using the repository’s legacy name to avoid confusion with other kinds of data available from Statistics Canada. At the time of writing this, there were over 8,700 data tables in CANSIM, with new tables being added virtually every day. It is the most current source of publicly available socioeconomic data collected by the Government of Canada.
Although CANSIM can be accessed directly online, a much more convenient and reproducible way to access it is through the Census Mapper API using the excellent cansim package developed by Dmitry Shkolnik and Jens von Bergmann. This package lets you search for relevant data tables quickly and precisely, and read data directly into R without the intermediate steps of manually downloading and unpacking multiple archives and then reading each dataset separately into statistical software.
To install and load cansim, use the standard install.packages("cansim") and library(cansim) calls. Let’s also install and load tidyverse the same way, as we’ll be using it a lot.
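A minimal sketch of that setup, assuming both packages are installed from CRAN and that a CRAN mirror is already configured. The check-before-install loop is my own convenience, not something the cansim authors prescribe:

```r
# Install each package only if it is not already available
# (assumption: installing from CRAN with a mirror already configured)
for (pkg in c("cansim", "tidyverse")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Attach both packages for the current session
library(cansim)
library(tidyverse)
```

Checking with requireNamespace() first keeps the script safe to re-run without reinstalling anything.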
So why do we have to use tidyverse in addition to cansim (apart from the fact that tidyverse is probably the best data processing software in existence)? Well, data in the Statistics Canada CANSIM repository has certain bugs, er, features that one needs to be aware of, and which are best fixed with the tools included in tidyverse.
First, it is not always easy to find and retrieve the data manually. After all, you’ll have to search through thousands of data tables looking for the data you need. The online search tool often produces results many pages long, which are not well sorted by relevance (or at least that is my impression).
Second, StatCan data is not in the tidy format and usually needs to be transformed into it, which matters for convenience, error prevention, and plotting.
Third, don’t expect the main principles of organizing data into datasets to be observed. For example, multiple variables can be presented as values in the same column of a dataset, with corresponding values stored in another column, instead of each variable being assigned its own column with values stored in that column (as it should be). If this sounds confusing, the code snippet below will give you a clear example.
Next, the datasets contain multiple irrelevant variables (columns) that result from how Statistics Canada processes and structures data, and that carry little substantive information.
Finally, CANSIM datasets contain a lot of text, i.e. strings, which are often unnecessarily long and cumbersome, and are sometimes prone to typos. Moreover, numeric variables are often incorrectly stored as class “character”.
If you’d like an example of a dataset exhibiting most of these issues, let’s look at the responses from unemployed Aboriginal Canadians about why they experience difficulties in finding a job. To reduce the size of the dataset, let’s limit it to one province. Run line-by-line:
# Load packages (skip if already loaded)
library(cansim)
library(tidyverse)
# Get data
jobdif_0014 <- get_cansim("41-10-0014") %>%
  filter(GEO == "Saskatchewan")
# Examine the dataset - lots of redundant variables.
glimpse(jobdif_0014, width = 120)
# All variables except VALUE are of class "character",
# although some contain numbers, not text.
# Column "Statistics" contains variables’ names instead of values,
# while corresponding values are in a totally different column.
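To see the “variable names stored as values” problem and its fix in isolation, here is a small sketch on made-up data. The tibble below is hypothetical: only the column names Statistics and VALUE come from the dataset described above, while REF_DATE, Reason, and all the numbers are invented for illustration. It assumes tidyr 1.0.0 or later for pivot_wider().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking the layout described above:
# variable names sit in "Statistics", their values in "VALUE",
# and the year is stored as text rather than as a number
toy <- tibble(
  REF_DATE   = c("2017", "2017", "2017", "2017"),
  Reason     = c("Shortage of jobs", "Shortage of jobs",
                 "Lack of experience", "Lack of experience"),
  Statistics = c("Number of persons", "Percent",
                 "Number of persons", "Percent"),
  VALUE      = c(1200, 60, 800, 40)
)

toy_tidy <- toy %>%
  # Convert the year from character to integer
  mutate(REF_DATE = as.integer(REF_DATE)) %>%
  # Give each variable named in "Statistics" its own column
  pivot_wider(names_from = Statistics, values_from = VALUE) %>%
  # Replace long column names with short ones
  rename(persons = `Number of persons`, percent = Percent)

toy_tidy
```

After the transformation each row describes one reason, and each variable (persons, percent) has a column of its own, which is the tidy layout that plotting and modelling functions in tidyverse expect.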
This is an optional step. You can skip it, although if you are planning to use cansim often, you probably shouldn’t.
Before you start working with the package, cansim authors recommend setting up the cansim cache to substantially increase the speed of repeated data retrieval from StatCan repositories and to minimize web scraping.
The cansim cache is not persistent between sessions, so do this either at the start of each session, or set the cache path permanently in your .Rprofile file (more on editing .Rprofile in Part 4 of this series).
Run (or add to .Rprofile):
options(cansim.cache_path = "your cache path")
If your code is going to be executed on different machines, keep in mind that Linux and Windows paths to cache will not be the same:
# Linux path example:
options(cansim.cache_path = "/home/username/Documents/R/.cansim_cache")
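# Windows path example (use forward slashes or doubled backslashes;
# a single backslash starts an escape sequence in an R string):
options(cansim.cache_path = "C:/Users/username/Documents/R/cansim_cache")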
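If the same script has to run on both systems, one way to avoid hard-coding two paths is to build the path from the user’s home directory. This is only a sketch: the helper set_cansim_cache() and the folder name .cansim_cache are my own hypothetical choices, not part of the cansim package. The only package-specific piece is the cansim.cache_path option shown above.

```r
# Hypothetical helper (not part of cansim): builds an OS-independent
# cache path under `base`, creates the folder if needed, and sets the option
set_cansim_cache <- function(base = path.expand("~")) {
  cache_dir <- file.path(base, ".cansim_cache")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  options(cansim.cache_path = cache_dir)
  invisible(cache_dir)
}

# Use the home directory as R reports it on the current machine
set_cansim_cache()
getOption("cansim.cache_path")
```

file.path() joins path components with forward slashes, which R accepts on Windows as well as on Linux, so the same call works on either system.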