In the Introduction to the “Working with Statistics Canada Data in R” series, I discussed the three main types of data available from Statistics Canada. It is now time to move on to the second of those data types – the Canadian National Census data.
cancensus is a specialized R package that allows you to retrieve Statistics Canada census data and geography. It follows the tidy approach to data processing. The package’s authors are Jens von Bergmann, Dmitry Shkolnik, and Aaron Jacobs. I am not associated with the authors, I just use this great package in my work on a regular basis. The package’s GitHub code repository can be found here. There is also a tutorial by the package’s authors, which I recommend taking a look at before proceeding.
Further in this series, I will provide an in-depth real-life example of working with Canadian Census data using the cancensus package. But first, let’s install cancensus:
Setting up cancensus: Adding API Key and Cache Path to .Rprofile
cancensus relies on queries to the CensusMapper API, which requires a CensusMapper API key. The key can be obtained for free as per the package’s authors instructions. Once you have the key, you can add it to your R environment:
Note that although the authors are warning that API requests are limited in volume, I have significantly exceeded my API quota on some occasions, and still had no issues with retrieving data.
That said, depending on how much data you need, you can draw down your quota very quickly, and here’s where local cache comes to the rescue. cancensus caches data every time you retrieve it, but by default the cache is not persistent between sessions. To make it persistent, as well as to remove the need to enter your API key every time you use cancensus, you can add both the API key and the cache path to your .Rprofile file.
If you’d like to learn in-depth what .Rprofile is, what it is for, and how to edit it, consider taking a look at sections 2.4.2 to 2.4.5 in “Efficient R Programming” by Colin Gillespie and Robin Lovelace. For a quick and simple edit, just keep reading.
Editing .Rprofile on Linux
First, find your R home directory with R.home(). Most likely, it will be in /usr/lib/R. If that is where your R home is, in the Linux Terminal (not in R), run:
sudo nano /usr/lib/R/library/base/R/.Rprofile # edit path if needed
On some systems, .Rprofile may not be hidden, so if the above command doesn’t open the .Rprofile file, try removing the dot before ‘Rprofile’:
sudo nano /usr/lib/R/library/base/R/Rprofile
(!) Note that this will edit the system .Rprofile file, which will always run on startup and will apply to all your R projects. The file itself will warn you that “it is a bad idea to use this file as a template for personal startup files”. You can safely ignore this warning as long as the only edit you are making is the one shown below, i.e. adding cancensus cache path and API key.
Then hit Ctrl+X, choose Y for “yes”, and press Enter to save changes. When you first retrieve data with cancensus, R will create .cancensus_cache directory for you.
Editing .Rprofile on Windows
Editing .Rprofile on Windows is a bit tricky. The best thing for you would be not to touch Windows system .Rprofile, or else risk weird errors and crashes (honestly, I was not able to figure out where they come from).
Instead, set up a project-specific .Rprofile. The downside is that you may need to set it separately for every project in which you are going to use cancensus. The upside is that the contents of the .Rprofile file should be exactly the same every time. In R or Rstudio (not in the command line), run:
Then, add these two lines to the “options” section, and save the file:
In the previous part of this series, we have retrieved CANSIM data on the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in Saskatchewan, and the CPI data for the same province. We then used the CPI data to adjust the wages for inflation, and saved the results as wages_0370 dataset. To get started, let’s take a quick look at the dataset, what types of variables it contains, which should be considered categorical, and what unique values categorical variables have:
# explore wages_0370 before plotting
The first two variables – year and group – are of the type “character”, and the rest are numeric of the type “double” (“double” simply means they are not integers, i.e. can have decimals).
Also, we can see that wages_0370 dataset is already in the tidy format, which is an important prerequisite for plotting with ggplot2 package. Since ggplot2 is included into tidyverse, there is no need to install it separately.
Preparing Data for Plotting
At this point, our data is almost ready to be plotted, but we need to make one final change. Looking at the unique values, we can see that the first two variables (year and group) should be numeric (integer) and categorical respectively, while the rest are continuous (as they should be).
In R, categorical variables are referred to as factors. It is important to expressly tell R which variables are categorical, because mapping ggplot aesthetics – things that go inside aes() – to a factor variable makes ggplot2 use a discrete colour scale (distinctly different colors) for different categories (different factor levels in R terms). Otherwise, values would be plotted to a gradient, i.e. different hues of the same color. There are several other reasons to make sure you expressly identify categorical variables as factors if you are planning to visualize your data. I understand that this might be a bit too technical, so if you are interested, you can find more here and here. For now, just remember to convert your categorical variables to factors if you are going to plot your data. Ideally, do it always – it is a good practice to follow.
( ! ) It is a good practice to always convert categorical variables to factors.
So, let’s do it: convert year to an integer, and group to a factor. Before doing so, let’s remove the word “population” from “Non-Aboriginal population” category, so that our plot’s legend takes less space inside the plot. We can also replace accented “é” with ordinary “e” to make typing in our IDE easier. Note that the order is important: first we edit the string values of a “character” class variable, and only then convert it to a factor. Otherwise, our factor will have missing levels.
( ! ) Converting a categorical variable to a factor should be the last step in cleaning your dataset.
Let’s now look at what this code does. We start with feeding our data object wages_0370 into ggplot function using pipe.
( ! ) Note that ggplot2 internal syntax differs from tidyverse pipe-based syntax, and uses + instead of %>% to join code into blocks.
Next, inside ggplot(aes()) call we assign common aesthetics for all layers, and then proceed to choosing the geoms we need. If needed, we can assign additional/override common aesthetics inside individual geoms, like we did when we told geom_label to use dollars_2007 variable values (rounded to a dollar) as labels. If you’d like to find out more about what layers are, and about ggplot2 grammar of graphics, I recommend this article by Hadley Wickham.
Choosing Plot Type
Plot type in each layer is determined by geom_* functions. This is where geoms are:
Choosing plot type is largely a judgement call, but you should always make sure to choose the type of graphic that would best suite the data you have. In this case, our goal is to reveal the dynamics of wages in Saskatchewan over time, hence our choice of geom_line. Note that the lines in our graphic are for visual aid only – to make it easier for an eye to follow the trend. They are not substantively meaningful like they would be, for example, in a regression plot. geom_point is also there primarily for visual purposes – to make the plot’s legend more visible. Note that unlike the lines, the points in this plot are substantively meaningful, i.e. they are exactly where our data is (but are covered by labels). If you don’t like the labels in the graphic, you can use points instead.
Finally, geom_label plots our substantive data. Note that I am using show.legend = FALSE argument – this is simply because I don’t like the look of geom_label legend symbols, and prefer a combined line+point symbol instead. If you prefer geom_label symbols in the plot’s legend, remove show.legend = FALSE argument from geom_label call, and add it to geom_line and geom_point.
Preventing Overlaps with ggrepel
You have noticed some commented lines in the ggplot call. You may also have noticed that some labels in our graphic overlap slightly. In this case the overlap is minute and can be ignored. But what if there are a lot of overlapping data points, enough to affect readability?
Fortunately, there is a package to solve this problem for the graphics that use text labels: ggprepel. It has *_repel versions of ggplot2::geom_label and ggplot2::geom_text functions, which repel the labels away from each other and away from the data points.
ggprepel functions can take the same arguments as corresponding ggplot2 functions, and also take the force argument that defines repelling force between overlapping text labels. I recommend setting it to a small value, as the default 1 seems way too strong.
Here is what our graphic looks like now. Note that the nearby labels no longer overlap:
coord_cartesian is the function I’d like to draw your attention to, as it is the best way to zoom the plot, i.e. to get rid of unnecessary empty space. Since we don’t have any values less than 650 or more than 950 (approximately), starting our Y scale at 0 would result in a less readable plot, where most space would be empty, and the space where we actually have data would be crowded. If you are interested in why coord_cartesian is the best way to set axis limits, there is an in-depth explanation.
( ! ) It is a good practice to use coord_cartesian to change axis limits.
First I selected a simple black-and-white theme theme_bw, and then overrode some of the theme’s default settings in order to improve the plot’s readability and overall appearance. Which theme and settings to use is up to you, just make sure that whatever you do makes the plot easier to read and comprehend at a glance. Here you can find out more about editing plot theme.
Finally, we enter the plot title and plot captions. Captions are used to provide information about the sources of our data. Note the use of \n (new line symbol) to break strings into multiple lines:
labs(title = "Average Weekly Wages, Adjusted for Inflation,\nby Aboriginal Group, 25 Years and Older",
caption = "Wages data: Statistics Canada Data Table 14-10-0370\nInflation data: Statistics Canada Data Vector v41694489")
Saving Your Plot
The last step is to save the plot so that we can use it externally: insert into reports and other publications, publish online, etc.
# save plot
ggsave takes various arguments, but only one is mandatory: file name as a string. The second argument plot defaults to the last plot displayed, but it is advisable to name the plot expressly to make sure the right one gets saved. You can find out more about how ggsave works here.
My favourite format to save graphics is SVG, which is being used in the code example above. SVG stands for Scalable Vector Graphics – an extremely lightweight vectorized format that ensures the graphic stays pixel-perfect at any resolution. Note that SVG is not really a pixel image like JPEG or PNG, but a bunch of XML code, which entails certain security implications when using SVG files online.
ggsave is dependent on the settings (e.g. aspect ratio, size, etc.) of your system’s graphics device, which can be inconsistent between sessions. If you notice inconsistencies in how the saved graphics look, consider using the export package – a specialized package for saving graphics and statistical output made in R. Its functions offer more customizable saving options, more formats (file types), and with the right settings ensure consistent output between sessions.
This was the last of the three articles about working with CANSIM data. In the next article in the “Working with Statistics Canada Data in R” series, I’ll move on to working with national census data.
CANSIM data is stored in tables containing data on a specific subject. Individual entries (rows) and groups of entries in these tables are usually assigned vector codes. Thus, unless we already know what table or vector we need, finding the correct table should be our first step.
As always, start with loading packages:
Now we can start retrieving CANSIM data. How we do this, depends on whether we already know the table or vector numbers. If we do, things are simple: just use get_cansim to retrieve data tables, or get_cansim_vector to retrieve vectors.
But usually we don’t. One option is to use StatCan’s online search tool. Eventually you will find what you’ve been looking for, although you might also miss a few things – CANSIM is not that easy to search and work with manually.
A much better option is to let R do the tedious work for you with a few lines of code.
Searching by Index
In this example (based on the code I wrote for an actual research project), let’s look for CANSIM tables that refer to Aboriginal (Indigenous) Canadians anywhere in the tables’ titles, descriptions, keywords, notes, etc.:
# create an index to subset list_cansim_tables() output
index <- list_cansim_tables() %>%
Map(grepl, "(?i)aboriginal|(?i)indigenous", .) %>%
Reduce("|", .) %>%
# list all tables with Aboriginal data, drop redundant cols
tables <- list_cansim_tables()[index, ] %>%
select(c("title", "keywords", "notes", "subject",
Let’s look in detail at what this code does. First, we call the list_cansim_tables function, which returns a tibble dataframe, where each row provides a description of one CANSIM table. To get a better idea of list_cansim_tables output, run:
Then we search through the dataframe for Regex patterns matching our keywords. Note the (?i) flag – it tells Regex to ignore case when searching for patterns; alternatively, you can pass ignore.case = TRUE argument to grepl. The Map function allows to search for patterns in every column of the dataframe returned by list_cansim_tables. This step returns a very long list of logical values.
Our next step is to Reduce the list to a logical vector, where each value is either FALSE if there was not a single search term match per CANSIM table description (i.e. per row of list_cansim_tables output), or TRUE if there were one or more matches. The which function gives us the numeric indices of TRUE elements in the vector.
Finally, we subset the list_cansim_tables output by index. Since there are many redundant columns in the resulting dataframe, we select only the ones that contain potentially useful information.
Searching with dplyr::filter
There is also a simpler approach, which immediately returns a dataframe of tables:
However, keep in mind that this code would search only through the column or columns of the list_cansim_tables output, which were specified inside the grepl call (in this case, in the title column). This results in fewer tables listed in tables: 60 vs 73 you get if you search by index (as of the time of writing this). Often simple filtering would be sufficient, but if you want to be extra certain you haven’t missed anything, search by index as shown above.
( ! ) Note that some CANSIM data tables do not get updated after the initial release, so always check the date_published and the time_period_coverage_* attributes of the tables you are working with.
Saving Search Results
Finally, it could be a good idea to externally save the dataframe with the descriptions of CANSIM data tables in order to be able to view it as a spreadsheet. Before saving, let’s re-arrange the columns in a more logical order for viewers’ convenience, and sort the dataframe by CANSIM table number.
# re-arrange columns for viewing convenience
tables <- tables[c("cansim_table_number", "title", "subject",
"keywords", "notes", "date_published",
"time_period_coverage_end", "url_en")] %>%
# and save externally
write_delim(tables, "selected_data_tables.txt", delim = "|")
( ! ) Note that I am using write_delim function instead of a standard write.csv or tidyverse::write_csv, with | as a delimiter. I am doing this because there are many commas inside strings in CANSIM data, and saving as a comma-separated file would cause incorrect breaking down into columns in a spreadsheet software.
Now, finding the data tables can be as simple as looking through the tables dataframe or through the selected_data_tables.txt file.
Retrieving Data Tables
In order for the examples here to feel relevant and practical, let’s suppose we were asked to compare and visualize the weekly wages of Aboriginal and Non-Aboriginal Canadians of 25 years and older, living in a specific province (let’s say, Saskatchewan), adjusted for inflation.
Since we have already searched for all CANSIM tables that contain data about Aboriginal Canadians, we can easily figure out that we need CANSIM table #14-10-0370. Let’s retrieve it:
( ! ) Before we proceed further, take a look at the VECTOR variable – this is how we can find out individual vector codes for specific CANSIM data. More on that below.
Let’s now subset the data by province, drop redundant variables, and rename the remaining in a way that makes them easier to process in the R language. Generally, I suggest following The tidyverse Style Guide by Hadley Wickham. For instance, variable names should use only lowercase letters, numbers, and underscores instead of spaces:
Next, let’s explore the dataset again. Specifically, let’s see what unique values categorical variables year, group, type, age have. No need to do this with current_dollars, as all or almost all of its values would inevitably be unique due to it being a continuous variable.
The output looks as follows:
$year  “2007” “2008” “2009” “2010” “2011” “2012” “2013” “2014” “2015” “2016” “2017” “2018” $group  “Total population” “Aboriginal population” “First Nations” “Métis” “Non-Aboriginal population” $type  “Total employees” “Average hourly wage rate” “Average weekly wage rate” “Average usual weekly hours” $age  “15 years and over” “15 to 64 years” “15 to 24 years” “25 years and over” “25 to 54 years”
Now we can decide how to further subset the data.
We obviously need the data for as many years as we can get, so we keep all the years from the year variable.
For the group variable, we need Aboriginal and Non-Aboriginal data, but the “Aboriginal” category has two sub-categories: “First Nations” and “Métis”. It is our judgement call which to go with. Let’s say we want our data to be more granular and choose “First Nations”, “Métis”, and “Non-Aboriginal population”.
As far as the type variable is concerned, things are simple: we are only interested in the weekly wages, i.e. “Average weekly wage rate”. Note that we are using the data on average wages because for some reason CANSIM doesn’t provide the data on median wages for Aboriginal Canadians. Using average wages is not a commonly accepted way of analyzing wages, as it allows a small number of people with very high-paying jobs to distort the data, making wages look higher than they actually are. But well, one can only work with the data one has.
Finally, we need only one age group: “25 years and over”.
Having made these decisions, we can subset the data. We also drop two categorical variables (type and age) we no longer need, as both these variables would now have only one level each:
Of course, what I just did step-by-step for demonstration purposes, can be done with a single block of code to minimize typing and remove unnecessary repetition. The “pipe” operator %>% makes this super-easy. In R-Studio, you can use Shift+Ctrl+M shortcut to insert %>%.
Now that we have our weekly wages data, let’s adjust the wages for inflation, otherwise the data simply won’t be meaningful. In order to be able to do this, we need to get the information about the annual changes in the Consumer Price Index (CPI) in Canada, since the annual change in CPI is used as a measure of inflation. Let’s take a look at what CANSIM has on the subject:
# list tables with CPI data, exclude the US
cpi_tables <- list_cansim_tables() %>%
filter(grepl("Consumer Price Index", title) &
!grepl("United States", title))
Even when we search using filter instead of indexing, and remove the U.S. data from search results, we still get a list of 20 CANSIM tables, with multiple vectors in each. How do we choose the correct data vector? There are two main ways we can approach this.
First, we can use other sources to find out exactly what vectors to use. For example, we can take a look at how the Bank of Canada calculates inflation. According to the Bank of Canada’s “Inflation Calculator” web page, they use CANSIM vector v41690973 (Monthly CPI Indexes for Canada) to calculate inflation rates. So we can go ahead and retrieve this vector:
Since the data in the wages_0370 dataset covers the period from 2007 till 2018, we retrieve CPI data for the same period. The function takes two main arguments: vectors – the list of vector numbers (as strings), and start_time – starting date as a string in YYYY-MM-DD format. Since we don’t need data past 2018, we also add an optional end_time argument, which takes a string in the same format as start_time. Let’s take a look at the result of our get_cansim_vector call:
The resulting dataset contains monthly CPI indexes (take a look at the REF_DATE variable). However, our wages_0370 dataset only has the annual data on wages. What shall we do?
Well, one option could be to calculate annual CPI averages ourselves:
Alternatively, we could look through CANSIM tables to find annual CPI values that have already been calculated by Statistics Canada.
Thus, the second way to find which vectors to use, is by looking through the relevant CANSIM tables. This might be more labor-intensive, but can lead to more precise results. Also, we can do this if we can’t find vector numbers from other sources.
Let’s look at cpi_tables. Table # 18-10-0005 has “Consumer Price Index, annual average” in its title, so this is probably what we need.
# get CANSIM table with annual CPI values
cpi_annual_table <- get_cansim("18-10-0005")
Let’s now explore the data:
# explore the data
Turns out, the data is much more detailed than in the vector v41690973! Remember that in wages_0370 we selected the data for a specific province (Saskatchewan)? Well, table # 18-10-0005 has CPI breakdown by province and even by specific groups of products. This is just what we need! However, if you run unique(cpi_annual_table$VECTOR), you’ll see that table # 18-10-0005 includes data from over 2000 different vectors – it is a large dataset. So, how do we choose the right vector? By narrowing down the search:
# find out vector number from CANSIM table
rename(product = "Products and product groups") %>%
filter(GEO == "Saskatchewan" &
product == "All-items") %>%
This gives us CANSIM vector number for the “all items” group CPI for the province of Saskatchewan: v41694489.
Using StatCan Data Search Tool to Find Vectors
Some CANSIM tables are too large to be retrieved using cansim package. For example, running get_cansim(“12-10-0136”) will result in a long wait followed by “Problem downloading data, multiple timeouts” error. cansim will also advise you to “check your network connection”, but network connection is not the problem, it is the size of the dataset.
CANSIM table 12-10-0136 is very large: 16.2 GB. By default, R loads the full dataset into RAM, which can make things painfully slow when dealing with huge datasets. There are solutions for datasets <10 GB in size, but anything larger than 10 GB requires distributed computing. In practice, in R you are likely to start experiencing difficulties and slowdowns if your datasets exceed 1 GB.
Suppose we need to know how much wheat (in dollars) Canada exports to all other countries. CANSIM table 12-10-0136 “Canadian international merchandise trade by industry for all countries” has this data. But how do we get the data from this table if we can’t directly read it into R, and even if we manually download and unzip the dataset, R won’t be able to handle 16.2 GB of data?
This is where CANSIM data vectors come to the rescue. We need to get only one vector instead of the whole enormous table. To do that, we need the vector number, but we can’t look for the vector inside the table because the table is too large.
The solution is to find the correct vector using Statistics Canada Data search tool. Start with entering the table number in the “Keyword(s)” field. Obviously, you can search by keywords too, but search by table number is more precise:
Then click on the name of the table: “Canadian international merchandise trade by industry for all countries”. After the table preview opens, click “Add/Remove data”:
The “Customize table (Add/Remove data)” menu will open. It has the following tabs: “Geography”, “Trade”, “Trading Partners”, “North American Industry Classification System (NAICS)”, “Reference Period”, and “Customize Layout”. Note that the selection of tabs depends on the data in the table.
Now, do the following:
In the “Geography” tab, do nothing – just make sure “Canada” is checked.
In the “Trade” tab, uncheck “Imports”, since we are looking for the exports data.
In the “Trading Partners”, check “Total of all countries” and uncheck everything else.
Skip “Reference period” for now.
In “Customize Layout”, check “Display Vector Identifier and Coordinate”.
Finally, in the “North American Industry Classification System (NAICS)” tab, uncheck “Total of all industries”, find our product of interest – wheat – and check it. Then click “Apply”.
If you followed all the steps, here’s what your output should look like:
The vector number for Canada’s wheat exports to all other countries is v1063958702.
Did you notice that the output on screen has data only for a few months in 2019? This is just a sample of what our vector has. If you click “Reference period”, you’ll see that the table 12-10-0136 has data for the period from January 2002 to October 2019:
Now we can retrieve the data we’ve been looking for:
# get and clean wheat exports data
get_cansim_vector("v1063958702", "2002-01-01", "2019-10-31") %>%
mutate(ref_date = lubridate::ymd(REF_DATE)) %>%
rename(dollars = VALUE) %>%
# check object size
The resulting wheat_exports data object is only 4640 bytes in size: about 3,500,000 times smaller than the table it came from!
Note that I used lubridate::ymd function inside the mutate() call. This is not strictly required, but wheat_exports contains a time series, so it makes sense to convert the reference date column to an object of class “Date”. Since lubridate is not loaded with tidyverse (it is part of the tidyverse ecosystem, but only core components are loaded by default), I had to call the ymd function with lubridate::.
Finally, note that StatCan Data has another search tool that allows you to search by vector numbers. Unlike the data tables search tool, the vector search tool is very simple, so I’ll not be covering it in detail. You can find it here.
Cleaning Vector Data
Let’s now get provincial annual CPI data and clean it up a bit, removing all the redundant stuff and changing VALUE variable name to something in line with The tidyverse Style Guide:
# get mean annual CPI for Saskatchewan, clean up data
get_cansim_vector("v41694489", "2007-01-01", "2018-12-31") %>%
mutate(year = str_remove(REF_DATE, "-01-01")) %>%
rename(cpi = VALUE) %>%
As usual, you can directly feed the output of one code block into another with the %>% (“pipe”) operator:
By this point you may be wondering if there is an inverse operation, i.e. if we can find CANSIM table number if all we know is a vector number? We sure can! This is what get_cansim_vector_info function is for. And we can retrieve the table, too:
# find out CANSIM table number if you know a vector number
# get table if you know a vector number
cpi_table <- get_cansim_vector_info("v41690973")$table %>%
Joining Data (and Adjusting for Inflation)
Now that we have both weekly wages data (wages_0370) and CPI data (cpi_sk), we can calculate the inflation rate and adjust the wages for inflation. The formula for calculating the inflation rate for the period from the base year to year X is: (CPI in year X – CPI in base year) / CPI in base year
If we wanted the inflation rate to be expressed as a percentage, we would have multiplied the result by 100, but here it is more convenient to have inflation expressed as a proportion:
Now join the resulting dataset to wages_0370 with dplyr::left_join. Then use the inflation rates data to adjust wages for inflation with the following formula: base year $ = current year $ / (1 + inflation rate)
# join inflation rate data to wages_0370; adjust wages for inflation
wages_0370 <- wages_0370 %>%
left_join(cpi_sk, by = "year") %>%
mutate(dollars_2007 = round(current_dollars / (1 + infrate), 2))
CANSIM stands for Canadian Socioeconomic Information Management System. Not long ago it was renamed to “Statistics Canada Data,” but here I’ll be using the repository’s legacy name to avoid confusion with other kinds of data available from Statistics Canada. At the time of writing this, there were over 8,700 data tables in CANSIM, with new tables being added virtually every day. It is the most current source of publicly available socioeconomic data collected by the Government of Canada.
Although CANSIM can be directly accessed online, a much more convenient and reproducible way to access it, is through Statistics Canada API using the excellent cansim package developed by Dmitry Shkolnik and Jens von Bergmann. This package allows to search for relevant data tables quickly and precisely, and to read data directly into R without the intermediate steps of manually downloading and unpacking multiple archives, followed by reading each dataset separately into statistical software.
So why do we have to use tidyverse in addition to cansim (apart from the fact that tidyverse is probably the best data processing software in existence)? Well, data in Statistics Canada CANSIM repository has certain bugs features that one needs to be aware of, and which can best be fixed with the instruments included in tidyverse:
First, it is not always easy to find and retrieve the data manually. After all, you’ll have to search through thousands of data tables looking for the data you need. Online search tool often produces results many pages long, which are not well sorted by relevance (or at least that is my impression).
Second, StatCan data is not in the tidy format, and usually needs to be transformed as such, which is important for convenience and preventing errors, as well as for plotting data.
Third, don’t expect the main principles of organizing data into datasets to be observed. For example, multiple variables can be presented as values in the same column of a dataset, with corresponding values stored in the other column, instead of each variable assigned to its own column with values stored in that column (as it should be). If this sounds confusing, the code snippet below will give you a clear example.
Next, the datasets contain multiple irrelevant variables (columns), that result form how Statistics Canada processes and structures data, and do not have much substantive information.
Finally, CANSIM datasets contain a lot of text, i.e. strings, which often are unnecessarily long and cumbersome, and are sometimes prone to typos. Moreover, numeric variables are often incorrectly stored as class “character”.
If you’d like an example of a dataset exhibiting most of these issues, let’s looks at the responses from unemployed Aboriginal Canadians about why they experience difficulties in finding a job. To reduce the size of the dataset, let’s limit it to one province. Run line-by-line:
jobdif_0014 <- get_cansim("41-10-0014") %>%
filter(GEO == "Saskatchewan")
# Examine the dataset - lots of redundant variables.
# All variables except VALUE are of class "character",
# although some contain numbers, not text.
# Column "Statistics" contains variables’ names instead of values,
# while corresponding values are in a totally different column.
Before You Start
This is an optional step. You can skip it, although if you are planning to use cansim often, you probably shouldn’t.
Before you start working with the package, cansim authors recommend to set up cansim cache to substantially increase the speed of repeated data retrieval from StatCan repositories and to minimize web scraping. cansim cache is not persistent between sessions, so do this either at the start of each session, or set the cache path permanently in your Rprofile file (more on editing Rprofile in the upcoming “Working with Census Data” later in this series).
Run (or add to Rprofile):
options(cansim.cache_path = "your cache path")
If your code is going to be executed on different machines, keep in mind that Linux and Windows paths to cache will not be the same:
# Linux path example:
options(cansim.cache_path = "/home/username/Documents/R/.cansim_cache")
# Windows path example:
options(cansim.cache_path = "C:\Users\username\Documents\R\cansim_cache")
This is the Introduction to the series on working with Statistics Canada data in the R language. The goal of the series is to provide some examples (accompanied by detailed in-depth explanations) of working with Statistics Canada data in R. Besides, I’d love to see more economists, policy analysts, and social scientists using R in their work, so I’ll be doing my best to make this easy for people without STEM degrees.
Statistics Canada data is routinely used for economic and policy analysis, as well as for social science research, journalism, and many other applications. It is expected that the reader has some basic R skills.
For the purposes of this series, let’s assume that there are three main types of StatCan data:
Statistics Canada Data, previously known as Canadian Socio-economic Information Management System (CANSIM),
National census data, and
Geographic data provided in a multitude of formats that can be used by GIS software: ArcGIS shapefiles (.shp), Geography Markup Language files (.gml), MapInfo files (.tab), etc.
The “Working with Statistics Canada Data in R” series will follow these data types, and will consist of this Introduction and the following parts: parts 1, 2, and 3 about working with CANSIM data; parts 4, 5, and 6 about Canadian Census data; and parts 7 and 8 about working with StatCan geospatial data.
This is not an official classification of data types available from Statistics Canada. The classification into CANSIM, census, and geographic data is for convenience only, and is loosely based on the key tools used for StatCan data retrieval and processing in R.
The Tools You Need
To be more specific, cansim is the package designed to retrieve CANSIM data, and cancensus is the package to get census data. Further data processing will be done with the tidyverse meta-package (a collection of packages that is itself a package) which is some of the most powerful data manipulation software currently available. GIS data is a more complex matter, but at the very minimum you will need sf, tmap, and units packages. Obviously, just as the R language, all these are completely free and open source. I am not in any way associated with the authors of any of the above packages, I just use them a lot in my work.
Note that although CANSIM has been recently renamed to Statistics Canada Data, I will be using the historic name CANSIM throughout this series in order to distinguish the data obtained from Statistics Canada Data proper from other kinds of StatCan data, i.e. census and geographic data (see how confusing this can get?).
Finally, here’s the code that installs the minimum suite of packages required to run the examples from this series. Note that you might be unable to install sf and units right now, since they have system requirements such as certain libraries being installed, which don’t usually come available “out of the box”. More on sf and units installation in the upcoming “Working with Statistics Canada Geospatial Data” post.