Tag: cancensus (package)

Working with Statistics Canada Data in R, Part 5: Retrieving Census Data

Back to Working with Statistics Canada Data in R, Part 4.
Forward to Working with Statistics Canada Data in R, Part 6.

Introduction

Now that we are ready to start working with Canadian Census data, let’s first briefly address the question why you may need to use it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered every five years in respect of a limited number of questions.

The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country’s largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you’ll have to use the Census.

To illustrate the use of cancensus package, let’s do a small research project. First, in this post we’ll retrieve the following key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada:

  • Labor force participation rate, employment rate, and unemployment rate.
  • Percent of workers by work situation: full time vs part time, by gender.
  • Education levels of people aged 25 to 64, by gender.

The cities (metropolitan areas) that we are going to look at, are: Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We’ll also get these data for Canada as a whole for comparison and to illustrate the retrieval of data at different geographic levels

Next, in Part 6 of the “Working with Statistics Canada Data in R” series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks.

Keep in mind that cancensus also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp and sf formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in these series.

So, let’s get started by loading the required packages:

library(cancensus)
library(tidyverse)

Searching for Data

cancensus retrieves census data with the get_census function. get_census can take a number of arguments, the most important of which are dataset, regions, and vectors, which have no defaults. Thus, in order to be able to retrieve census data, you’ll first need to figure out:

  • your dataset,
  • your region(s), and
  • your data vector(s).

Find Census Datasets

Let’s see which census datasets are available through the CensusMapper API:

list_census_datasets()

Currently, datasets earlier than 2001 are not available, so if you need to work with the 20th century census data, you won’t be able to retrieve it with cancensus.

Find Census Regions

Next, let’s find the regions that we’ll be getting the data for. To search for census regions, use the search_census_regions function.

Let’s take a look at what region search returns for Toronto. Note that cancensus functions return their output as dataframes, making it is easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)] to see all of it.

# all census levels
search_census_regions(searchterm = "Toronto", 
                      dataset = "CA16")[c(1:5, 8)]

Returns:

A tibble: 3 x 6
# region  name    level     pop municipal_status PR_UID
                          
1 35535   Toronto CMA   5928040 B                35    
2 3520    Toronto CD    2731571 CDR              35    
3 3520005 Toronto CSD   2731571 C                35    

You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column ‘level’ for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the ‘PR_UID’ column that contains numeric codes for Canada’s provinces and territories. These codes can help you distinguish between different census regions that have same or similar names but are located in different provinces. For an example, run the code above replacing “Toronto” with “Windsor”.

Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level argument, which can take the following values: ‘C’ for Canada (national level), ‘PR’ for province, ‘CMA’ for census metropolitan area, ‘CD’ for census division, ‘CSD’ for census sub-division, or NA:

# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")

Let’s now list census regions that may be relevant for our project:

# explore available census regions
names <- c("Canada", "Calgary", "Halifax", 
           "Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))

purrr::map_df function applies search_census_regions iteratively to each element of the names vector and returns output as a single dataframe. Note also the ~ . syntax. Think of it as the tilde taking each element of names and passing it as an argument to a place indicated by the dot in the search_census_regions function. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.

Since there are multiple entries for each search term, we’ll need to choose the results for census metropolitan areas, or in case of Whitehorse, for census sub-division, since Whitehorse is too small to be considered a metropolitan area:

# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>% 
  filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) &
         grepl("CMA", level) | 
         grepl("Canada|Whitehorse$", name)) %>% 
  as_census_region_list()

Pay attention to how the logical operators are used to filter the output by several conditions at once; also note using $ regex meta-character to choose from the ‘names’ column the entry ending with ‘Whitehorse’ (to filter out ‘Whitehorse, Unorganized’.

Finally, as_census_region_list converts list_census_regions output to a data object of type list that can be passed to the get_census function as its regions argument.

Find Census Vectors

Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census function.

cancensus has two functions that allow you to search through census data variables: list_census_vectors and search_census_vectors.

list_census_vectors returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:

# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))

# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))

As you can see, there are 6623 (as of the time of writing this) variables in the 2016 census dataset, so list_census_vectors won’t be the most convenient function to find a specific vector. Note however that there are situations (such as when you need to select a lot of vectors at once), in which list_census_vectors would be appropriate.

Usually it is more convenient to use search_census_vectors to search for vectors. Just pass the text string of what you are looking for as the searchterm argument. You don’t have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term.

Let’s now find census data vectors for labor force involvement rates:

# get census data vectors for labor force involvement rates
lf_vectors <- 
  search_census_vectors(searchterm = "employment rate", 
                        dataset = "CA16") %>% 
  union(search_census_vectors("participation rate", "CA16")) %>% 
  filter(type == "Total") %>% 
  pull(vector)

Let’s take a look at what this code does. Since searchterm doesn’t have to be a precise match, “employment rate” search term retrieves unemployment rate vectors too. In the next line, union merges dataframes returned by search_census_vectors into a single dataframe. Note that in this case union could be substituted with bind_rows. I recommend using union in order to avoid data duplication. Next, we choose only the “Total” numbers, since we are not going to plot labor force indicators by gender. Finally, the pull command extracts a single vector from the dataframe, just like the $ subsetting operator: we need ‘lf_vectors’ to be a data object of type vector in order to pass it to the vectors argument of the get_census function.

There is another way to figure out search terms to put inside the search_census_vectors function: use Statistics Canada online Census Profile tool. It can be used to quickly explore census data as well as to figure out variable names (search terms) and their hierarchical structure.

For example, let’s look at the census labor data for Calgary metropolitan area. Scrolling down, you will quickly find the numbers and text labels for full-time and part-time workers:

Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data:

# get census data vectors for full and part time work

# get vectors and labels    
work_vectors_labels <- 
  search_census_vectors("full year, full time", "CA16") %>% 
  union(search_census_vectors("part year and/or part time", "CA16")) %>% 
  filter(type != "Total") %>% 
  select(1:3) %>% 
  mutate(label = str_remove(label, ".*, |.*and/or ")) %>% 
  mutate(type = fct_drop(type)) %>% 
  setNames(c("vector", "gender", "type"))

# extract vectors
work_vectors <- work_vectors_labels$vector

Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no “Total” values here), we are creating a dataframe that assigns respective labels to each vector number. This work_vectors_labels dataframe will supply categorical labels to be attached to the data retrieved with get_census.

Also, note these three lines:

  mutate(label = str_remove(label, ".*, |.*and/or ")) %>% 
  mutate(type = fct_drop(type)) %>% 
  setNames(c("vector", "gender", "type"))

The first mutate call removes all text up to and including ‘, ‘ and ‘and/or ‘ (spaces included) from the ‘label’ column. The second drops unused factor level “Total” – it is a good practice to make sure there are no unused factor levels if you are going to use ggplot2 to plot your data. Finally, setNames renames variables for convenience.

Finally, let’s retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I’d like to draw your attention to the fact that some of the census data is hierarchical, which means that some variables (census vectors) are included into parent and/or include child variables. It is very important to choose vectors at proper hierarchical levels so that you do not double-count or omit your data.

Education data is a good example of hierarchical data. You can explore data hierarchy using parent_census_vectors and child_census_vectors functions as described here. However, you may find exploring the hierarchy visually to be more convenient:

So, let’s now retrieve and label the education data vectors:

# get vectors and labels
ed_vectors_labels <-
  search_census_vectors("certificate", "CA16") %>%
  union(search_census_vectors("degree", "CA16")) %>%
  union(search_census_vectors("doctorate", "CA16")) %>%
  filter(type != "Total") %>%
  filter(grepl("25 to 64 years", details)) %>%
  slice(-1,-2,-7,-8,-11:-14,-19,-20,-23:-28) %>%
  select(1:3) %>%
  mutate(label =
           str_remove_all(label,
                          " cert.*diploma| dipl.*cate|, CEGEP| level|")) %>%
  mutate(label =
           str_replace_all(label, 
                           c("No.*" = "None",
                             "Secondary.*" = "High school or equivalent",
                             "other non-university" = "equivalent",
                             "University above" = "Cert. or dipl. above",
                             "medicine.*" = "health**",
                             ".*doctorate$" = "Doctorate*"))) %>%
  mutate(type = fct_drop(type)) %>%
  setNames(c("vector", "gender", "level"))

# extract vectors
ed_vectors <- ed_vectors_labels$vector

Note the slice function that allows to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice to drop the hierarchical levels from the data that are either too general or too granular. Note also that I had to edit text strings in the data. Finally, I added asterisks after “Doctorate” and “health”. These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on.

Now that we have figured out our dataset, regions, and data vectors (and labelled the vectors, too), we are finally ready to retrieve the data.

Retrieve Census Data

To retrieve census data, feed the dataset, regions, and data vectors into get_census as its’ respective arguments. Note that get_census has use_cache argument (set to TRUE by default), which tells get_census to retrieve data from cache if available. If there is no cached data, the function will query CensusMapper API for the data and will save it in the cache, while use_cache = FALSE will force get_census to query the API and update the cache.

# get census data for labor force involvement rates
# feed regions and vectors into get_census()
labor <- 
  get_census(dataset = "CA16", 
             regions = regions,
             vectors = lf_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  setNames(c("region", "employment rate", 
             "unemployment rate", 
             "participation rate")) %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  pivot_longer("employment rate":"participation rate", 
               names_to = "indicator",
               values_to = "rate") %>% 
  mutate_if(is.character, as_factor)

The select call drops columns with irrelevant data. setNames renames columns to remove vector numbers from variable names – we don’t need vector numbers in variable names because variable names will be converted to values in the ‘indicator’ column. str_remove inside the mutate call drops municipal status codes ‘(B)’ and ‘(CY)’ from region names. Finally, mutate_if converts characters to factors for subsequent plotting.

An important function here is tidyr::pivot_longer. It converts the dataframe from wide to long format. It takes three columns: ‘employment rate’, ‘unemployment rate’, and ‘participation rate’, and passes their names to values of the ‘indicator’ variable, while their numeric values are passed to the ‘rate’ variable. The reason for conversion is that we are going to plot the data for all three labor force indicators in the same graphic, which makes it necessary to store the indicators as a single factor variable.

Next, let’s retrieve census data about the percent of full time vs part time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender:

# get census data for full time and part time work
work <- 
  get_census(dataset = "CA16", 
             regions = regions,
             vectors = work_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  rename(region = "Region Name") %>% 
  pivot_longer(2:5, names_to = "vector", 
                    values_to = "count") %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  mutate(vector = str_remove(vector, ":.*")) %>% 
  left_join(work_vectors_labels, by = "vector") %>% 
  mutate(gender = str_to_lower(gender)) %>% 
  mutate_if(is.character, as_factor)
# get census data for education levels
education <- 
  get_census(dataset = "CA16", 
             regions = regions,
             vectors = ed_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  rename(region = "Region Name") %>% 
  pivot_longer(2:21, names_to = "vector", 
                     values_to = "count") %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  mutate(vector = str_remove(vector, ":.*")) %>% 
  left_join(ed_vectors_labels, by = "vector") %>% 
  mutate_if(is.character, as_factor)

Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join function that joins labels to the census data.

We now have the data and are ready to visualize it, which will be done in the next part of this series.

Annex: Notes and Definitions

For those of you who are outside of Canada, Canada’s geographic regions and their largest metropolitan areas are:

  • The Atlantic Provinces – Halifax
  • Central Canada – Toronto
  • The Prairie Provinces – Calgary
  • The West Coast – Vancouver
  • The Northern Territories – Whitehorse

These regions should not be confused with 10 provinces and 3 territories, which are Canada’s sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada’s geographic regions and territorial structure here (pages 44 to 51).

For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada’s Guide to the Labour Force Survey.

You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts.

Working with Statistics Canada Data in R, Part 4: Canadian Census Data – cancensus Package Setup

Back to Working with Statistics Canada Data in R, Part 3.
Forward to Working with Statistics Canada Data in R, Part 5.

What is cancensus

In the Introduction to the “Working with Statistics Canada Data in R” series, I discussed the three main types of data available from Statistics Canada. It is now time to move on to the second of those data types – the Canadian National Census data.

cancensus is a specialized R package that allows you to retrieve Statistics Canada census data and geography. It follows the tidy approach to data processing. The package’s authors are Jens von Bergmann, Dmitry Shkolnik, and Aaron Jacobs. I am not associated with the authors, I just use this great package in my work on a regular basis. The package’s GitHub code repository can be found here. There is also a tutorial by the package’s authors, which I recommend taking a look at before proceeding.

Further in this series, I will provide an in-depth real-life example of working with Canadian Census data using the cancensus package. But first, let’s install cancensus:

install.packages("cancensus")
library(cancensus)

Setting up cancensus: Adding API Key and Cache Path to .Rprofile

cancensus relies on queries to the CensusMapper API, which requires a CensusMapper API key. The key can be obtained for free as per the package’s authors instructions. Once you have the key, you can add it to your R environment:

options(cancensus.api_key = "CensusMapper_your_api_key")

Note that although the authors are warning that API requests are limited in volume, I have significantly exceeded my API quota on some occasions, and still had no issues with retrieving data.

That said, depending on how much data you need, you can draw down your quota very quickly, and here’s where local cache comes to the rescue. cancensus caches data every time you retrieve it, but by default the cache is not persistent between sessions. To make it persistent, as well as to remove the need to enter your API key every time you use cancensus, you can add both the API key and the cache path to your .Rprofile file.

If you’d like to learn in-depth what .Rprofile is, what it is for, and how to edit it, consider taking a look at sections 2.4.2 to 2.4.5 in “Efficient R Programming” by Colin Gillespie and Robin Lovelace. For a quick and simple edit, just keep reading.

Editing .Rprofile on Linux

First, find your R home directory with R.home(). Most likely, it will be in /usr/lib/R. If that is where your R home is, in the Linux Terminal (not in R), run:

sudo nano /usr/lib/R/library/base/R/.Rprofile # edit path if needed

On some systems, .Rprofile may not be hidden, so if the above command doesn’t open the .Rprofile file, try removing the dot before ‘Rprofile’:

sudo nano /usr/lib/R/library/base/R/Rprofile

(!) Note that this will edit the system .Rprofile file, which will always run on startup and will apply to all your R projects. The file itself will warn you that “it is a bad idea to use this file as a template for personal startup files”. You can safely ignore this warning as long as the only edit you are making is the one shown below, i.e. adding cancensus cache path and API key.

In the “options” section, add these two lines:

options(cancensus.api_key = "CensusMapper_your_api_key")
options(cancensus.cache_path = "/home/your_username/path_to_your_R_directory/.cancensus_cache")

Then hit Ctrl+X, choose Y for “yes”, and press Enter to save changes. When you first retrieve data with cancensus, R will create .cancensus_cache directory for you.

Editing .Rprofile on Windows

Editing .Rprofile on Windows is a bit tricky. The best thing for you would be not to touch Windows system .Rprofile, or else risk weird errors and crashes (honestly, I was not able to figure out where they come from).

Instead, set up a project-specific .Rprofile. The downside is that you may need to set it separately for every project in which you are going to use cancensus. The upside is that the contents of the .Rprofile file should be exactly the same every time. In R or Rstudio (not in the command line), run:

file.edit(".Rprofile")

Then, add these two lines to the “options” section, and save the file:

options(cancensus.api_key = "CensusMapper_your_api_key")
options(cancensus.cache_path = "C:\\Users\\Home\\Documents\\R\\cancensus_cache")

Note that when used inside R, \ symbols in file paths in Windows may need to be escaped with another \ symbol. If this file path doesn’t work, try replacing duplicate \\ with single \.

At this point you should be ready to start retrieving data with cancensus, which will be addressed in detail in my next post.

Working with Statistics Canada Data in R, Introduction

Forward to Working with Statistics Canada Data in R, Part 1.

This is the Introduction to the series on working with Statistics Canada data in the R language. The goal of the series is to provide some examples (accompanied by detailed in-depth explanations) of working with Statistics Canada data in R. Besides, I’d love to see more economists, policy analysts, and social scientists using R in their work, so I’ll be doing my best to make this easy for people without STEM degrees.

Data Types
The Tools You Need

Data Types

Statistics Canada data is routinely used for economic and policy analysis, as well as for social science research, journalism, and many other applications. It is expected that the reader has some basic R skills.

For the purposes of this series, let’s assume that there are three main types of StatCan data:

  • Statistics Canada Data, previously known as Canadian Socio-economic Information Management System (CANSIM),
  • National census data, and
  • Geographic data provided in a multitude of formats that can be used by GIS software: ArcGIS shapefiles (.shp), Geography Markup Language files (.gml), MapInfo files (.tab), etc.

The “Working with Statistics Canada Data in R” series will follow these data types, and will consist of this Introduction and the following parts: parts 1, 2, and 3 about working with CANSIM data; parts 4, 5, and 6 about Canadian Census data; and parts 7 and 8 about working with StatCan geospatial data.

This is not an official classification of data types available from Statistics Canada. The classification into CANSIM, census, and geographic data is for convenience only, and is loosely based on the key tools used for StatCan data retrieval and processing in R.

The Tools You Need

To be more specific, cansim is the package designed to retrieve CANSIM data, and cancensus is the package to get census data. Further data processing will be done with the tidyverse meta-package (a collection of packages that is itself a package) which is some of the most powerful data manipulation software currently available. GIS data is a more complex matter, but at the very minimum you will need sf, tmap, and units packages. Obviously, just as the R language, all these are completely free and open source. I am not in any way associated with the authors of any of the above packages, I just use them a lot in my work.

Note that although CANSIM has been recently renamed to Statistics Canada Data, I will be using the historic name CANSIM throughout this series in order to distinguish the data obtained from Statistics Canada Data proper from other kinds of StatCan data, i.e. census and geographic data (see how confusing this can get?).

Finally, here’s the code that installs the minimum suite of packages required to run the examples from this series. Note that you might be unable to install sf and units right now, since they have system requirements such as certain libraries being installed, which don’t usually come available “out of the box”. More on sf and units installation in the upcoming “Working with Statistics Canada Geospatial Data” post.

install.packages(c("cansim", "cancensus", "tidyverse", "tmap"))
# install.packages(c("sf", "units"))

Continue to Working with Statistics Canada Data in R, Part 1.