This is the introduction to a series on working with Statistics Canada data in the R language. The goal of the series is to provide examples, accompanied by detailed explanations, of working with Statistics Canada data in R. I’d also love to see more economists, policy analysts, and social scientists using R in their work, so I’ll do my best to make this easy for people without STEM degrees.
Data Types
Statistics Canada data is routinely used for economic and policy analysis, as well as for social science research, journalism, and many other applications. This series assumes that the reader has some basic R skills.
For the purposes of this series, let’s assume that there are three main types of StatCan data:
- Statistics Canada Data, previously known as Canadian Socio-economic Information Management System (CANSIM),
- Canadian Census data, and
- Geographic data such as boundary files provided in a multitude of formats that can be used by GIS software: ArcGIS shapefiles (.shp), Geography Markup Language files (.gml), MapInfo files (.tab), etc.
The “Working with Statistics Canada Data in R” series will follow these data types, and will consist of this Introduction and several articles about working with CANSIM data, Canadian Census data, and StatCan geospatial data.
This is not an official classification of data types available from Statistics Canada. The classification into CANSIM, census, and geographic data is for convenience only, and is loosely based on the key tools used for StatCan data retrieval and processing in R.
The Tools You Need
To be more specific, cansim is the package designed to retrieve CANSIM data, and cancensus is the package to get census data. Further data processing will be done with the tidyverse meta-package (a collection of packages that is itself a package), which is one of the most powerful data manipulation toolkits currently available. GIS data is a more complex matter, but at the very minimum you will need the sf, tmap, and units packages. Just like the R language itself, all of these are completely free and open source. I am not in any way associated with the authors of any of the above packages; I just use them a lot in my work.
Note that although CANSIM has been recently renamed to Statistics Canada Data, I will be using the historic name CANSIM throughout this series in order to distinguish the data obtained from Statistics Canada Data proper from other kinds of StatCan data, i.e. census and geographic data (see how confusing this can get?).
Finally, here’s the code that installs the minimum suite of packages required to run the examples from this series. Note that you might be unable to install sf and units right now, since they depend on system libraries that are not usually available “out of the box”. More on sf and units installation in the upcoming “Working with Statistics Canada Geospatial Data” post.
install.packages(c("cansim", "cancensus", "tidyverse", "tmap"))
# install.packages(c("sf", "units"))
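If you are not sure which of these packages are already on your machine, a quick check along the following lines may help. This is only a minimal sketch in base R: the helper name missing_packages() is hypothetical (it is not part of any package mentioned here), and it relies on requireNamespace(), which returns FALSE when a package cannot be loaded.

```r
# Packages used in this series; sf and units also need system libraries.
series_pkgs <- c("cansim", "cancensus", "tidyverse", "tmap", "sf", "units")

# Hypothetical helper (an assumption for illustration, not from any package):
# returns the names of the packages that cannot be loaded in this R session.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package loads, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(series_pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means everything is in place. If sf or units still show up after an installation attempt, the likely cause is the missing system libraries mentioned above.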