bcdata: An R package for searching & retrieving data from the B.C. Data Catalogue

bcdata is an R package that connects publicly available metadata and data sets in the British Columbia (B.C.) Data Catalogue (DataBC Program (2020)) to the diverse array of mapping, modeling and data processing capabilities of the R ecosystem. bcdata enables the efficient retrieval of British Columbia’s geospatial data, and supports repeatable and reproducible analysis of hundreds of open-licensed British Columbia public sector data sets. By enabling programmatic access to the B.C. Data Catalogue using familiar R dplyr syntax (Wickham et al. (2020)), bcdata helps both novice and experienced R users find and use British Columbia government public and open data holdings.


Introduction
The British Columbia government hosts over 2000 tabular and geospatial data sets in the B.C. Data Catalogue. Most provincial geospatial data is available through the B.C. Data Catalogue under an open licence, via a Web Feature Service. A Web Feature Service is a powerful and flexible service for distributing geographic features over the web, supporting both geospatial and non-spatial querying. The bcdata package for the R programming language (R Core Team (2017)) wraps two distinct but complementary web application programming interfaces: one for the B.C. Data Catalogue and one for the Web Feature Service. This allows R users to search, download and import metadata and data from the B.C. Data Catalogue, as well as efficiently query and directly read geospatial data from the Web Feature Service into their R session. The bcdata package implements a novel application of dbplyr (Wickham & Ruiz (2020)) that uses a Web Feature Service backend, rather than a database backend, in which a locally constructed query is processed by a remote server. This allows for fast and efficient geospatial data retrieval while using dplyr syntax. Through this functionality the bcdata package connects British Columbia government public data holdings in the B.C. Data Catalogue with the vast capabilities of R.

Related Work
Open data and geospatial data science are currently popular topics in the R community. Packages related to bcdata include ckanr (Chamberlain et al. (2021)) for interacting with CKAN instances, and ows4R (Blondel (2020)), which provides a low-level R6 interface to Open Geospatial Consortium Web Services. bcdata unifies these operations for B.C. public data holdings, and provides a user-friendly interface using a functional programming style that is familiar to users of the popular tidyverse tools (Wickham et al. (2019)). Many packages are available for other jurisdictional data portals (e.g., opendatatoronto, opendataes); however, as far as the authors are aware, no other package provides dplyr-like syntax for querying large geospatial data sets via a Web Feature Service.

Usage
The user can retrieve the metadata for a single catalogue record by using the record name or permanent ID with bcdc_get_record(). A catalogue record can have one or multiple data files, or "resources". The user can use the bcdc_tidy_resources() function to return a data frame listing all of the data resources and corresponding resource IDs for a catalogue record.
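As a minimal sketch of this step, the helper below chains the two functions named above. The helper name list_record_resources and its record_id argument are illustrative and not part of bcdata. Calling it requires the bcdata package and an internet connection.

```r
# Illustrative helper (not part of bcdata): fetch a record's metadata and
# list its resources. `record_id` is a record name or permanent ID that the
# caller supplies.
list_record_resources <- function(record_id) {
  # Retrieve the metadata for a single catalogue record
  record <- bcdata::bcdc_get_record(record_id)
  # Return a data frame of the record's resources and their resource IDs
  bcdata::bcdc_tidy_resources(record)
}

# Usage (needs network access; substitute a real record name or permanent ID):
# resources <- list_record_resources("<record-name-or-id>")
```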

Get Data
Once the user has located the B.C. Data Catalogue record with the data they want, bcdata::bcdc_get_data() can be used to download and read the data from the record. The record name, the permanent ID or the result from bcdc_get_record() can each be used to specify the data record. Supplying the permanent ID to the record argument is the most reliable choice, because it guards against future changes to the human-readable record name.
For example, requesting the record for scholarships in B.C. schools without naming a resource produces an interactive prompt:

    The record you are trying to access appears to have more than one resource.
    --------
    Please choose one option:
    1: AwardsScholarshipsHist.xlsx
    2: AwardsScholarshipsHist.txt

Since there are multiple data resources in the record, the user needs to specify which one they want. bcdata lets the user choose a resource interactively; for scripts, however, it is usually better to be explicit and specify the desired data resource using the resource argument. In this case we are interested in the .xlsx file, so we either choose option 1 at the prompt or pass that resource's ID to the resource argument.
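A non-interactive call of this kind might look like the sketch below. The helper name get_record_resource is illustrative and not part of bcdata. Both IDs are left as arguments because the actual record and resource IDs are not given here; they can be looked up with bcdc_tidy_resources().

```r
# Illustrative helper (not part of bcdata): download one specific resource
# from a record without triggering the interactive prompt.
# `record_id`   : permanent ID (or name) of the catalogue record
# `resource_id` : ID of the desired resource, e.g. the .xlsx file
get_record_resource <- function(record_id, resource_id) {
  bcdata::bcdc_get_data(record = record_id, resource = resource_id)
}

# Usage (needs network access; substitute real IDs):
# scholarships <- get_record_resource("<record-id>", "<resource-id>")
```

Naming the resource explicitly keeps a script reproducible, since the result does not depend on a choice made at the console.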

Query & Read Geospatial Data
While bcdc_get_data() will retrieve geospatial data, sometimes the geospatial file is very large and slow to download, or the user may only want some of the data. bcdc_query_geodata() allows the user to query catalogue geospatial data available from the Web Feature Service using select and filter functions (just like in dplyr). The bcdata::collect() function returns the bcdc_query_geodata() query results as an sf object (Pebesma (2018)) in the R session. The query is processed on the server, filtering the data to only those records and fields the user has specified. Only when the query is complete and the user requests the final result is the filtered data downloaded and loaded into R as an sf object, substantially reducing the size of the download. This functionality is implemented using a custom dbplyr backend: while other dbplyr backends interface with various databases (e.g., SQLite, PostgreSQL), the bcdata backend interfaces with the B.C. Data Catalogue Web Feature Service.
To demonstrate, we will query the Vancouver Island Marmot location polygons from the publicly available Species and Ecosystems at Risk Occurrences geospatial data. The whole file takes over 100 seconds to download and we only need the marmot polygons, so the request can be narrowed by filtering on the server before downloading. The filter-first, then-download approach is efficient: the object downloaded using bcdc_query_geodata() with filter() is 1118 times smaller than the entire data set downloaded using bcdc_get_data() and filtered locally.
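The narrowed request described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' exact code: the helper name query_marmot_polygons is hypothetical, the record ID is left for the caller to supply, and the column name SCI_NAME and the value "Marmota vancouverensis" are assumed to match the layer's attribute table. Calling the helper requires the bcdata, dplyr and sf packages and an internet connection.

```r
# Illustrative helper (not part of bcdata): filter on the server, then download.
# `record_id` is the name or permanent ID of the Species and Ecosystems at Risk
# Occurrences record, supplied by the caller.
query_marmot_polygons <- function(record_id) {
  # Set up the query against the Web Feature Service; the full data set
  # is not downloaded at this point
  query <- bcdata::bcdc_query_geodata(record_id)

  # Add a filter that is evaluated by the server.
  # ASSUMPTION: the layer has a SCI_NAME column holding scientific names.
  filtered <- dplyr::filter(query, SCI_NAME == "Marmota vancouverensis")

  # Download only the matching features and return them as an sf object
  dplyr::collect(filtered)
}

# Usage (needs network access; substitute the real record name or ID):
# marmots <- query_marmot_polygons("<record-name-or-id>")
```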

Conclusion
The bcdata R package connects R users with the British Columbia government's vast collection of data holdings in the B.C. Data Catalogue through an efficient and familiar interface. This enables the use of cutting-edge statistical and plotting capabilities in a modern data science context, and provides a pathway to generate important insights from open and public data.