To view this project on GitHub, please visit: https://nathankchan.github.io/covid-19-survey-analysis/
NB: Show or hide all code snippets using the Code button located in the upper-right corner.
This R project produces an analysis of self-reported web-based survey data. It is intended to serve as a code demonstration for several advanced statistical methods, including:
This project makes extensive use of R packages created by Derek Beaton, especially for OuRS and TExPosition. Derek - thank you for the time and dedication you took to developing these tools and enabling multivariate mixed data analyses!
Data used in the preparation of this project was obtained from the COVID-19 Behavior Determinants Database. Data collection and sharing of this database was funded by the Centre for Addiction and Mental Health (CAMH) Foundation.
The COVID-19 pandemic introduced unprecedented and disruptive changes to society. Characterizing how such changes have impacted individual health and wellness is important for informing public health policy discussions.
To this end, scientists at CAMH conducted a self-reported web-based survey of persons located in the U.S. or Canada, collecting information about their socioeconomic & demographic status, mental wellness, and behaviour during the COVID-19 pandemic. Participants were located in the most populous U.S. states (New York, California, Florida, and Texas) and Canadian provinces (excluding Quebec). Data were collected at three time points: May 2020, July 2020, and March 2021.
Scientists previously reported findings from this database using a traditional hypothesis-driven approach, a common paradigm in academic research for generating formal evidence in support of a proposed explanation. However, hypotheses that were never considered or tested could reveal other important findings; given the size of the database, exhaustively testing every possible hypothesis would be both impractical and statistically inappropriate.
Using a data-driven approach may reveal additional information from the database. Data-driven approaches are designed to generate new hypotheses from the data instead of supportive evidence for a particular explanation. They aim to characterize the underlying “variance structure” of a dataset, helping the user intuitively grasp associations between groups of variables and potentially identify unexpected relationships.
Thus, the aim of this project is to investigate the COVID-19 Behavior Determinants Database using an exploratory, data-driven approach. In doing so, I hope to identify the strongest associations, and any unexpected ones, between variables of interest and the other variables in the dataset.
This analysis requires R. If R is not installed, please visit r-project.org to download the latest version.
This project was built with R version 4.1.2 "Bird Hippie". To reproduce this analysis exactly, please ensure that the same package versions are installed on your machine.
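As a convenience, the version requirement above can be checked before sourcing anything; this snippet is a hypothetical helper for readers, not part of the project's scripts:

```r
# Warn if the running R version differs from the one used to build the project
if (getRversion() != "4.1.2") {
  warning("This project was built with R 4.1.2; results may differ under R ",
          getRversion())
}
```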
This analysis requires the following R packages:
This analysis also requires the following GitHub packages:
To automatically install the required packages, open RStudio and source 00_init.R (see below for an example). The script will ask for your permission before installing anything. Please ensure all packages install successfully before proceeding.
Please restart R and re-source 00_init.R if any packages were installed. Packages must be loaded in order to avoid namespace conflicts; failing to re-source 00_init.R may result in errors.
source(paste0(getwd(), "/scripts/00_init.R"))
## ./R/functions.R is loaded.
## Loading required package: remotes
## Loading required package: dataverse
## Loading required package: plot.matrix
## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: knitr
## Loading required package: kableExtra
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks mice::filter(), stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
## Loading required package: haven
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: htmlwidgets
## Loading required package: shiny
## Loading required package: colorspace
## Loading required package: ExPosition
## Loading required package: prettyGraphs
## Loading required package: TExPosition
## Loading required package: GSVD
## Loading required package: GPLS
## Loading required package: ours
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'ours'
## The following objects are masked from 'package:GPLS':
##
## ca_preproc, escofier_coding, thermometer_coding
## All required packages are loaded
## ./scripts/00_init.R was executed
This project uses a series of R scripts to automate data processing and analysis. Scripts are housed in ./scripts/, and each script starts by sourcing the immediately preceding script. Scripts write computationally expensive intermediates to ./output/ and pass R objects downstream. This system enables the user to make changes at any point in the analysis pipeline while balancing performance needs.
To improve performance, a computationally expensive intermediate is loaded by a script only if (1) the file is available AND (2) the input to the intermediate is unchanged since the script was last run. Otherwise, the intermediate is re-computed and written out before continuing. Note that certain steps (e.g., multiple imputation as performed in 02_cleandata.R) may take several hours to complete if their intermediates are modified or deleted.
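The load-or-recompute pattern described above can be sketched as follows. Note that load_or_compute() and its arguments are hypothetical placeholders for illustration; the project's scripts implement this logic their own way:

```r
# Sketch of the load-or-recompute caching pattern described above.
# `input_path`, `cache_path`, and `compute_intermediate` are hypothetical
# placeholders, not functions or paths from this project.
load_or_compute <- function(input_path, cache_path, compute_intermediate) {
  if (file.exists(cache_path) &&
      file.mtime(cache_path) >= file.mtime(input_path)) {
    # Cached intermediate exists and is newer than its input: reuse it
    readRDS(cache_path)
  } else {
    # Otherwise re-compute the intermediate and write it out before continuing
    result <- compute_intermediate(input_path)
    saveRDS(result, cache_path)
    result
  }
}
```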
The remainder of this report is divided into several sections.
Data extraction & pre-processing demonstrates an automated process to download data from Harvard Dataverse. The report then examines the structure of the data and prepares it for later analysis. Key topics discussed include data cleaning and imputation of missing data.
Outlier analysis takes the prepared dataset and examines it for any potential outliers by computing a “multivariate standard deviation”. The most important variables contributing to the “outlierness” of observations are also identified. Key topics discussed include Mahalanobis distances and the Garthwaite-Koch partition.
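As a minimal illustration of the distance underlying this step, base R's stats::mahalanobis() computes squared Mahalanobis distances of observations from a multivariate centre. The toy data below are invented for demonstration and are unrelated to the survey:

```r
# Toy illustration of Mahalanobis distances (not the project's data)
set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)  # 100 observations, 3 variables
x[1, ] <- c(6, -6, 6)                  # plant one obvious outlier
# Squared Mahalanobis distance of each row from the multivariate mean
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
# The planted outlier has by far the largest squared distance
which.max(d2)
```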
Finally, inferential analysis illustrates the magnitude and direction of statistical associations between groups of variables of interest (e.g., responses about vaccine hesitancy behaviours) and all other variables in the dataset. Key topics discussed include correspondence analysis and data visualization.
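Correspondence analysis can be sketched in base R as an SVD of a contingency table's standardized residuals; the table below is a toy example, and the project itself uses the ExPosition/TExPosition implementations rather than this hand-rolled version:

```r
# Toy correspondence analysis via SVD of standardized residuals.
# The contingency table here is invented for illustration.
tab <- matrix(c(20,  5, 10,
                 5, 25,  5,
                10,  5, 15), nrow = 3, byrow = TRUE)
P <- tab / sum(tab)               # correspondence matrix (proportions)
r <- rowSums(P); c <- colSums(P)  # row and column masses
E <- outer(r, c)                  # expected proportions under independence
S <- (P - E) / sqrt(E)            # standardized (chi-square) residuals
dec <- svd(S)
# Row factor scores on the first dimension (principal coordinates):
# their signs and magnitudes show the direction and strength of association
F1 <- (dec$u[, 1] * dec$d[1]) / sqrt(r)
```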