This project is available on GitHub: https://github.com/nathankchan/shopify-ds-intern-challenge

NB: Show or hide all code snippets using the Code button located in the upper right corner.


Question 1: Analysis of AOV

Prompt

Given some sample data, write a program to answer the following: click here to access the required data set

  • On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

    1. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.
    2. What metric would you report for this dataset?
    3. What is its value?

Executive Summary

  1. See Part 1a. There are outliers in the dataset. Other measures of central tendency are more appropriate.
  2. See Part 1b. The most appropriate metric to report is the geometric mean.
  3. See Part 1c. The value of the geometric mean is $285.02.


For convenience, summary statements are provided in blue in each section.


If you already have R installed, consider downloading this project from GitHub to run your own custom analysis! Open a terminal, navigate to the repo, and type in Rscript app.R for an interactive GUI that produces histogram-density plots and key metrics.

Alternatively, visit the Histogram-Density Plot Viewer Shiny app online and upload your data for your own interactive analysis.

Getting started

This analysis requires R to be installed. If it is not installed, please visit r-project.org to download the latest version.

This analysis requires the following R packages:

  • knitr
  • kableExtra
  • tidyverse
  • readxl
  • plotly
  • htmlwidgets
  • shiny

If any of these packages are not installed, this analysis will ask for your permission to install them before proceeding.


Introduction

To ensure this analysis is reproducible, a series of scripts (see Appendix) initializes the environment and loads the “2019 Winter Data Science Intern Challenge Data Set.xlsx” Excel file into R. Data are stored in the tibble shopifydata. For ease of access and readability, shopifydata is also attached to the R search path.


# NB: If any required packages are missing and you are running this code block
# within an R Notebook, it will fail to execute properly. This behaviour is
# intentional and acts as a safety mechanism to ensure the user provides
# explicit consent before installing packages. If this occurs and you wish to
# install the packages, copy the code below into your terminal or console, then
# continue with the rest of this document.

source(paste0(getwd(), "/scripts/01_loaddata.R"))
attach(shopifydata)


It is generally good practice to examine the data before moving further. shopifydata contains 5000 rows and 7 columns describing sales data from sneaker shops. For readability, only the first few rows are displayed. As seen below, order value is contained in the column order_amount.

kbl(head(shopifydata), 
    caption = "Table 1: 2019 Winter Data Science Intern Challenge Data Set") %>%
  kable_styling("hover", full_width = F) %>%
  scroll_box(
    height = "250px")
Table 1: 2019 Winter Data Science Intern Challenge Data Set
order_id shop_id user_id order_amount total_items payment_method created_at
1 53 746 224 2 cash 2017-03-13 12:36:56
2 92 925 90 1 cash 2017-03-03 17:38:51
3 44 861 144 1 cash 2017-03-14 04:23:55
4 18 935 156 1 credit_card 2017-03-26 12:43:36
5 18 883 156 1 credit_card 2017-03-01 04:35:10
6 58 882 138 1 credit_card 2017-03-14 15:25:00


Let’s also quickly confirm that the average order value (AOV) cited earlier is correct.

print(paste0("The AOV is $", mean(order_amount), "."))
## [1] "The AOV is $3145.128."


The AOV is $3145.13 (rounded to the nearest cent). It looks like everything checks out!


Part 1a: Evaluate data


“Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.”


TL;DR: The data contain outliers, some of them extreme, so the arithmetic mean does not represent the center of the distribution of order_amount. The geometric mean more accurately describes the center of these data.

Two key approaches should be considered when evaluating a data set:

  1. summary statistics, and
  2. visualization.

Using these approaches, I make several observations below.


Summary statistics

TL;DR: The AOV (a.k.a. arithmetic mean) misrepresents the “center” of the distribution for order_amount, as it lies far beyond the 3rd quartile. We should consider other measures of central tendency, such as the median, mode, geometric mean, and harmonic mean.

First, let’s take a look at some summary statistics. We’ll use a custom function called summ_stats() on order_amount to produce the table below.

summarystats <- summ_stats(order_amount)

kbl(summarystats, 
      digits = 2,
      align = "lr", 
      caption = "Table 2: Summary of Order Value") %>%
  kable_styling("hover", full_width = FALSE) %>%
  row_spec(c(3,4,7:9), bold = TRUE) %>%
  column_spec(c(1,2), width = "50%") %>%
  scroll_box(
    width = "100%",
    box_css = "border: 0px solid #ddd; padding: 5px; ")
Table 2: Summary of Order Value
Statistic Value
Minimum 90.00
1st Quartile 163.00
Median 284.00
Arithmetic Mean 3145.13
3rd Quartile 390.00
Maximum 704000.00
Mode 153.00
Geometric Mean 285.02
Harmonic Mean 235.32


Bold rows provide the key metrics we’re most interested in. That said, let’s first take a look at some of the additional information provided.

The table shows that order_amount contains a maximum value of 704000.00 — far greater than the AOV of 3145.13 (herein described as the “arithmetic mean”). It’s possible that the arithmetic mean is being pulled up by at least one extremely large value in the data.

Also, the 3rd quartile is 390.00, which is far less than the arithmetic mean. This means that at least 75% of the data is below the arithmetic mean of order_amount.

Back to the key metrics: arithmetic mean, median, mode, geometric mean, and harmonic mean. These capture different notions of the “center” of the data. We’ll describe them in greater detail in Part 1b. For now, just note that the median (284.00), mode (153.00), geometric mean (285.02), and harmonic mean (235.32) are all less than the arithmetic mean (3145.13).

Together, these suggest that the distribution of order_amount is heavily right-skewed. We will verify this intuition when we visualize the data in the next section.
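The pull that a handful of huge orders exerts on the arithmetic mean can be illustrated with a toy example (hypothetical values, not the challenge data):

```r
# Toy illustration (hypothetical values): one bulk order drags the
# arithmetic mean far above the typical order, while the median and
# geometric mean barely move.
orders <- c(rep(150, 99), 704000)  # 99 typical orders plus one bulk order

mean(orders)            # arithmetic mean: 7188.5, inflated by one order
median(orders)          # median: 150, unaffected
exp(mean(log(orders)))  # geometric mean: ~163, only slightly pulled up
```

This is the same mechanism at work in order_amount, just at a smaller scale.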


Visualization

TL;DR: Visualizing the data confirms that the arithmetic mean is far from the center of the distribution of order_amount. Other measures of central tendency more appropriately characterize the center.

Let’s make a visualization of the distribution of order_amount. Using the ggplot2 and plotly packages, we can create an interactive histogram to examine the data.

This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements.

binwidth1 <- 10000

p1 <-
  ggplot(data = shopifydata, aes(x = order_amount)) +
  geom_histogram(
    aes(color = "Histogram"),
    fill = "springgreen4",
    size = 0,
    binwidth = binwidth1) +
  geom_density(
    # A density curve integrates to 1, so its y-axis is on a different
    # scale from the histogram's counts. Rescaling the density onto the
    # count scale lets both display on the same y-axis.
    aes(y = ..density.. * nrow(shopifydata) * binwidth1 / 100, color = "Density Plot"),
    fill = "springgreen4",
    size = 0,
    alpha = 0.3) +
  labs(x = "Order Value ($)", y = "Count") +
  scale_x_continuous(
    labels = function(x)
      format(x, scientific = FALSE)
  ) +
  scale_color_manual(
    values = c(
      "Histogram" = "springgreen4",
      "Density Plot" = "springgreen4"
    )
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p1) %>%
  layout(title = list(text = paste0(
    '<span style = "font-size: 15px;">',
    "<b>Figure 1: Histogram of Order Value</b>",
    "</span>")),
    legend = list(
      orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))


It’s clear that order_amount has several extreme outliers at 704000, as well as outliers between 20000 and 200000. While these values are few in number, they are far greater than any of the key metrics we considered in the previous section.

Moreover, we can visually confirm the extreme right skewness of order_amount. Our earlier intuition was correct.

Lastly, let’s take a look at order_amount after we remove any values over 20000. We’ll also plot the key metrics directly on the graph.

This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements. Values of key metrics are summarized in Table 6.

shopifydata_under20000 <- shopifydata %>% filter(order_amount < 20000)
binwidth2 <- 25

p2 <-
  ggplot(data = shopifydata_under20000, aes(x = order_amount)) +
  geom_histogram(
    aes(color = "Histogram"),
    fill = "springgreen4",
    size = 0,
    binwidth = binwidth2) +
  geom_density(
    # A density curve integrates to 1, so its y-axis is on a different
    # scale from the histogram's counts. Rescaling the density onto the
    # count scale lets both display on the same y-axis.
    aes(y = ..density.. * nrow(shopifydata_under20000) * binwidth2, color = "Density Plot"),
    fill = "springgreen4",
    size = 0,
    alpha = 0.3
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
    linetype = "longdash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
    linetype = "dotdash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
    linetype = "dotted",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
    linetype = "twodash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
    linetype = "dashed",
    size = 0.25,
  ) +
  labs(
    x = "Order Value ($)",
    y = "Count"
  ) +
  scale_x_continuous(
    labels = function(x)
      format(x, scientific = FALSE),
    guide = guide_legend()
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Histogram" = "springgreen4",
      "Density Plot" = "springgreen4",
      "Arithmetic Mean" = "red",
      "Median" = "blue",
      "Mode" = "orange",
      "Geometric Mean" = "green",
      "Harmonic Mean" = "grey"
    )
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p2) %>%
  layout(title = list(text = paste0(
    '<span style = "font-size: 15px;">',
    "<b>Figure 2: Histogram of Order Value</b>",
    "</span>",
    "<br>",
    '<span style = "font-size: 14px;">',
    "<sup>Showing only orders under $20000</sup>",
    "</span>")),
    legend = list(
      orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))


Even after removing order values over 20000, the distribution still appears right-skewed. Moreover, it appears multi-modal, with peaks spaced at intervals of roughly 150 to 200 that shrink as order value increases.


Observations

TL;DR: Extreme outliers might be orders from a wholesaler. Outliers might be orders from luxury sneaker shops. The data appear multi-modal, likely because sneakers are purchased in whole-number quantities.

Knowing that these data represent the value of orders from sneaker shops, I suggest several hypotheses and opportunities for further investigation.

Extreme Outliers

The extreme outliers at order_amount == 704000 could represent bulk orders from a sneaker wholesaler to a retailer. This inference is supported by the following observations:

  • all of these orders were purchased from one shop (shop_id == 42);
  • one user (user_id == 607) was responsible for these orders; and
  • the order quantity for each order was large (total_items == 2000).
# Select the extreme outliers only
shopifydata %>% 
  filter(order_amount > 500000) %>%
  kbl(., caption = "Table 3: Extreme Outliers") %>%
  kable_styling("hover", fixed_thead = TRUE) %>%
  scroll_box(height = "250px")
Table 3: Extreme Outliers
order_id shop_id user_id order_amount total_items payment_method created_at
16 42 607 704000 2000 credit_card 2017-03-07 04:00:00
61 42 607 704000 2000 credit_card 2017-03-04 04:00:00
521 42 607 704000 2000 credit_card 2017-03-02 04:00:00
1105 42 607 704000 2000 credit_card 2017-03-24 04:00:00
1363 42 607 704000 2000 credit_card 2017-03-15 04:00:00
1437 42 607 704000 2000 credit_card 2017-03-11 04:00:00
1563 42 607 704000 2000 credit_card 2017-03-19 04:00:00
1603 42 607 704000 2000 credit_card 2017-03-17 04:00:00
2154 42 607 704000 2000 credit_card 2017-03-12 04:00:00
2298 42 607 704000 2000 credit_card 2017-03-07 04:00:00
2836 42 607 704000 2000 credit_card 2017-03-28 04:00:00
2970 42 607 704000 2000 credit_card 2017-03-28 04:00:00
3333 42 607 704000 2000 credit_card 2017-03-24 04:00:00
4057 42 607 704000 2000 credit_card 2017-03-28 04:00:00
4647 42 607 704000 2000 credit_card 2017-03-02 04:00:00
4869 42 607 704000 2000 credit_card 2017-03-22 04:00:00
4883 42 607 704000 2000 credit_card 2017-03-25 04:00:00



Outliers

The outliers between 20000 and 200000 could represent orders from a luxury sneaker retailer. This inference is supported by the following observations:

  • all of these orders were purchased from one shop (shop_id == 78);
  • the order quantity for each order was small (total_items between 1 and 6); and
  • the unit cost for each order was large and constant (unit_cost == 25725).
# Select the outliers and compute the unit cost for each row
shopifydata %>% 
  filter(order_amount > 20000 & order_amount < 500000) %>%
  mutate(unit_cost = order_amount / total_items) %>%
  kbl(.,
      caption = "Table 4: Outliers") %>%
  kable_styling("hover", fixed_thead = TRUE) %>%
  scroll_box(height = "250px")
Table 4: Outliers
order_id shop_id user_id order_amount total_items payment_method created_at unit_cost
161 78 990 25725 1 credit_card 2017-03-12 05:56:56 25725
491 78 936 51450 2 debit 2017-03-26 17:08:18 25725
494 78 983 51450 2 cash 2017-03-16 21:39:35 25725
512 78 967 51450 2 cash 2017-03-09 07:23:13 25725
618 78 760 51450 2 cash 2017-03-18 11:18:41 25725
692 78 878 154350 6 debit 2017-03-27 22:51:43 25725
1057 78 800 25725 1 debit 2017-03-15 10:16:44 25725
1194 78 944 25725 1 debit 2017-03-16 16:38:25 25725
1205 78 970 25725 1 credit_card 2017-03-17 22:32:21 25725
1260 78 775 77175 3 credit_card 2017-03-27 09:27:19 25725
1385 78 867 25725 1 cash 2017-03-17 16:38:06 25725
1420 78 912 25725 1 cash 2017-03-30 12:23:42 25725
1453 78 812 25725 1 credit_card 2017-03-17 18:09:54 25725
1530 78 810 51450 2 cash 2017-03-29 07:12:01 25725
2271 78 855 25725 1 credit_card 2017-03-14 23:58:21 25725
2453 78 709 51450 2 cash 2017-03-27 11:04:04 25725
2493 78 834 102900 4 debit 2017-03-04 04:37:33 25725
2496 78 707 51450 2 cash 2017-03-26 04:38:52 25725
2513 78 935 51450 2 debit 2017-03-18 18:57:13 25725
2549 78 861 25725 1 cash 2017-03-17 19:35:59 25725
2565 78 915 77175 3 debit 2017-03-25 01:19:35 25725
2691 78 962 77175 3 debit 2017-03-22 07:33:25 25725
2774 78 890 25725 1 cash 2017-03-26 10:36:43 25725
2819 78 869 51450 2 debit 2017-03-17 06:25:50 25725
2822 78 814 51450 2 cash 2017-03-02 17:13:25 25725
2907 78 817 77175 3 debit 2017-03-16 03:45:46 25725
2923 78 740 25725 1 debit 2017-03-12 20:10:58 25725
3086 78 910 25725 1 cash 2017-03-26 01:59:26 25725
3102 78 855 51450 2 credit_card 2017-03-21 05:10:34 25725
3152 78 745 25725 1 credit_card 2017-03-18 13:13:07 25725
3168 78 927 51450 2 cash 2017-03-12 12:23:07 25725
3404 78 928 77175 3 debit 2017-03-16 09:45:04 25725
3441 78 982 25725 1 debit 2017-03-19 19:02:53 25725
3706 78 828 51450 2 credit_card 2017-03-14 20:43:14 25725
3725 78 766 77175 3 credit_card 2017-03-16 14:13:25 25725
3781 78 889 25725 1 cash 2017-03-11 21:14:49 25725
4041 78 852 25725 1 cash 2017-03-02 14:31:11 25725
4080 78 946 51450 2 cash 2017-03-20 21:13:59 25725
4193 78 787 77175 3 credit_card 2017-03-18 09:25:31 25725
4312 78 960 51450 2 debit 2017-03-01 03:02:10 25725
4413 78 756 51450 2 debit 2017-03-02 04:13:38 25725
4421 78 969 77175 3 debit 2017-03-09 15:21:34 25725
4506 78 866 25725 1 debit 2017-03-22 22:06:00 25725
4585 78 997 25725 1 cash 2017-03-25 21:48:43 25725
4716 78 818 77175 3 debit 2017-03-05 05:10:43 25725
4919 78 823 25725 1 cash 2017-03-15 13:26:46 25725



Multi-Modality

The multi-modal distribution, especially when examining the data with outliers excluded, could emerge due to the discreteness of the underlying data structure.

This discreteness can be attributed to the typical unit cost of sneakers, about 150 (note that all key metrics take approximately this value in the table below), combined with the fact that sneakers are purchased in whole-number quantities (i.e., total_items is always a positive integer).

# Exclude the outliers and compute the unit cost, then summarize in a table
shopifydata %>% 
  filter(order_amount < 20000) %>%
  mutate(unit_cost = order_amount / total_items) %>%
  pull(unit_cost) %>%
  summ_stats() %>%
  kbl(., 
      digits = 2,
      align = "lr", 
      caption = "Table 5: Summary of Unit Cost Excluding Outliers") %>%
  kable_styling("hover", full_width = FALSE) %>%
  row_spec(c(3,4,7:9), bold = TRUE) %>%
  column_spec(c(1,2), width = "50%") %>%
  scroll_box(
    width = "100%",
    box_css = "border: 0px solid #ddd; padding: 5px; ")
Table 5: Summary of Unit Cost Excluding Outliers
Statistic Value
Minimum 90.00
1st Quartile 132.00
Median 153.00
Arithmetic Mean 151.79
3rd Quartile 166.00
Maximum 352.00
Mode 153.00
Geometric Mean 149.31
Harmonic Mean 146.95
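The peak spacing follows directly from that arithmetic: an order value is roughly a ~$150 unit cost times an integer quantity, so values cluster near multiples of 150 and higher multiples occur less often. A minimal simulation (hypothetical unit costs and quantity mix, not the challenge data) reproduces this pattern:

```r
# Toy sketch: unit costs near $150 times small integer quantities
# yield clusters near 150, 300, 450, ... that shrink in size.
set.seed(1)
unit_cost <- rnorm(1000, mean = 150, sd = 10)
quantity  <- sample(1:4, 1000, replace = TRUE,
                    prob = c(0.5, 0.3, 0.15, 0.05))  # mostly small orders
order_value <- unit_cost * quantity

# Each cluster sits near a multiple of 150; counts fall as quantity grows
table(round(order_value / 150))
```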



Part 1b: Metric to report


What metric would you report for this dataset?


TL;DR: The most appropriate metric to report is the geometric mean, as it accurately represents the center of the distribution while accommodating the presence of outliers. The median may be easier to conceptualize but has certain drawbacks. Practically speaking, both values are nearly the same for these data.

As described earlier in Table 2, several key metrics can be considered. These metrics represent different ways of measuring the “center” of a dataset. The most appropriate metric to report depends on the underlying characteristics of the data.

For convenience, key metrics have been reprinted below. The same colours used in Figure 2 are applied to highlight the table below.

keymetrics <- summarystats[c(4,3,7:9),]
row.names(keymetrics) <- NULL

kable(keymetrics, 
      digits = 2,
      align = "lr", 
      caption = "Table 6: Key Metrics") %>%
  kable_styling("hover", full_width = FALSE) %>%
  row_spec(1, italic = TRUE, color = "red") %>%
  row_spec(2, color = "blue") %>%
  row_spec(3, color = "orange") %>%
  row_spec(4, bold = TRUE, color = "lime", background = "black") %>%
  row_spec(5, color = "grey") %>%
  column_spec(c(1,2), width = "50%") %>%
  scroll_box(
    width = "100%",
    box_css = "border: 0px solid #ddd; padding: 5px; ")
Table 6: Key Metrics
Statistic Value
Arithmetic Mean 3145.13
Median 284.00
Mode 153.00
Geometric Mean 285.02
Harmonic Mean 235.32

Of these metrics, I argue that the geometric mean is the most appropriate. As seen below, the geometric mean lies at the approximate “center” of the distribution.

This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements.

p3 <- p2 +
  coord_cartesian(xlim = c(0,1000))

# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p3) %>%
  layout(title = list(text = paste0(
    '<span style = "font-size: 15px;">',
    "<b>Figure 3: Histogram of Order Value</b>",
    "</span>",
    "<br>",
    '<span style = "font-size: 14px;">',
    "<sup>Showing only orders under $20000 | Zoomed in from $0 to $1000</sup>",
    "</span>")),
    legend = list(
      orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))


While this metric is often used for data with multiplicative relationships (e.g., interest rates or ratios), it is also useful when data do not exist on the same scale or have large outliers. In other words, the geometric mean is less susceptible to skew effects than other metrics. The geometric mean also makes full use of the available data.
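The multiplicative case is worth a quick illustration (hypothetical growth factors, not the challenge data):

```r
# Hypothetical yearly growth factors: +50%, -40%, +20%
growth <- c(1.5, 0.6, 1.2)

# The arithmetic mean (1.1) overstates the average growth
mean(growth)

# The geometric mean is the constant factor that, applied each year,
# reproduces the same total growth
gm <- prod(growth)^(1 / length(growth))
gm                   # ~1.026
prod(growth)         # 1.08: total growth over three years
gm^length(growth)    # also 1.08, confirming the equivalence
```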

With that said, the geometric mean could be unfamiliar to lay audiences. If familiarity and comprehensibility are the priority, then the median may be more appropriate. In this particular case, the geometric mean and median are extremely close, and both metrics appear to lie at the “center” of the distribution. Either metric might be acceptable.

Nevertheless, both the geometric mean and the median have drawbacks. Although not a concern for this dataset, note that the geometric mean breaks down when the data contain zeros or negative values.
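The zero problem can be made concrete with a one-line check (toy vector, not the challenge data):

```r
# A single zero sends log() to -Inf, collapsing the geometric mean to 0
# regardless of the other values
x <- c(0, 150, 300)
exp(mean(log(x)))  # returns 0, because mean(log(x)) is -Inf
```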

Meanwhile, the median (and mode) ignore parts of the data. For example, say a particular sneaker that cost $150 became a viral phenomenon in the next month of data (i.e., April 2017), accounting for the lowest 51% of orders. Meanwhile, the remaining 49% of orders cost anywhere from $151 to $704000. This could make the median misleading to report as it would be far from the “center” of the distribution. Since the median only considers the “middle” of the data, it remains susceptible to skew effects.
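The viral-sneaker scenario can be sketched with simulated values (purely hypothetical, not the April data):

```r
# Hypothetical April: 51% of 1000 orders at the $150 viral price,
# the remaining 49% spread from $151 up to $704000
set.seed(42)
viral  <- rep(150, 510)
others <- round(runif(490, min = 151, max = 704000))
april  <- c(viral, others)

median(april)          # 150: pinned to the viral price
exp(mean(log(april)))  # geometric mean lands well above 150,
                       # reflecting the upper 49% of orders
```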

In any case, reporting the arithmetic mean, mode, or harmonic mean as the “center” of the dataset would be inappropriate. As already seen, the arithmetic mean is extremely susceptible to outliers. The mode does not characterize the “center” of the data, but instead focuses on the most frequent values. Finally, the harmonic mean is more appropriate for averaging rates or ratios and not (raw) values like order value.
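The rates point can be seen in the classic average-speed example (hypothetical trip):

```r
# Hypothetical trip: 60 km at 30 km/h, then 60 km at 60 km/h.
# The true average speed is total distance / total time.
speeds <- c(30, 60)

1 / mean(1 / speeds)  # harmonic mean: 40 km/h, the true average speed
mean(speeds)          # arithmetic mean: 45 km/h, which overstates it
```

Order values are dollar amounts, not rates, so the harmonic mean has no such interpretation here.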


Part 1c: Value of metric


What is [the metric’s] value?


The most appropriate metric to report is the geometric mean. Its value is $285.02.

If familiarity and comprehensibility are the priority, the median could also be reported, though it has drawbacks as described earlier.


Question 2: SQL for Northwind

Prompt

  • For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

    1. How many orders were shipped by Speedy Express in total?
    2. What is the last name of the employee with the most orders?
    3. What product was ordered the most by customers in Germany?

Part 2a: Speedy Express orders


How many orders were shipped by Speedy Express in total?


Speedy Express shipped 54 orders in total.


SELECT
  COUNT(*) AS TotalOrders
FROM
  Orders
INNER JOIN
  Shippers ON Orders.ShipperID = Shippers.ShipperID
WHERE
  ShipperName = 'Speedy Express';

Part 2b: Employee last name


What is the last name of the employee with the most orders?


The last name of the employee with the most orders is Peacock.


--  2b.1 Using MAX() and INNER JOIN
/*
NB: 2b.1 is faster than 2b.2 but is less useful if the intention is to rank
    employees by number of orders. Selecting LastName alongside
    MAX(TotalOrders) without a GROUP BY relies on SQLite's bare-column
    behaviour; it is not portable standard SQL.
*/
SELECT
  LastName
FROM (
  SELECT
    LastName,
    MAX(TotalOrders) AS MaxTotalOrders
  FROM (
    SELECT
      Employees.LastName,
      COUNT(*) AS TotalOrders
    FROM
      Employees
    INNER JOIN
      Orders ON Employees.EmployeeID = Orders.EmployeeID
    GROUP BY
      Employees.EmployeeID
    )
  );


-- 2b.2 Using ORDER BY and LIMIT
/*
NB: 2b.2 is slower than 2b.1 but makes it easier to get a table of employees
    ranked by number of orders.
*/
SELECT
  LastName
FROM (
  SELECT
    Employees.LastName,
    COUNT(*) AS TotalOrders
  FROM
    Employees
  INNER JOIN
    Orders ON Employees.EmployeeID = Orders.EmployeeID
  GROUP BY
    Employees.EmployeeID
  ORDER BY
    TotalOrders DESC
  )
LIMIT 1;

Part 2c: Product most ordered


What product was ordered the most by customers in Germany?


The product most ordered by customers in Germany was Boston Crab Meat.


--  2c.1 Using MAX() and INNER JOIN
/*
NB: 2c.1 is faster than 2c.2 but is less useful if the intention is to rank
    the most popular products among German customers. As in 2b.1, selecting
    ProductName alongside MAX() relies on SQLite-specific behaviour.
*/
SELECT
  ProductName
FROM (
  SELECT
    ProductName,
    MAX(SumQty) AS MaxSumQty
  FROM (
    SELECT
      Products.ProductName,
      SUM(OrderDetails.Quantity) AS SumQty
    FROM
      Products
    INNER JOIN
      OrderDetails ON Products.ProductID = OrderDetails.ProductID
    INNER JOIN
      Orders ON OrderDetails.OrderID = Orders.OrderID
    INNER JOIN
      Customers ON Orders.CustomerID = Customers.CustomerID
    WHERE
      Customers.Country = 'Germany'
    GROUP BY
      OrderDetails.ProductID
    )
  );

-- 2c.2 Using ORDER BY and LIMIT
/*
NB: 2c.2 is slower than 2c.1 but makes it easier to get a table of the most
    popular products among German customers.
*/
SELECT
  ProductName
FROM (
  SELECT
    Products.ProductName,
    SUM(OrderDetails.Quantity) AS SumQty
  FROM
    Products
  INNER JOIN
    OrderDetails ON Products.ProductID = OrderDetails.ProductID
  INNER JOIN
    Orders ON OrderDetails.OrderID = Orders.OrderID
  INNER JOIN
    Customers ON Orders.CustomerID = Customers.CustomerID
  WHERE
    Customers.Country = 'Germany'
  GROUP BY
    OrderDetails.ProductID
  ORDER BY
    SumQty DESC
  )
LIMIT 1;

Appendix


./R/functions.R


cat ./R/functions.R
# File name: functions.R
# Path: './R/functions.R'

# Author: N. Chan
# Purpose: Script for housing custom functions required for project.

# NB: Strictly speaking, this is a project and not a package. That said, using
# the Roxygen2 documentation style for writing functions is helpful for keeping
# information about custom functions organized.


#' @title Check, install, and load required packages
#'
#' @description Automated method for checking, installing, and loading a list
#' of packages provided by the user. Function asks the user before installing.
#'
#' @param ... A list or vector containing the names of packages as strings to
#' be loaded and installed.
#'
#' @return None
#'
#' @examples
#' \dontrun{
#' pkgs <- c("ggplot2", "tidyverse")
#' using(pkgs)
#' }
#'
#' @author N. Chan
#' @export
#'

using <- function(...) {
  libs <- unlist(list(...))
  req <- suppressWarnings(unlist(lapply(libs, require, character.only = TRUE)))
  need <- libs[req == FALSE]
  n <- length(need)
  
  if (n > 0) {
    libsmsg <-
      if (n > 2) {
        paste(paste(need[1:(n - 1)], collapse = ", "), ",", sep = "")
      } else
        need[1]
    
    if (n > 1) {
      libsmsg <- paste(libsmsg, " and ", need[n], sep = "")
    }
    
    libsmsg <-
      paste(
        "The following packages could not be found: ",
        libsmsg,
        "\n\r\n\rInstall missing packages?",
        collapse = ""
      )
    
    # Checks if R is in interactive mode. If yes, then prompt user for
    # interactive response. If no, prompt user for input from stdin.
    if (interactive()) {
      if (!(askYesNo(libsmsg, default = FALSE) %in% c(NA, FALSE))) {
        install.packages(need)
        lapply(need, require, character.only = TRUE)
      } else {
        stop("required packages were not installed or loaded")
      }
      
    } else {
      cat(libsmsg, "(yes/No/cancel) ")
      response <- readLines("stdin", n = 1)
      input <- pmatch(tolower(response), c("yes", "no", "cancel"))
      
      if (!nchar(response) | input %in% c(2, 3)) {
        stop("required packages were not installed or loaded")
      } else if (is.na(input)) {
        stop("Unrecognized response ", dQuote(response))
      } else {
        install.packages(need)
        lapply(need, require, character.only = TRUE)
      }
    }
  
  }
  
  return(invisible(NULL))
}


user_input <- function(prompt) {
  # Prompt the user for one line of input, whether running interactively
  # (e.g., in RStudio) or non-interactively via Rscript.
  if (interactive()) {
    return(readline(prompt))
  } else {
    cat(prompt)
    return(readLines("stdin", n = 1))
  }
}


#' @title Compute the statistical mode of an R object
#'
#' @description Computes the statistical mode of an R object.
#'
#' @param x An R object that can be interpreted as factors.
#'
#' @return A table with the name of the most frequent element and its respective
#'   count.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' mode_stat(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'

mode_stat <- function(x) {
  freqtable <- table(x)
  return(freqtable[which.max(freqtable)])
}


#' @title Compute the geometric mean of a numeric object
#'
#' @description Computes the geometric mean of a numeric object.
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A double representing the geometric mean.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' geo_mean(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'

geo_mean <- function(x) {
  # Taking the mean of the natural log then raising e to the mean is more 
  # computationally efficient than taking the product of all elements of x
  # and taking the n-th root.
  gm <- exp(mean(log(x)))
  return(gm)
}


#' @title Compute the harmonic mean of a numeric object
#'
#' @description Computes the harmonic mean of a numeric object.
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A double representing the harmonic mean.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' har_mean(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'

har_mean <- function(x) {
  hm <- 1/(mean(1/x))
  return(hm)
}

#' @title Provides summary statistics 
#'
#' @description Wrapper for summary(), mode_stat(), geo_mean(), and har_mean()
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A data frame containing two columns (name of statistic and value)
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' summ_stats(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'

summ_stats <- function(x) {
  
  if(!is.numeric(x)) {
    stop("x must be a numeric object.")
  }
  
  summ <- summary(x)
  mo <- as.numeric(names(mode_stat(x)))
  gm <- geo_mean(x)
  hm <- har_mean(x)
  
  out <- data.frame(
    Statistic = c(names(summ), "Mode", "Geometric Mean", "Harmonic Mean"),
    Value = c(as.vector(summ), mo, gm, hm)
  )
  
  out[which(out[,1] == "Min."), 1] <- "Minimum"
  out[which(out[,1] == "1st Qu."), 1] <- "1st Quartile"
  out[which(out[,1] == "Mean"), 1] <- "Arithmetic Mean"
  out[which(out[,1] == "3rd Qu."), 1] <- "3rd Quartile"
  out[which(out[,1] == "Max."), 1] <- "Maximum"
  
  return(out)
}

./scripts/00_init.R


cat ./scripts/00_init.R
# File name: 00_init.R
# Path: './scripts/00_init.R'

# Author: N. Chan
# Purpose: Initializes project and installs required dependencies.

# Load up custom functions
source(paste0(getwd(), "/R/functions.R"))
message("./R/functions.R is loaded.")

# List of required packages for analysis
required_packages <-
  c(
    "knitr",
    "kableExtra",
    #"fansi",
    "tidyverse",
    "readxl",
    "plotly",
    "htmlwidgets",
    "shiny")

# Check, install, and load required packages
using(required_packages)

message("./scripts/00_init.R was executed.")

./scripts/01_loaddata.R


cat ./scripts/01_loaddata.R
# File name: 01_loaddata.R
# Path: './scripts/01_loaddata.R'

# Author: N. Chan
# Purpose: Loads the 2019 Winter Data Science Intern Challenge Data Set

# Load and install missing packages
source(paste0(getwd(), "/scripts/00_init.R"))

# Read in data
if (exists("xlsxfile")) {
  datafile <- xlsxfile
  message(paste0("Using user-provided xlsxfile: ", datafile))
} else {
  datafile <- paste0(
    getwd(),
    "/data/raw_data/2019 Winter Data Science Intern Challenge Data Set.xlsx"
    )
  message(paste0("Using default xlsxfile: ", datafile))
}

shopifydata <- read_excel(datafile)

message("./scripts/01_loaddata.R was executed.")

./scripts/02_analyze.R

NB: Script not used in report. Provided as an option for custom analysis.


cat ./scripts/02_analyze.R
# File name: 02_analyze.R
# Path: './scripts/02_analyze.R'

# Author: N. Chan
# Purpose: Performs analysis per questions provided in challenge

# Prompt:
# Question 1: On Shopify, we have exactly 100 sneaker shops, and each of these
# shops sells only one model of shoe. We want to do some analysis of the average
# order value (AOV). When we look at orders data over a 30 day window, we
# naively calculate an AOV of $3145.13. Given that we know these shops are
# selling sneakers, a relatively affordable item, something seems wrong with our
# analysis.
# 
# 1a. Think about what could be going wrong with our calculation. Think about a
# better way to evaluate this data. 
# 1b. What metric would you report for this dataset? 
# 1c. What is its value?


# Load up packages and data required for analysis
source(paste0(getwd(), "/scripts/01_loaddata.R"))
attach(shopifydata)

if (exists("voi")) {
  var_interest <- shopifydata %>% pull(voi)
  voi_char <- as.character(voi)
  message(paste0("Using user-provided variable of interest: ", voi))
} else {
  var_interest <- order_amount
  voi_char <- "order_amount"
  message(paste0("Using default variable of interest: order_amount"))
}

if (!(is.numeric(var_interest))) {
  stop("The variable of interest is not numeric. Only numeric variables are supported.")
}

if (exists("binsize")) {
  binwidth <- as.numeric(binsize)
  binwidth_density <- as.numeric(binsize)
  message(paste0("Using user-provided bin width: ", binsize))
} else {
  binwidth <- 10000
  binwidth_density <- binwidth/100
  message(paste0("Using default bin width: ", binwidth))
}

# Get summary statistics for var_interest
summarystats <- summ_stats(var_interest)

# Produce plot of var_interest
p <-
  ggplot(data = shopifydata, aes(x = var_interest)) +
  geom_histogram(
    aes(color = "Histogram"),
    fill = "springgreen4",
    size = 0,
    binwidth = binwidth) +
  geom_density(
    # A density curve integrates to 1, so its y-values are not on the count
    # scale of the histogram. Multiplying the density by n * binwidth puts
    # both layers on the same y-axis so they can share one plot.
    aes(y = ..density.. * nrow(shopifydata) * binwidth_density, color = "Density Plot"),
    fill = "springgreen4",
    size = 0,
    alpha = 0.3
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
    linetype = "longdash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
    linetype = "dotdash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
    linetype = "dotted",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
    linetype = "twodash",
    size = 0.25,
  ) +
  geom_vline(
    aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
    linetype = "dashed",
    size = 0.25,
  ) +
  labs(
    x = "Value",
    y = "Count"
  ) +
  scale_x_continuous(
    labels = function(x)
      format(x, scientific = FALSE),
    guide = guide_legend()
  ) +
  scale_color_manual(
    name = "Key Metrics",
    values = c(
      "Histogram" = "springgreen4",
      "Density Plot" = "springgreen4",
      "Arithmetic Mean" = "red",
      "Median" = "blue",
      "Mode" = "orange",
      "Geometric Mean" = "green",
      "Harmonic Mean" = "grey"
    )
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
p_out <- 
  ggplotly(p) %>%
  layout(title = list(text = paste0(
    '<span style = "font-size: 15px;">',
    "<b>Histogram of ",
    voi_char,
    "</b>",
    "</span>")))

# Check if ./output exists and if not, creates it
if (!(dir.exists(paste0(getwd(), "/output")))) {
  dir.create(paste0(getwd(), "/output"))
}

# Save out the key metrics to csv and plot to HTML file
write.csv(summarystats, 
          file = paste0(
            getwd(),
            "/output/summarystats.csv"))
saveWidget(p_out,
           file = paste0(
             getwd(),
             "/output/plot.html"))

# Uncomment the line below to delete dependency files created by saveWidget()
# Not actually implemented as this should be fixed in htmlwidgets itself...

# unlink(paste0(getwd(), "/output/plot_files"), recursive = TRUE)

message("./scripts/02_analyze.R was executed.")
message(paste0("Outputs saved to ", getwd(), "/output"))
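The density-scaling workaround in the plotting code above relies on the fact that an estimated density integrates to 1 over x, so multiplying it by n * binwidth lands it on the same y-axis as the histogram counts. A quick numpy check of that identity on simulated skewed data (not the challenge dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=5.0, sigma=1.0, size=1000)  # skewed, like order amounts

binwidth = 50.0
bins = np.arange(0.0, x.max() + binwidth, binwidth)
counts, _ = np.histogram(x, bins=bins)

# With uniform bins, density * n * binwidth equals the bin counts exactly
density, _ = np.histogram(x, bins=bins, density=True)
assert np.allclose(density * x.size * binwidth, counts)
```

In the R scripts the same trick appears as `..density.. * nrow(data) * binwidth` inside `geom_density()`.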

./scripts/03_question2.sql

NB: Script not used in report.


cat ./scripts/03_question2.sql
/*

Challenge: https://docs.google.com/document/d/13VCtoyto9X1PZ74nPI4ZEDdb8hF8LAlcmLH1ZTHxKxE/edit#
SQL Dataset: https://www.w3schools.com/SQL/TRYSQL.ASP?FILENAME=TRYSQL_SELECT_ALL

*/

-- Question 2:


-- 2a. How many orders were shipped by Speedy Express in total?

-- 2a.1 Using nested SELECT
SELECT
    COUNT(*) AS TotalOrders
FROM
    Orders
WHERE
    ShipperID = (
        SELECT ShipperID
        FROM Shippers
        WHERE ShipperName = 'Speedy Express'
    );

-- 2a.2 Using INNER JOIN
SELECT
    COUNT(*) AS TotalOrders
FROM
    Orders
INNER JOIN
    Shippers ON Orders.ShipperID = Shippers.ShipperID
WHERE
    ShipperName = 'Speedy Express';


-- 2b. What is the last name of the employee with the most orders?

-- 2b.1 Using nested SELECT
SELECT
    LastName
FROM
    Employees
WHERE
    EmployeeID = (
        SELECT
            EmployeeID
        FROM (
            SELECT
                MAX(TotalOrders),
                EmployeeID
            FROM (
                SELECT
                    COUNT(*) AS TotalOrders,
                    EmployeeID
                FROM
                    Orders
                GROUP BY
                    EmployeeID
            )
        )
    );

--  2b.2 Using MAX() and INNER JOIN
/*
NB (1): The inner query of 2b.2 computes an extra column (MaxTotalOrders),
        which the outer SELECT drops.
NB (2): 2b.2 is faster than 2b.3 but is less useful if the intention is to rank 
        employees by number of orders.
*/
SELECT
    LastName
FROM (
    SELECT
        LastName,
        MAX(TotalOrders) AS MaxTotalOrders
    FROM (
        SELECT
            Employees.LastName,
            COUNT(*) AS TotalOrders
        FROM
            Employees
        INNER JOIN
            Orders ON Employees.EmployeeID = Orders.EmployeeID
        GROUP BY
            Employees.EmployeeID
        )
    );


-- 2b.3 Using ORDER BY and LIMIT
/*
NB (1): 2b.3 provides the exact answer requested.
NB (2): 2b.3 is slower than 2b.2 but makes it easier to get a table of employees
        ranked by number of orders.
*/
SELECT
    LastName
FROM (
    SELECT
        Employees.LastName,
        COUNT(*) AS TotalOrders
    FROM
        Employees
    INNER JOIN
        Orders ON Employees.EmployeeID = Orders.EmployeeID
    GROUP BY
        Employees.EmployeeID
    ORDER BY
        TotalOrders DESC
    )
LIMIT 1;



-- 2c. What product was ordered the most by customers in Germany?

-- 2c.1 Using nested SELECT
SELECT
    ProductName
FROM
    Products
WHERE
    ProductID = (
        SELECT
            ProductID
        FROM (
            SELECT
                ProductID,
                MAX(SumQty)
            FROM (
                SELECT
                    ProductID,
                    SUM(Quantity) AS SumQty
                FROM
                    OrderDetails
                WHERE
                    OrderID IN (
                        SELECT
                            OrderID
                        FROM
                            Orders
                        WHERE
                            CustomerID IN (
                                SELECT
                                    CustomerID
                                FROM
                                    Customers
                                WHERE
                                    Country = 'Germany'
                            )
                    )
                GROUP BY
                    ProductID
            )
        )
    );

--  2c.2 Using MAX() and INNER JOIN
/*
NB: 2c.2 is faster than 2c.3 but is less useful if the intention is to rank 
    the most popular products among German customers.
*/
SELECT
    ProductName
FROM (
    SELECT
        ProductName,
        MAX(SumQty) AS MaxSumQty
    FROM (
        SELECT
            Products.ProductName,
            SUM(OrderDetails.Quantity) AS SumQty
        FROM
            Products
        INNER JOIN
            OrderDetails ON Products.ProductID = OrderDetails.ProductID
        INNER JOIN
            Orders ON OrderDetails.OrderID = Orders.OrderID
        INNER JOIN
            Customers ON Orders.CustomerID = Customers.CustomerID
        WHERE
            Customers.Country = 'Germany'
        GROUP BY
            OrderDetails.ProductID
        )
    );

-- 2c.3 Using ORDER BY and LIMIT
/*
NB: 2c.3 is slower than 2c.2 but makes it easier to get a table of the most
    popular products among German customers.
*/
SELECT
    ProductName
FROM (
    SELECT
        Products.ProductName,
        SUM(OrderDetails.Quantity) AS SumQty
    FROM
        Products
    INNER JOIN
        OrderDetails ON Products.ProductID = OrderDetails.ProductID
    INNER JOIN
        Orders ON OrderDetails.OrderID = Orders.OrderID
    INNER JOIN
        Customers ON Orders.CustomerID = Customers.CustomerID
    WHERE
        Customers.Country = 'Germany'
    GROUP BY
        OrderDetails.ProductID
    ORDER BY
        SumQty DESC
    )
LIMIT 1;
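The nested-SELECT and INNER JOIN variants above should always return the same answer. A small `sqlite3` sketch demonstrating the equivalence for question 2a (toy rows, table and column names copied from the w3schools schema; not the real dataset):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    INSERT INTO Shippers (ShipperName) VALUES ('Speedy Express'), ('United Package');
    INSERT INTO Orders (ShipperID) VALUES (1), (1), (2), (1), (2);
""")

nested = con.execute("""
    SELECT COUNT(*) FROM Orders
    WHERE ShipperID = (SELECT ShipperID FROM Shippers
                       WHERE ShipperName = 'Speedy Express')
""").fetchone()[0]

joined = con.execute("""
    SELECT COUNT(*) FROM Orders
    INNER JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
    WHERE ShipperName = 'Speedy Express'
""").fetchone()[0]

# Both formulations count the same three Speedy Express orders
assert nested == joined == 3
```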

./run.R

NB: Script not used in report. Provided as an option for custom analysis.


cat ./run.R
# File name: run.R
# Path: './run.R'

# Author: N. Chan
# Purpose: Command line option to run all scripts and analyses
# Usage: Rscript run.R [xlsxfile voi binsize]

# Takes three optional arguments: (1) path to excel file, (2) name of a numeric 
# variable of interest, and (3) size of histogram bins. If using optional 
# arguments, all three must be defined.

# Returns a histogram-density plot of the data depicting several key metrics,
# as well as a csv file containing the names and values of the key metrics.
# Files are saved in "shopify-ds-intern-challenge/output".

# Optional arguments:
#   xlsxfile voi binsize
#       xlsxfile is the path to the excel file.
#       voi is the name of the numeric variable of interest as written in row 1 
#         of the respective column in the excel file.
#       binsize is the size of each bin in the histogram.
#       All three arguments must be defined if using.

cmdargs <- commandArgs(trailingOnly = TRUE)

if (length(cmdargs) == 0) {
  message("No arguments provided. Using defaults.")
} else if (length(cmdargs) != 3) {
  message("Invalid number of optional arguments provided. Using defaults.")
} else {
  xlsxfile <- cmdargs[1]
  voi <- cmdargs[2]
  binsize <- cmdargs[3]
}

source(paste0(getwd(), "/scripts/02_analyze.R"))
cat(paste0("The geometric mean of ", voi_char, " is ", geo_mean(var_interest), ".\n"))
message("run.R has completed.")

./app.R

NB: Script not used in report. Provided as an option for custom analysis.


cat ./app.R
# File name: app.R
# Path: './app.R'

# Author: N. Chan
# Purpose: Runs a Shiny R app to interactively view a histogram from input data

source(paste0(getwd(), "/scripts/01_loaddata.R"))

# Define UI

ui <- fluidPage(
  
  titlePanel("Histogram-Density Plot Viewer"),
  
  sidebarLayout(
    sidebarPanel(
      h5("Prepared by Nathan Chan"),
      h6("for the Shopify Summer 2022 Data Science Intern Challenge"),
      br(),
      helpText("Select your file and specify the settings below."),
      fileInput(inputId = "xlsxfile", "Select Excel File", accept = ".xlsx", multiple = FALSE),
      selectInput("selectedcolumn", "Choose Column (must be numeric)", choices = c("")),
      br(),
      textOutput("displaymax"),
      tags$head(tags$style("#displaymax{color: blue; font-weight: bold;}")),
      textOutput("displaymin"),
      tags$head(tags$style("#displaymin{color: blue; font-weight: bold;}")),
      textOutput("displayrechist"),
      tags$head(tags$style("#displayrechist{color: blue; font-weight: bold;}")),
      br(),
      helpText("Specify plot limits."),
      numericInput("selectedmax", "X-axis Maximum", value = 1),
      numericInput("selectedmin", "X-axis Minimum", value = 0),
      br(),
      helpText("Specify a histogram bin size between the minimum and maximum of the column specified."),
      numericInput("selectedbinsize", "Histogram Bin Size", value = 1),
      helpText("Specify a density plot scale factor between the minimum and maximum of the column specified. It is recommended to start with the same value as the recommended histogram bin size above and adjust from there."),
      numericInput("selecteddensize", "Density Plot Scale Factor", value = 1)
      # sliderInput("selectedbinsize", "Specify Histogram Bin Size", min = 0, max = 10, value = 5, step = 1),
      # sliderInput("selecteddensize", "Specify Density Plot Scale Factor", min = 0, max = 10, value = 5, step = 1),
      # sliderInput("selectedmax", "Specify X-axis Maximum", min = 0, max = 10, value = 5, step = 1),
      # sliderInput("selectedmin", "Specify X-axis Minimum", min = 0, max = 10, value = 5, step = 1)
      
    ),
    mainPanel(
      tabsetPanel(
        tabPanel(
          title = "Data",
          dataTableOutput(outputId = "displaydataTable")
        ),
        tabPanel(
          title = "Metrics",
          dataTableOutput(outputId = "displaymetrics")
        ),
        tabPanel(
          title = "Plot",
          plotlyOutput(outputId = "displayplot",
                       width = "100%",
                       height = "100%")
        )
    ))
    
  )
  
  
)


server <- function(input, output, session) {
  
  xlsxdata <- reactive({
    req(input$xlsxfile)
    read_excel(input$xlsxfile$datapath)
  })
  
  output$displaymetrics <- renderDataTable({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata <- xlsxdata()
    summ_stats(xlsxdata[[input$selectedcolumn]])
  })
  
  output$displaydataTable <- renderDataTable({xlsxdata()})
  
  output$displaymax <- renderText({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata <- xlsxdata()
    selectedcolumn <- as.character(input$selectedcolumn)
    maxval <- max(xlsxdata[, selectedcolumn])
    paste0("Maximum: ", maxval)
  })
  
  output$displaymin <- renderText({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata <- xlsxdata()
    selectedcolumn <- as.character(input$selectedcolumn)
    minval <- min(xlsxdata[, selectedcolumn])
    paste0("Minimum: ", minval)
  })
  
  output$displayrechist <- renderText({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata_max <- input$selectedmax
    xlsxdata_min <- input$selectedmin
    xlsxdata_range <- xlsxdata_max - xlsxdata_min
    paste0("Recommended Histogram Bin Size: ", xlsxdata_range * 0.01)
  })
  
  output$displayplot <- renderPlotly({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    req(input$selectedbinsize)
    xlsxdata <- xlsxdata()
    selectedcolumn <- input$selectedcolumn
    summarystats <- summ_stats(xlsxdata[[selectedcolumn]])
    binwidth <- input$selectedbinsize
    binwidth2 <- input$selecteddensize
    xmin <- input$selectedmin
    xmax <- input$selectedmax
    
    p <-
      ggplot(data = xlsxdata, aes_string(x = selectedcolumn)) +
      geom_histogram(
        aes(color = "Histogram"),
        fill = "springgreen4",
        size = 0,
        binwidth = binwidth) +
      geom_density(
        # A density curve integrates to 1, so its y-values are not on the count
        # scale of the histogram. Multiplying the density by n * binwidth puts
        # both layers on the same y-axis so they can share one plot.
        aes(y = ..density.. * nrow(xlsxdata) * binwidth2, color = "Density Plot"),
        fill = "springgreen4",
        size = 0,
        alpha = 0.3
      ) +
      geom_vline(
        aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
        linetype = "longdash",
        size = 0.25,
      ) +
      geom_vline(
        aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
        linetype = "dotdash",
        size = 0.25,
      ) +
      geom_vline(
        aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
        linetype = "dotted",
        size = 0.25,
      ) +
      geom_vline(
        aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
        linetype = "twodash",
        size = 0.25,
      ) +
      geom_vline(
        aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
        linetype = "dashed",
        size = 0.25,
      ) +
      labs(
        y = "Count"
      ) +
      scale_x_continuous(
        labels = function(x)
          format(x, scientific = FALSE),
        guide = guide_legend(),
        limits = c(xmin, xmax)
      ) +
      scale_color_manual(
        name = "",
        values = c(
          "Histogram" = "springgreen4",
          "Density Plot" = "springgreen4",
          "Arithmetic Mean" = "red",
          "Median" = "blue",
          "Mode" = "orange",
          "Geometric Mean" = "green",
          "Harmonic Mean" = "grey"
        )
      ) +
      theme_classic() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle = 45, hjust = 1))
    
    ggplotly(p) %>%
      layout(
        legend = list(
          orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))
    
  })
  
  
  observe({
    req(input$xlsxfile)
    xlsxdata <- xlsxdata()
    updateSelectInput(
      session = session,
      inputId = "selectedcolumn",
      choices = colnames(xlsxdata[, sapply(xlsxdata, is.numeric)])
    )
  })
  
  observe({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata <- xlsxdata()
    selectedcolumn <- as.character(input$selectedcolumn)
    xlsxdata_max <- range(xlsxdata[, selectedcolumn])[2]
    xlsxdata_min <- range(xlsxdata[, selectedcolumn])[1]
    xlsxdata_range <- xlsxdata_max - xlsxdata_min
    xlsxdata_rec <- (input$selectedmax - input$selectedmin) * 0.01
    
    updateNumericInput(
      session = session,
      inputId = "selectedbinsize",
      max = xlsxdata_range * 0.1,
      min = xlsxdata_range * 0.001,
      value = xlsxdata_rec,
      step = xlsxdata_range * 0.001
    )
    updateNumericInput(
      session = session,
      inputId = "selecteddensize",
      max = xlsxdata_range * 0.1,
      min = xlsxdata_range * 0.0001,
      value = xlsxdata_rec,
      step = xlsxdata_range * 0.0001
    )
    # updateSliderInput(
    #   session = session,
    #   inputId = "selectedbinsize",
    #   max = xlsxdata_range * 0.1,
    #   min = xlsxdata_range * 0.001,
    #   value = xlsxdata_range * 0.01,
    #   step = xlsxdata_range * 0.001
    # )
    # updateSliderInput(
    #   session = session,
    #   inputId = "selecteddensize",
    #   max = xlsxdata_range * 0.1,
    #   min = xlsxdata_range * 0.0001,
    #   value = xlsxdata_range * 0.01,
    #   step = xlsxdata_range * 0.0001
    # )
  })
  
  observe({
    req(input$xlsxfile)
    req(input$selectedcolumn)
    xlsxdata <- xlsxdata()
    selectedcolumn <- as.character(input$selectedcolumn)
    xlsxdata_max <- range(xlsxdata[, selectedcolumn])[2]
    xlsxdata_min <- range(xlsxdata[, selectedcolumn])[1]
    xlsxdata_range <- xlsxdata_max - xlsxdata_min
    updateNumericInput(
      session = session,
      inputId = "selectedmax",
      max = xlsxdata_max,
      min = xlsxdata_min,
      value = xlsxdata_max,
      step = xlsxdata_range * 0.001
    )
    updateNumericInput(
      session = session,
      inputId = "selectedmin",
      max = xlsxdata_max,
      min = xlsxdata_min,
      value = xlsxdata_min,
      step = xlsxdata_range * 0.001
    )
    # updateSliderInput(
    #   session = session,
    #   inputId = "selectedmax",
    #   max = xlsxdata_max,
    #   min = xlsxdata_min,
    #   value = xlsxdata_max,
    #   step = xlsxdata_range * 0.001
    # )
    # updateSliderInput(
    #   session = session,
    #   inputId = "selectedmin",
    #   max = xlsxdata_max,
    #   min = xlsxdata_min,
    #   value = xlsxdata_min,
    #   step = xlsxdata_range * 0.001
    # )
  })
  
}


shinyApp(ui, server)