This project is hosted on GitHub: https://github.com/nathankchan/shopify-ds-intern-challenge
NB: Show or hide all code snippets using the Code button located in the upper right corner.
Given some sample data, write a program to answer the following questions (the required data set accompanies the challenge). Short answer: the most appropriate metric to report is the geometric mean, whose value is $285.02.
If you already have R installed, consider downloading this project from GitHub to run your own custom analysis! Open a terminal, navigate to the repo, and type Rscript app.R for an interactive GUI that produces histogram-density plots and key metrics.
This analysis requires R to be installed. If it is not installed, please visit r-project.org to download the latest version.
This analysis requires the following R packages: knitr, kableExtra, tidyverse, readxl, plotly, htmlwidgets, and shiny.
If any of these packages are not installed, this analysis will ask for your permission to install before proceeding.
To ensure this analysis is reproducible, a series of scripts (see Appendix) is run to initialize the environment and load the “2019 Winter Data Science Intern Challenge Data Set.xlsx” Excel file into R. Data are stored in the tibble `shopifydata`. For ease of access and readability, `shopifydata` is also attached to the R search path.
# NB: If any required packages are missing and you are running this code block
# within an R Notebook, it will fail to execute properly. This behaviour is
# intentional and acts as a safety mechanism to ensure the user provides
# explicit consent before installing packages. If this occurs and you wish to
# install the packages, copy the code below into your terminal or console, then
# continue with the rest of this document.
source(paste0(getwd(), "/scripts/01_loaddata.R"))
attach(shopifydata)
It is generally good practice to examine the data before moving further. `shopifydata` contains 5000 rows and 7 columns describing sales data from sneaker shops. For readability, only the first few rows are displayed. As seen below, order value is contained in the column `order_amount`.
kbl(head(shopifydata),
caption = "Table 1: 2019 Winter Data Science Intern Challenge Data Set") %>%
kable_styling("hover", full_width = FALSE) %>%
scroll_box(
height = "250px")
order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
---|---|---|---|---|---|---|
1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:51 |
3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 04:23:55 |
4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:36 |
5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 04:35:10 |
6 | 58 | 882 | 138 | 1 | credit_card | 2017-03-14 15:25:00 |
Let’s also quickly confirm that the earlier cited average order value (AOV) is correct.
print(paste0("The AOV is $", mean(order_amount), "."))
## [1] "The AOV is $3145.128."
The AOV is $3145.13 (rounded to the nearest cent). It looks like everything checks out!
TL;DR: The data contain many outliers, including several extreme ones, such that the arithmetic mean does not represent the center of the distribution of `order_amount`. The geometric mean more accurately describes the center of these data.
Two key approaches should be considered when evaluating a data set: examining summary statistics and visualizing the distribution. Using these approaches, I generate several observations.
TL;DR: The AOV (a.k.a. arithmetic mean) misrepresents the “center” of the distribution of `order_amount`, as it lies far beyond the 3rd quartile. We should consider other measures of central tendency, such as the median, mode, geometric mean, and harmonic mean.
First, let’s take a look at some summary statistics. We’ll use a custom function called `summ_stats()` on `order_amount` to produce the table below.
summarystats <- summ_stats(order_amount)
kbl(summarystats,
digits = 2,
align = "lr",
caption = "Table 2: Summary of Order Value") %>%
kable_styling("hover", full_width = FALSE) %>%
row_spec(c(3,4,7:9), bold = TRUE) %>%
column_spec(c(1,2), width = "50%") %>%
scroll_box(
width = "100%",
box_css = "border: 0px solid #ddd; padding: 5px; ")
Statistic | Value |
---|---|
Minimum | 90.00 |
1st Quartile | 163.00 |
Median | 284.00 |
Arithmetic Mean | 3145.13 |
3rd Quartile | 390.00 |
Maximum | 704000.00 |
Mode | 153.00 |
Geometric Mean | 285.02 |
Harmonic Mean | 235.32 |
Bold rows provide the key metrics we’re most interested in. That said, let’s first take a look at some of the additional information provided.
The table shows that `order_amount` contains a maximum value of 704000.00, far greater than the AOV of 3145.13 (herein described as the “arithmetic mean”). It’s possible that the arithmetic mean is being pulled up by at least one extremely large value in the data.

Also, the 3rd quartile is 390.00, which is far less than the arithmetic mean. This means that at least 75% of the data lie below the arithmetic mean of `order_amount`.
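To see why that follows, here is a minimal sketch with made-up numbers (not the challenge data): whenever the arithmetic mean exceeds the 3rd quartile, at least 75% of observations must lie below the mean.

```r
# Toy data (hypothetical): nine modest orders plus one huge order
x <- c(rep(100, 9), 100000)

q3 <- unname(quantile(x, 0.75))  # 3rd quartile: 100
m  <- mean(x)                    # arithmetic mean: 10090

# The mean sits far above Q3, so at least 75% of values fall below it
mean(x < m)  # proportion of observations below the mean: 0.9
```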
Back to the key metrics: arithmetic mean, median, mode, geometric mean, and harmonic mean. These metrics represent different ways of measuring the “center” of the data. We’ll describe them in greater detail in Part 1b. For now, just note that the median (284.00), mode (153.00), geometric mean (285.02), and harmonic mean (235.32) are all less than the arithmetic mean (3145.13).
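As an aside, these three means always satisfy arithmetic ≥ geometric ≥ harmonic (with equality only when every value is identical), and the gaps widen as skew increases. A quick sketch with made-up numbers:

```r
# Toy right-skewed data (hypothetical, not the challenge data)
x <- c(90, 120, 150, 150, 200, 10000)

am <- mean(x)            # arithmetic mean: pulled up hard by the 10000
gm <- exp(mean(log(x)))  # geometric mean
hm <- 1 / mean(1 / x)    # harmonic mean

# Mirrors the ordering in Table 2: arithmetic > geometric > harmonic
c(arithmetic = am, geometric = gm, harmonic = hm)
```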
Together, these observations suggest that the distribution of `order_amount` is heavily right-skewed. We will verify this intuition when we visualize the data in the next section.
TL;DR: Visualizing the data confirms that the arithmetic mean is far from the center of the distribution of `order_amount`. Other measures of central tendency characterize the center more appropriately.
Let’s visualize the distribution of `order_amount`. Using the `ggplot2` and `plotly` packages, we can create an interactive histogram to examine the data.
This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements.
binwidth1 <- 10000
p1 <-
ggplot(data = shopifydata, aes(x = order_amount)) +
geom_histogram(
aes(color = "Histogram"),
fill = "springgreen4",
size = 0,
binwidth = binwidth1) +
geom_density(
# Density plots are usually constrained within [0,1]. However, ggplot
# requires that the y-axis of plots have the same scale. This is a
# workaround to let our density plot display properly.
aes(y = ..density.. * nrow(shopifydata) * binwidth1 / 100, color = "Density Plot"),
fill = "springgreen4",
size = 0,
alpha = 0.3) +
labs(x = "Order Value ($)", y = "Count") +
scale_x_continuous(
labels = function(x)
format(x, scientific = FALSE)
) +
scale_color_manual(
values = c(
"Histogram" = "springgreen4",
"Density Plot" = "springgreen4"
)
) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p1) %>%
layout(title = list(text = paste0(
'<span style = "font-size: 15px;">',
"<b>Figure 1: Histogram of Order Value</b>",
"</span>")),
legend = list(
orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))
It’s clear that `order_amount` has several extreme outliers at 704000, as well as outliers between 20000 and 200000. While these values are few in number, they are far greater than any of the key metrics we considered in the previous section.
Moreover, we can visually confirm the extreme right skewness of `order_amount`. Our earlier intuition was correct.
Lastly, let’s take a look at `order_amount` after we remove any values over 20000. We’ll also plot the key metrics directly on the graph.
This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements. Values of key metrics are summarized in the key-metrics table in Part 1b.
shopifydata_under20000 <- shopifydata %>% filter(order_amount < 20000)
binwidth2 <- 25
p2 <-
ggplot(data = shopifydata_under20000, aes(x = order_amount)) +
geom_histogram(
aes(color = "Histogram"),
fill = "springgreen4",
size = 0,
binwidth = binwidth2) +
geom_density(
# Density plots are usually constrained within [0,1]. However, ggplot
# requires that the y-axis of plots have the same scale. This is a
# workaround to let our density plot display properly.
aes(y = ..density.. * nrow(shopifydata_under20000) * binwidth2, color = "Density Plot"),
fill = "springgreen4",
size = 0,
alpha = 0.3
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
linetype = "longdash",
size = 0.25,
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
linetype = "dotdash",
size = 0.25,
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
linetype = "dotted",
size = 0.25,
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
linetype = "twodash",
size = 0.25,
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
linetype = "dashed",
size = 0.25,
) +
labs(
x = "Order Value ($)",
y = "Count"
) +
scale_x_continuous(
labels = function(x)
format(x, scientific = FALSE),
guide = guide_legend()
) +
scale_color_manual(
name = "",
values = c(
"Histogram" = "springgreen4",
"Density Plot" = "springgreen4",
"Arithmetic Mean" = "red",
"Median" = "blue",
"Mode" = "orange",
"Geometric Mean" = "green",
"Harmonic Mean" = "grey"
)
) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p2) %>%
layout(title = list(text = paste0(
'<span style = "font-size: 15px;">',
"<b>Figure 2: Histogram of Order Value</b>",
"</span>",
"<br>",
'<span style = "font-size: 14px;">',
"<sup>Showing only orders under $20000</sup>",
"</span>")),
legend = list(
orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))
Even after removing order values over 20000, the distribution still appears right-skewed. Moreover, the distribution appears to be multi-modal, with peaks separated at intervals of about 150 to 200 and decreasing in size as order value increases.
TL;DR: Extreme outliers might be orders from a wholesaler. Outliers might be orders from luxury sneaker shops. The data appear to be multi-modal, likely because sneakers are purchased in whole-number quantities.
Knowing that these data represent the value of orders from sneaker shops, I suggest several hypotheses and opportunities for further investigation.
The extreme outliers at `order_amount == 704000` could represent bulk orders from a sneaker wholesaler to a retailer. This inference is supported by the following observations:

- all extreme outliers came from the same shop (`shop_id == 42`);
- a single user (`user_id == 607`) was responsible for these orders; and
- every order contained the same number of items (`total_items == 2000`).

# Select the extreme outliers only
shopifydata %>%
filter(order_amount > 500000) %>%
kbl(., caption = "Table 3: Extreme Outliers") %>%
kable_styling("hover", fixed_thead = TRUE) %>%
scroll_box(height = "250px")
order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
---|---|---|---|---|---|---|
16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 |
61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 04:00:00 |
521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 |
1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 |
1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 04:00:00 |
1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 04:00:00 |
1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 04:00:00 |
1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 04:00:00 |
2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 04:00:00 |
2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 |
2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 |
2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 |
3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 |
4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 |
4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 |
4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 04:00:00 |
4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 04:00:00 |
The outliers between 20000 and 200000 could represent orders from a luxury sneaker retailer. This inference is supported by the following observations:

- all outliers came from the same shop (`shop_id == 78`);
- orders contained small numbers of items (`total_items` between 1 and 6); and
- every order implies the same unit cost (`order_amount / total_items == 25725`).

# Select the outliers and compute the unit cost for each row
shopifydata %>%
filter(order_amount > 20000 & order_amount < 500000) %>%
mutate(unit_cost = order_amount / total_items) %>%
kbl(.,
caption = "Table 4: Outliers") %>%
kable_styling("hover", fixed_thead = TRUE) %>%
scroll_box(height = "250px")
order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_cost |
---|---|---|---|---|---|---|---|
161 | 78 | 990 | 25725 | 1 | credit_card | 2017-03-12 05:56:56 | 25725 |
491 | 78 | 936 | 51450 | 2 | debit | 2017-03-26 17:08:18 | 25725 |
494 | 78 | 983 | 51450 | 2 | cash | 2017-03-16 21:39:35 | 25725 |
512 | 78 | 967 | 51450 | 2 | cash | 2017-03-09 07:23:13 | 25725 |
618 | 78 | 760 | 51450 | 2 | cash | 2017-03-18 11:18:41 | 25725 |
692 | 78 | 878 | 154350 | 6 | debit | 2017-03-27 22:51:43 | 25725 |
1057 | 78 | 800 | 25725 | 1 | debit | 2017-03-15 10:16:44 | 25725 |
1194 | 78 | 944 | 25725 | 1 | debit | 2017-03-16 16:38:25 | 25725 |
1205 | 78 | 970 | 25725 | 1 | credit_card | 2017-03-17 22:32:21 | 25725 |
1260 | 78 | 775 | 77175 | 3 | credit_card | 2017-03-27 09:27:19 | 25725 |
1385 | 78 | 867 | 25725 | 1 | cash | 2017-03-17 16:38:06 | 25725 |
1420 | 78 | 912 | 25725 | 1 | cash | 2017-03-30 12:23:42 | 25725 |
1453 | 78 | 812 | 25725 | 1 | credit_card | 2017-03-17 18:09:54 | 25725 |
1530 | 78 | 810 | 51450 | 2 | cash | 2017-03-29 07:12:01 | 25725 |
2271 | 78 | 855 | 25725 | 1 | credit_card | 2017-03-14 23:58:21 | 25725 |
2453 | 78 | 709 | 51450 | 2 | cash | 2017-03-27 11:04:04 | 25725 |
2493 | 78 | 834 | 102900 | 4 | debit | 2017-03-04 04:37:33 | 25725 |
2496 | 78 | 707 | 51450 | 2 | cash | 2017-03-26 04:38:52 | 25725 |
2513 | 78 | 935 | 51450 | 2 | debit | 2017-03-18 18:57:13 | 25725 |
2549 | 78 | 861 | 25725 | 1 | cash | 2017-03-17 19:35:59 | 25725 |
2565 | 78 | 915 | 77175 | 3 | debit | 2017-03-25 01:19:35 | 25725 |
2691 | 78 | 962 | 77175 | 3 | debit | 2017-03-22 07:33:25 | 25725 |
2774 | 78 | 890 | 25725 | 1 | cash | 2017-03-26 10:36:43 | 25725 |
2819 | 78 | 869 | 51450 | 2 | debit | 2017-03-17 06:25:50 | 25725 |
2822 | 78 | 814 | 51450 | 2 | cash | 2017-03-02 17:13:25 | 25725 |
2907 | 78 | 817 | 77175 | 3 | debit | 2017-03-16 03:45:46 | 25725 |
2923 | 78 | 740 | 25725 | 1 | debit | 2017-03-12 20:10:58 | 25725 |
3086 | 78 | 910 | 25725 | 1 | cash | 2017-03-26 01:59:26 | 25725 |
3102 | 78 | 855 | 51450 | 2 | credit_card | 2017-03-21 05:10:34 | 25725 |
3152 | 78 | 745 | 25725 | 1 | credit_card | 2017-03-18 13:13:07 | 25725 |
3168 | 78 | 927 | 51450 | 2 | cash | 2017-03-12 12:23:07 | 25725 |
3404 | 78 | 928 | 77175 | 3 | debit | 2017-03-16 09:45:04 | 25725 |
3441 | 78 | 982 | 25725 | 1 | debit | 2017-03-19 19:02:53 | 25725 |
3706 | 78 | 828 | 51450 | 2 | credit_card | 2017-03-14 20:43:14 | 25725 |
3725 | 78 | 766 | 77175 | 3 | credit_card | 2017-03-16 14:13:25 | 25725 |
3781 | 78 | 889 | 25725 | 1 | cash | 2017-03-11 21:14:49 | 25725 |
4041 | 78 | 852 | 25725 | 1 | cash | 2017-03-02 14:31:11 | 25725 |
4080 | 78 | 946 | 51450 | 2 | cash | 2017-03-20 21:13:59 | 25725 |
4193 | 78 | 787 | 77175 | 3 | credit_card | 2017-03-18 09:25:31 | 25725 |
4312 | 78 | 960 | 51450 | 2 | debit | 2017-03-01 03:02:10 | 25725 |
4413 | 78 | 756 | 51450 | 2 | debit | 2017-03-02 04:13:38 | 25725 |
4421 | 78 | 969 | 77175 | 3 | debit | 2017-03-09 15:21:34 | 25725 |
4506 | 78 | 866 | 25725 | 1 | debit | 2017-03-22 22:06:00 | 25725 |
4585 | 78 | 997 | 25725 | 1 | cash | 2017-03-25 21:48:43 | 25725 |
4716 | 78 | 818 | 77175 | 3 | debit | 2017-03-05 05:10:43 | 25725 |
4919 | 78 | 823 | 25725 | 1 | cash | 2017-03-15 13:26:46 | 25725 |
The multi-modal distribution, especially when examining the data with outliers excluded, could emerge from the discreteness of the underlying data. This discreteness can be attributed to the typical unit cost of sneakers of about 150 (note that all key metrics have approximately the same value in the table below) and to the fact that sneakers are purchased in whole-number quantities (i.e., `total_items` is always a positive integer).
# Exclude the outliers and compute the unit cost, then summarize in a table
shopifydata %>%
filter(order_amount < 20000) %>%
mutate(unit_cost = order_amount / total_items) %>%
pull(unit_cost) %>%
summ_stats() %>%
kbl(.,
digits = 2,
align = "lr",
caption = "Table 5: Summary of Unit Cost Excluding Outliers") %>%
kable_styling("hover", full_width = FALSE) %>%
row_spec(c(3,4,7:9), bold = TRUE) %>%
column_spec(c(1,2), width = "50%") %>%
scroll_box(
width = "100%",
box_css = "border: 0px solid #ddd; padding: 5px; ")
Statistic | Value |
---|---|
Minimum | 90.00 |
1st Quartile | 132.00 |
Median | 153.00 |
Arithmetic Mean | 151.79 |
3rd Quartile | 166.00 |
Maximum | 352.00 |
Mode | 153.00 |
Geometric Mean | 149.31 |
Harmonic Mean | 146.95 |
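The whole-number mechanism is easy to simulate. Assuming a hypothetical shop with a $150 unit cost and small whole-number basket sizes (the probabilities below are made up, not fitted to the real data), order values can only land on exact multiples of the unit cost, producing evenly spaced peaks of decreasing height:

```r
set.seed(42)

# Hypothetical shop: every pair costs $150, and larger baskets are rarer
unit_cost   <- 150
total_items <- sample(1:4, size = 1000, replace = TRUE,
                      prob = c(0.5, 0.3, 0.15, 0.05))
order_value <- unit_cost * total_items

# Orders pile up at exact multiples of the unit cost: 150, 300, 450, 600
table(order_value)
```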
TL;DR: The most appropriate metric to report is the geometric mean, as it accurately represents the center of the distribution while accommodating the presence of outliers. The median could be easier to conceptualize but has certain drawbacks. Practically speaking, both values are nearly the same for these data.
As described earlier in Table 2, several key metrics can be considered. These metrics represent different ways of measuring the “center” of a dataset. The most appropriate metric to report depends on the underlying characteristics of the data.
For convenience, key metrics have been reprinted below. The same colours used in Figure 2 are applied to highlight the table below.
keymetrics <- summarystats[c(4,3,7:9),]
row.names(keymetrics) <- NULL
kable(keymetrics,
digits = 2,
align = "lr",
caption = "Table 6: Key Metrics") %>%
kable_styling("hover", full_width = FALSE) %>%
row_spec(1, italic = TRUE, color = "red") %>%
row_spec(2, color = "blue") %>%
row_spec(3, color = "orange") %>%
row_spec(4, bold = TRUE, color = "lime", background = "black") %>%
row_spec(5, color = "grey") %>%
column_spec(c(1,2), width = "50%") %>%
scroll_box(
width = "100%",
box_css = "border: 0px solid #ddd; padding: 5px; ")
Statistic | Value |
---|---|
Arithmetic Mean | 3145.13 |
Median | 284.00 |
Mode | 153.00 |
Geometric Mean | 285.02 |
Harmonic Mean | 235.32 |
Of these metrics, I argue that the geometric mean is the most appropriate. As seen below, the geometric mean lies at the approximate “center” of the distribution.
This plot is interactive. Drag your cursor over the plot to zoom in. Double-click to zoom out. Click on the legend items to hide or show plot elements.
p3 <- p2 +
coord_cartesian(xlim = c(0,1000))
# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
ggplotly(p3) %>%
layout(title = list(text = paste0(
'<span style = "font-size: 15px;">',
"<b>Figure 3: Histogram of Order Value</b>",
"</span>",
"<br>",
'<span style = "font-size: 14px;">',
"<sup>Showing only orders under $20000 | Zoomed in from $0 to $1000</sup>",
"</span>")),
legend = list(
orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))
While this metric is often used for data with multiplicative relationships (e.g., interest rates or ratios), it is also useful when data do not exist on the same scale or have large outliers. In other words, the geometric mean is less susceptible to skew effects than other metrics. The geometric mean also makes full use of the available data.
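That robustness is easy to demonstrate with made-up numbers: multiplying a single observation by 100 moves the arithmetic mean roughly 30-fold here, but the geometric mean only by a factor of 100^(1/6) ≈ 2.15, because the outlier enters on the log scale.

```r
# Toy data (hypothetical, not the challenge data)
x <- c(90, 120, 150, 150, 200, 300)
y <- x
y[6] <- x[6] * 100  # inflate one order into an extreme outlier: 30000

geo_mean <- function(v) exp(mean(log(v)))

c(am_before = mean(x), am_after = mean(y))          # ~168 vs ~5118
c(gm_before = geo_mean(x), gm_after = geo_mean(y))  # ratio is 100^(1/6)
```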
With that said, the geometric mean could be unfamiliar to lay audiences. If familiarity and comprehensibility are the priority, then the median may be more appropriate. In this particular case, the geometric mean and median are extremely close, and both metrics appear to lie at the “center” of the distribution. Either metric might be acceptable.
Nevertheless, both the geometric mean and the median have drawbacks. Although not a concern for this dataset, it should be noted that the geometric mean is not defined when the data have meaningful zeros or negative numbers.
Meanwhile, the median (and mode) ignore parts of the data. For example, say a particular sneaker that cost $150 became a viral phenomenon in the next month of data (i.e., April 2017), accounting for the lowest 51% of orders, while the remaining 49% of orders cost anywhere from $151 to $704000. This could make the median misleading to report, as it would be far from the “center” of the distribution. Since the median only considers the “middle” of the data, it remains susceptible to skew effects.
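The viral-sneaker scenario is easy to sketch with simulated (entirely hypothetical) data: once 51% of orders sit at a single low price, the median locks onto that price regardless of what the upper 49% of orders look like.

```r
set.seed(1)

# Hypothetical April 2017: 51% of orders at $150, the remaining 49%
# spread anywhere from $151 up to $704000
orders <- c(rep(150, 510),
            runif(490, min = 151, max = 704000))

median(orders)          # exactly 150, blind to the upper 49%
exp(mean(log(orders)))  # the geometric mean reflects the whole sample
```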
In any case, reporting the arithmetic mean, mode, or harmonic mean as the “center” of the dataset would be inappropriate. As already seen, the arithmetic mean is extremely susceptible to outliers. The mode does not characterize the “center” of the data, but instead focuses on the most frequent values. Finally, the harmonic mean is more appropriate for averaging rates or ratios and not (raw) values like order value.
The most appropriate metric to report is the geometric mean. Its value is $285.02.
If familiarity and comprehensibility are the priority, the median could also be reported, though it has drawbacks as described earlier.
-- 2a. Count orders shipped by Speedy Express
SELECT
COUNT(*) AS TotalOrders
FROM
Orders
INNER JOIN
Shippers ON Orders.ShipperID = Shippers.ShipperID
WHERE
ShipperName = 'Speedy Express';
-- 2b.1 Using MAX() and INNER JOIN
/*
NB: 2b.1 is faster than 2b.2 but is less useful if the intention is to rank
employees by number of orders.
*/
SELECT
LastName
FROM (
SELECT
LastName,
MAX(TotalOrders) AS MaxTotalOrders
FROM (
SELECT
Employees.LastName,
COUNT(*) AS TotalOrders
FROM
Employees
INNER JOIN
Orders ON Employees.EmployeeID = Orders.EmployeeID
GROUP BY
Employees.EmployeeID
)
);
-- 2b.2 Using ORDER BY and LIMIT
/*
NB: 2b.2 is slower than 2b.1 but makes it easier to get a table of employees
ranked by number of orders.
*/
SELECT
LastName
FROM (
SELECT
Employees.LastName,
COUNT(*) AS TotalOrders
FROM
Employees
INNER JOIN
Orders ON Employees.EmployeeID = Orders.EmployeeID
GROUP BY
Employees.EmployeeID
ORDER BY
TotalOrders DESC
)
LIMIT 1;
-- 2c.1 Using MAX() and INNER JOIN
/*
NB: 2c.1 is faster than 2c.2 but is less useful if the intention is to rank
the most popular products among German customers.
*/
SELECT
ProductName
FROM (
SELECT
ProductName,
MAX(SumQty) AS MaxSumQty
FROM (
SELECT
Products.ProductName,
SUM(OrderDetails.Quantity) AS SumQty
FROM
Products
INNER JOIN
OrderDetails ON Products.ProductID = OrderDetails.ProductID
INNER JOIN
Orders ON OrderDetails.OrderID = Orders.OrderID
INNER JOIN
Customers ON Orders.CustomerID = Customers.CustomerID
WHERE
Customers.Country = 'Germany'
GROUP BY
OrderDetails.ProductID
)
);
-- 2c.2 Using ORDER BY and LIMIT
/*
NB: 2c.2 is slower than 2c.1 but makes it easier to get a table of the most
popular products among German customers.
*/
SELECT
ProductName
FROM (
SELECT
Products.ProductName,
SUM(OrderDetails.Quantity) AS SumQty
FROM
Products
INNER JOIN
OrderDetails ON Products.ProductID = OrderDetails.ProductID
INNER JOIN
Orders ON OrderDetails.OrderID = Orders.OrderID
INNER JOIN
Customers ON Orders.CustomerID = Customers.CustomerID
WHERE
Customers.Country = 'Germany'
GROUP BY
OrderDetails.ProductID
ORDER BY
SumQty DESC
)
LIMIT 1;
cat ./R/functions.R
# File name: functions.R
# Path: './R/functions.R'
# Author: N. Chan
# Purpose: Script for housing custom functions required for project.
# NB: Strictly speaking, this is a project and not a package. That said, using
# the Roxygen2 documentation style for writing functions is helpful for keeping
# information about custom functions organized.
#' @title Check, install, and load required packages
#'
#' @description Automated method for checking, installing, and loading a list
#' of packages provided by the user. Function asks the user before installing.
#'
#' @param ... A list or vector containing the names of packages as strings to
#' be loaded and installed.
#'
#' @return None
#'
#' @examples
#' \dontrun{
#' pkgs <- c("ggplot2", "tidyverse")
#' using(pkgs)
#' }
#'
#' @author N. Chan
#' @export
#'
using <- function(...) {
libs <- unlist(list(...))
req <- suppressWarnings(unlist(lapply(libs, require, character.only = TRUE)))
need <- libs[req == FALSE]
n <- length(need)
if (n > 0) {
libsmsg <-
if (n > 2) {
paste(paste(need[1:(n - 1)], collapse = ", "), ",", sep = "")
} else
need[1]
if (n > 1) {
libsmsg <- paste(libsmsg, " and ", need[n], sep = "")
}
libsmsg <-
paste(
"The following packages could not be found: ",
libsmsg,
"\n\r\n\rInstall missing packages?",
collapse = ""
)
# Checks if R is in interactive mode. If yes, then prompt user for
# interactive response. If no, prompt user for input from stdin.
if (interactive()) {
if (!(askYesNo(libsmsg, default = FALSE) %in% c(NA, FALSE))) {
install.packages(need)
lapply(need, require, character.only = TRUE)
} else {
stop("required packages were not installed or loaded")
}
} else {
cat(libsmsg, "(yes/No/cancel) ")
response <- readLines("stdin", n = 1)
input <- pmatch(tolower(response), c("yes", "no", "cancel"))
if (!nchar(response) | input %in% c(2, 3)) {
stop("required packages were not installed or loaded")
} else if (is.na(input)) {
stop("Unrecognized response ", dQuote(response))
} else {
install.packages(need)
lapply(need, require, character.only = TRUE)
}
}
}
return(invisible(NULL))
}
# Prompt the user for input, working both interactively and via Rscript
user_input <- function(prompt) {
  if (interactive()) {
    return(readline(prompt))
  } else {
    cat(prompt)
    return(readLines("stdin", n = 1))
  }
}
#' @title Compute the statistical mode of an R object
#'
#' @description Computes the statistical mode of an R object.
#'
#' @param x An R object that can be interpreted as factors.
#'
#' @return A table with the name of the most frequent element and its respective
#' count.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' mode_stat(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'
mode_stat <- function(x) {
freqtable <- table(x)
return(freqtable[which.max(freqtable)])
}
#' @title Compute the geometric mean of a numeric object
#'
#' @description Computes the geometric mean of a numeric object.
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A double representing the geometric mean.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' geo_mean(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'
geo_mean <- function(x) {
# Taking the mean of the natural log then raising e to the mean is more
# computationally efficient than taking the product of all elements of x
# and taking the n-th root.
gm <- exp(mean(log(x)))
return(gm)
}
#' @title Compute the harmonic mean of a numeric object
#'
#' @description Computes the harmonic mean of a numeric object.
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A double representing the harmonic mean.
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' har_mean(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'
har_mean <- function(x) {
hm <- 1/(mean(1/x))
return(hm)
}
#' @title Provides summary statistics
#'
#' @description Wrapper for summary(), mode_stat(), geo_mean(), and har_mean()
#'
#' @param x An R object that can be coerced into a numeric vector.
#'
#' @return A data frame containing two columns (name of statistic and value)
#'
#' @examples
#' \dontrun{
#' mydata <- c(1,1,1,1,2,2,2,3,3)
#' summ_stats(mydata)
#' }
#'
#' @author N. Chan
#' @export
#'
summ_stats <- function(x) {
if(!is.numeric(x)) {
stop("x must be a numeric object.")
}
summ <- summary(x)
mo <- as.numeric(names(mode_stat(x)))
gm <- geo_mean(x)
hm <- har_mean(x)
out <- data.frame(
Statistic = c(names(summ), "Mode", "Geometric Mean", "Harmonic Mean"),
Value = c(as.vector(summ), mo, gm, hm)
)
out[which(out[,1] == "Min."), 1] <- "Minimum"
out[which(out[,1] == "1st Qu."), 1] <- "1st Quartile"
out[which(out[,1] == "Mean"), 1] <- "Arithmetic Mean"
out[which(out[,1] == "3rd Qu."), 1] <- "3rd Quartile"
out[which(out[,1] == "Max."), 1] <- "Maximum"
return(out)
}
cat ./scripts/00_init.R
# File name: 00_init.R
# Path: './scripts/00_init.R'
# Author: N. Chan
# Purpose: Initializes project and installs required dependencies.
# Load up custom functions
source(paste0(getwd(), "/R/functions.R"))
message("./R/functions.R is loaded.")
# List of required packages for analysis
required_packages <-
c(
"knitr",
"kableExtra",
#"fansi",
"tidyverse",
"readxl",
"plotly",
"htmlwidgets",
"shiny")
# Check, install, and load required packages
using(required_packages)
message("./scripts/00_init.R was executed.")
cat ./scripts/01_loaddata.R
# File name: 01_loaddata.R
# Path: './scripts/01_loaddata.R'
# Author: N. Chan
# Purpose: Loads the 2019 Winter Data Science Intern Challenge Data Set
# Load and install missing packages
source(paste0(getwd(), "/scripts/00_init.R"))
# Read in data
if (exists("xlsxfile")) {
datafile <- xlsxfile
message(paste0("Using user-provided xlsxfile: ", datafile))
} else {
datafile <- paste0(
getwd(),
"/data/raw_data/2019 Winter Data Science Intern Challenge Data Set.xlsx"
)
message(paste0("Using default xlsxfile: ", datafile))
}
shopifydata <- read_excel(datafile)
message("./scripts/01_loaddata.R was executed.")
NB: Script not used in report. Provided as option for custom analysis.
cat ./scripts/02_analyze.R
# File name: 02_analyze.R
# Path: './scripts/02_analyze.R'
# Author: N. Chan
# Purpose: Performs analysis per questions provided in challenge
# Prompt:
# Question 1: On Shopify, we have exactly 100 sneaker shops, and each of these
# shops sells only one model of shoe. We want to do some analysis of the average
# order value (AOV). When we look at orders data over a 30 day window, we
# naively calculate an AOV of $3145.13. Given that we know these shops are
# selling sneakers, a relatively affordable item, something seems wrong with our
# analysis.
#
# 1a. Think about what could be going wrong with our calculation. Think about a
# better way to evaluate this data.
# 1b. What metric would you report for this dataset?
# 1c. What is its value?
# Load up packages and data required for analysis
source(paste0(getwd(), "/scripts/01_loaddata.R"))
attach(shopifydata)
if (exists("voi")) {
var_interest <- shopifydata %>% pull(voi)
voi_char <- as.character(voi)
message(paste0("Using user-provided variable of interest: ", voi))
} else {
var_interest <- order_amount
voi_char <- "order_amount"
message(paste0("Using default variable of interest: order_amount"))
}
if (!(is.numeric(var_interest))) {
stop("The variable of interest is not numeric. Only numeric variables are supported.")
}
if (exists("binsize")) {
binwidth <- as.numeric(binsize)
binwidth_density <- as.numeric(binsize)
message(paste0("Using user-provided bin width: ", binsize))
} else {
binwidth <- 10000
binwidth_density <- binwidth/100
message(paste0("Using default bin width: ", binwidth))
}
# Get summary statistics for var_interest
summarystats <- summ_stats(var_interest)
# Produce plot of var_interest
p <-
ggplot(data = shopifydata, aes(x = var_interest)) +
geom_histogram(
aes(color = "Histogram"),
fill = "springgreen4",
size = 0,
binwidth = binwidth) +
geom_density(
# Density plots are usually constrained within [0,1]. However, ggplot
# requires that the y-axis of plots have the same scale. This is a
# workaround to let our density plot display properly.
aes(y = ..density.. * nrow(shopifydata) * binwidth_density, color = "Density Plot"),
fill = "springgreen4",
size = 0,
alpha = 0.3
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
linetype = "longdash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
linetype = "dotdash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
linetype = "dotted",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
linetype = "twodash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
linetype = "dashed",
size = 0.25
) +
labs(
x = "Value",
y = "Count"
) +
scale_x_continuous(
labels = function(x)
format(x, scientific = FALSE),
guide = guide_legend()
) +
scale_color_manual(
name = "Key Metrics",
values = c(
"Histogram" = "springgreen4",
"Density Plot" = "springgreen4",
"Arithmetic Mean" = "red",
"Median" = "blue",
"Mode" = "orange",
"Geometric Mean" = "green",
"Harmonic Mean" = "grey"
)
) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
# plotly does not support subtitles or captions, so we need to manually define
# them with HTML as part of the title
p_out <-
ggplotly(p) %>%
layout(title = list(text = paste0(
'<span style = "font-size: 15px;">',
"<b>Histogram of ",
voi_char,
"</b>",
"</span>")))
# Create ./output if it does not already exist
if (!(dir.exists(paste0(getwd(), "/output")))) {
dir.create(paste0(getwd(), "/output"))
}
# Save out the key metrics to csv and plot to HTML file
write.csv(summarystats,
file = paste0(
getwd(),
"/output/summarystats.csv"))
saveWidget(p_out,
file = paste0(
getwd(),
"/output/plot.html"))
# Uncomment the line below to delete dependency files created by saveWidget()
# Not actually implemented as this should be fixed in htmlwidgets itself...
# unlink(paste0(getwd(), "/output/plot_files"), recursive = TRUE)
message("./scripts/02_analyze.R was executed.")
message(paste0("Outputs saved to ", getwd(), "/output"))
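The density-scaling workaround used in the plot above can be sanity-checked directly: evaluating a kernel density estimate at the histogram bin midpoints and multiplying by n * binwidth should closely track the bin counts. A minimal sketch on synthetic data (the bin width here is arbitrary):

```r
# Synthetic data; demonstrates that density * n * binwidth ~ bin counts.
set.seed(1)
x <- rnorm(1000)
binwidth <- 0.5
h <- hist(x,
          breaks = seq(floor(min(x)) - 1, ceiling(max(x)) + 1, by = binwidth),
          plot = FALSE)
d <- density(x)
# Evaluate the scaled density at each bin midpoint (rule = 2 avoids NAs
# at the tails, where the density grid may not reach the outermost bins)
scaled <- approx(d$x, d$y, xout = h$mids, rule = 2)$y * length(x) * binwidth
# scaled now tracks h$counts closely, so both can share one y-axis
```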
NB: Script not used in report.
cat ./scripts/03_question2.sql
/*
Challenge: https://docs.google.com/document/d/13VCtoyto9X1PZ74nPI4ZEDdb8hF8LAlcmLH1ZTHxKxE/edit#
SQL Dataset: https://www.w3schools.com/SQL/TRYSQL.ASP?FILENAME=TRYSQL_SELECT_ALL
*/
-- Question 2:
-- 2a. How many orders were shipped by Speedy Express in total?
-- 2a.1 Using nested SELECT
SELECT
COUNT(*) AS TotalOrders
FROM
Orders
WHERE
ShipperID = (
SELECT ShipperID
FROM Shippers
WHERE ShipperName = 'Speedy Express'
);
-- 2a.2 Using INNER JOIN
SELECT
COUNT(*) AS TotalOrders
FROM
Orders
INNER JOIN
Shippers ON Orders.ShipperID = Shippers.ShipperID
WHERE
ShipperName = 'Speedy Express';
-- 2b. What is the last name of the employee with the most orders?
-- 2b.1 Using nested SELECT
SELECT
LastName
FROM
Employees
WHERE
EmployeeID = (
SELECT
EmployeeID
FROM (
SELECT
MAX(TotalOrders) AS MaxTotalOrders,
EmployeeID
FROM (
SELECT
COUNT(*) AS TotalOrders,
EmployeeID
FROM
Orders
GROUP BY
EmployeeID
)
)
);
-- 2b.2 Using MAX() and INNER JOIN
/*
NB (1): 2b.2 returns an additional column (MaxTotalOrders).
NB (2): 2b.2 is faster than 2b.3 but is less useful if the intention is to rank
employees by number of orders.
*/
SELECT
LastName
FROM (
SELECT
LastName,
MAX(TotalOrders) AS MaxTotalOrders
FROM (
SELECT
Employees.LastName,
COUNT(*) AS TotalOrders
FROM
Employees
INNER JOIN
Orders ON Employees.EmployeeID = Orders.EmployeeID
GROUP BY
Employees.EmployeeID
)
);
-- 2b.3 Using ORDER BY and LIMIT
/*
NB (1): 2b.3 provides the exact answer requested.
NB (2): 2b.3 is slower than 2b.2 but makes it easier to get a table of employees
ranked by number of orders.
*/
SELECT
LastName
FROM (
SELECT
Employees.LastName,
COUNT(*) AS TotalOrders
FROM
Employees
INNER JOIN
Orders ON Employees.EmployeeID = Orders.EmployeeID
GROUP BY
Employees.EmployeeID
ORDER BY
TotalOrders DESC
)
LIMIT 1;
-- 2c. What product was ordered the most by customers in Germany?
-- 2c.1 Using nested SELECT
SELECT
ProductName
FROM
Products
WHERE
ProductID = (
SELECT
ProductID
FROM (
SELECT
ProductID,
MAX(SumQty) AS MaxSumQty
FROM (
SELECT
ProductID,
SUM(Quantity) AS SumQty
FROM
OrderDetails
WHERE
OrderID IN (
SELECT
OrderID
FROM
Orders
WHERE
CustomerID IN (
SELECT
CustomerID
FROM
Customers
WHERE
Country = 'Germany'
)
)
GROUP BY
ProductID
)
)
);
-- 2c.2 Using MAX() and INNER JOIN
/*
NB: 2c.2 is faster than 2c.3 but is less useful if the intention is to rank
the most popular products among German customers.
*/
SELECT
ProductName
FROM (
SELECT
ProductName,
MAX(SumQty) AS MaxSumQty
FROM (
SELECT
Products.ProductName,
SUM(OrderDetails.Quantity) AS SumQty
FROM
Products
INNER JOIN
OrderDetails ON Products.ProductID = OrderDetails.ProductID
INNER JOIN
Orders ON OrderDetails.OrderID = Orders.OrderID
INNER JOIN
Customers ON Orders.CustomerID = Customers.CustomerID
WHERE
Customers.Country = 'Germany'
GROUP BY
OrderDetails.ProductID
)
);
-- 2c.3 Using ORDER BY and LIMIT
/*
NB: 2c.3 is slower than 2c.2 but makes it easier to get a table of the most
popular products among German customers.
*/
SELECT
ProductName
FROM (
SELECT
Products.ProductName,
SUM(OrderDetails.Quantity) AS SumQty
FROM
Products
INNER JOIN
OrderDetails ON Products.ProductID = OrderDetails.ProductID
INNER JOIN
Orders ON OrderDetails.OrderID = Orders.OrderID
INNER JOIN
Customers ON Orders.CustomerID = Customers.CustomerID
WHERE
Customers.Country = 'Germany'
GROUP BY
OrderDetails.ProductID
ORDER BY
SumQty DESC
)
LIMIT 1;
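For readers more comfortable in R, the logic of query 2c.3 can be mirrored with base R merges. The tables below are hypothetical miniatures that only follow the schema queried above; the values are made up for illustration.

```r
# Hypothetical miniature tables mirroring the Northwind-style schema above
customers     <- data.frame(CustomerID = 1:3,
                            Country = c("Germany", "Germany", "France"))
orders        <- data.frame(OrderID = 1:4, CustomerID = c(1, 2, 3, 1))
order_details <- data.frame(OrderID   = c(1, 1, 2, 3, 4),
                            ProductID = c(10, 11, 10, 11, 11),
                            Quantity  = c(5, 2, 7, 3, 1))
products      <- data.frame(ProductID = c(10, 11),
                            ProductName = c("Chai", "Chang"))

# INNER JOINs as merge(), WHERE as a row filter, GROUP BY as aggregate(),
# ORDER BY ... LIMIT 1 as which.max()
german <- merge(merge(order_details, orders, by = "OrderID"),
                customers[customers$Country == "Germany", ],
                by = "CustomerID")
qty_by_product <- aggregate(Quantity ~ ProductID, data = german, FUN = sum)
top_product <- merge(qty_by_product[which.max(qty_by_product$Quantity), ],
                     products, by = "ProductID")$ProductName
top_product  # "Chai" for this toy data
```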
NB: Script not used in report. Provided as an option for custom analysis.
cat ./run.R
# File name: run.R
# Path: './run.R'
# Author: N. Chan
# Purpose: Command line option to run all scripts and analyses
# Usage: Rscript run.R [xlsxfile voi binsize]
# Takes three optional arguments; if any are supplied, all three must be:
#   xlsxfile  path to the excel file
#   voi       name of the numeric variable of interest, as written in row 1
#             of the respective column in the excel file
#   binsize   size of each bin in the histogram
# Returns a histogram-density plot of the data depicting several key metrics,
# as well as a csv file containing the names and values of the key metrics.
# Files are saved in "shopify-ds-intern-challenge/output".
cmdargs <- commandArgs(trailingOnly = TRUE)
if(length(cmdargs) == 0) {
message("No arguments provided. Using defaults.")
} else if (length(cmdargs) != 3) {
message("Invalid number of optional arguments provided. Using defaults.")
} else {
xlsxfile <- cmdargs[1]
voi <- cmdargs[2]
binsize <- cmdargs[3]
}
source(paste0(getwd(), "/scripts/02_analyze.R"))
cat(paste0("The geometric mean of ", voi_char, " is ", geo_mean(var_interest), ".\n"))
message("run.R has completed.")
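run.R relies on helpers such as geo_mean() and summ_stats() that are defined in the repo's scripts directory. For orientation, minimal sketches of the less common metrics are shown below; these are assumptions about the general formulas (strictly positive, non-empty numeric input), and the repo's actual helpers may differ.

```r
# Minimal sketches of the less common metrics reported by summ_stats().
# Assumptions: x is numeric, strictly positive, and non-empty.
geo_mean_sketch  <- function(x) exp(mean(log(x)))  # geometric mean
harm_mean_sketch <- function(x) 1 / mean(1 / x)    # harmonic mean
mode_sketch      <- function(x) {                  # most frequent value
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
geo_mean_sketch(c(1, 10, 100))   # 10
harm_mean_sketch(c(2, 4, 4))     # 3
mode_sketch(c(5, 5, 7))          # 5
```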
NB: Script not used in report. Provided as an option for custom analysis.
cat ./app.R
# File name: app.R
# Path: './app.R'
# Author: N. Chan
# Purpose: Runs a Shiny R app to interactively view a histogram from input data
source(paste0(getwd(), "/scripts/01_loaddata.R"))
# Define UI
ui <- fluidPage(
titlePanel("Histogram-Density Plot Viewer"),
sidebarLayout(
sidebarPanel(
h5("Prepared by Nathan Chan"),
h6("for the Shopify Summer 2022 Data Science Intern Challenge"),
br(),
helpText("Select your file and specify the settings below."),
fileInput(inputId = "xlsxfile", "Select Excel File", accept = ".xlsx", multiple = FALSE),
selectInput("selectedcolumn", "Choose Column (must be numeric)", choices = c("")),
br(),
textOutput("displaymax"),
tags$head(tags$style("#displaymax{color: blue; font-style: bold;}")),
textOutput("displaymin"),
tags$head(tags$style("#displaymin{color: blue; font-style: bold;}")),
textOutput("displayrechist"),
tags$head(tags$style("#displayrechist{color: blue; font-style: bold;}")),
br(),
helpText("Specify plot limits."),
numericInput("selectedmax", "X-axis Maximum", value = 1),
numericInput("selectedmin", "X-axis Minimum", value = 0),
br(),
helpText("Specify a histogram bin size appropriate for the range of the selected column."),
numericInput("selectedbinsize", "Histogram Bin Size", value = 1),
helpText("Specify a density plot scale factor appropriate for the range of the selected column. A good starting point is the recommended histogram bin size above; adjust from there."),
numericInput("selecteddensize", "Density Plot Scale Factor", value = 1)
# sliderInput("selectedbinsize", "Specify Histogram Bin Size", min = 0, max = 10, value = 5, step = 1),
# sliderInput("selecteddensize", "Specify Density Plot Scale Factor", min = 0, max = 10, value = 5, step = 1),
# sliderInput("selectedmax", "Specify X-axis Maximum", min = 0, max = 10, value = 5, step = 1),
# sliderInput("selectedmin", "Specify X-axis Minimum", min = 0, max = 10, value = 5, step = 1)
),
mainPanel(
tabsetPanel(
tabPanel(
title = "Data",
dataTableOutput(outputId = "displaydataTable")
),
tabPanel(
title = "Metrics",
dataTableOutput(outputId = "displaymetrics")
),
tabPanel(
title = "Plot",
plotlyOutput(outputId = "displayplot",
width = "100%",
height = "100%")
)
))
)
)
server <- function(input, output, session) {
xlsxdata <- reactive({
req(input$xlsxfile)
read_excel(input$xlsxfile$datapath)
})
output$displaymetrics <- renderDataTable({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata <- xlsxdata()
summ_stats(xlsxdata[[input$selectedcolumn]])
})
output$displaydataTable <- renderDataTable({xlsxdata()})
output$displaymax <- renderText({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata <- xlsxdata()
selectedcolumn <- as.character(input$selectedcolumn)
maxval <- max(xlsxdata[, selectedcolumn])
paste0("Maximum: ", maxval)
})
output$displaymin <- renderText({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata <- xlsxdata()
selectedcolumn <- as.character(input$selectedcolumn)
minval <- min(xlsxdata[, selectedcolumn])
paste0("Minimum: ", minval)
})
output$displayrechist <- renderText({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata_max <- input$selectedmax
xlsxdata_min <- input$selectedmin
xlsxdata_range <- xlsxdata_max - xlsxdata_min
paste0("Recommended Histogram Bin Size: ", xlsxdata_range * 0.01)
})
output$displayplot <- renderPlotly({
req(input$xlsxfile)
req(input$selectedcolumn)
req(input$selectedbinsize)
xlsxdata <- xlsxdata()
selectedcolumn <- input$selectedcolumn
summarystats <- summ_stats(xlsxdata[[selectedcolumn]])
binwidth <- input$selectedbinsize
binwidth2 <- input$selecteddensize
xmin <- input$selectedmin
xmax <- input$selectedmax
p <-
ggplot(data = xlsxdata, aes_string(x = selectedcolumn)) +
geom_histogram(
aes(color = "Histogram"),
fill = "springgreen4",
size = 0,
binwidth = binwidth) +
geom_density(
# A density curve integrates to 1, so its y-values are on a different
# scale than the histogram's counts. Rescaling the density by
# n * binwidth puts both layers on the same y-axis.
aes(y = ..density.. * nrow(xlsxdata) * binwidth2, color = "Density Plot"),
fill = "springgreen4",
size = 0,
alpha = 0.3
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Arithmetic Mean"), 2], color = "Arithmetic Mean"),
linetype = "longdash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Median"), 2], color = "Median"),
linetype = "dotdash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Mode"), 2], color = "Mode"),
linetype = "dotted",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Geometric Mean"), 2], color = "Geometric Mean"),
linetype = "twodash",
size = 0.25
) +
geom_vline(
aes(xintercept = summarystats[which(summarystats == "Harmonic Mean"), 2], color = "Harmonic Mean"),
linetype = "dashed",
size = 0.25
) +
labs(
y = "Count"
) +
scale_x_continuous(
labels = function(x)
format(x, scientific = FALSE),
guide = guide_legend(),
limits = c(xmin, xmax)
) +
scale_color_manual(
name = "",
values = c(
"Histogram" = "springgreen4",
"Density Plot" = "springgreen4",
"Arithmetic Mean" = "red",
"Median" = "blue",
"Mode" = "orange",
"Geometric Mean" = "green",
"Harmonic Mean" = "grey"
)
) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p) %>%
layout(
legend = list(
orientation = "h", x = 0.5, y = -0.25, xanchor = "center"))
})
observe({
req(input$xlsxfile)
xlsxdata <- xlsxdata()
updateSelectInput(
session = session,
inputId = "selectedcolumn",
choices = colnames(xlsxdata[, sapply(xlsxdata, is.numeric)])
)
})
observe({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata <- xlsxdata()
selectedcolumn <- as.character(input$selectedcolumn)
xlsxdata_max <- range(xlsxdata[, selectedcolumn])[2]
xlsxdata_min <- range(xlsxdata[, selectedcolumn])[1]
xlsxdata_range <- xlsxdata_max - xlsxdata_min
xlsxdata_rec <- (input$selectedmax - input$selectedmin) * 0.01
updateNumericInput(
session = session,
inputId = "selectedbinsize",
max = xlsxdata_range * 0.1,
min = xlsxdata_range * 0.001,
value = xlsxdata_rec,
step = xlsxdata_range * 0.001
)
updateNumericInput(
session = session,
inputId = "selecteddensize",
max = xlsxdata_range * 0.1,
min = xlsxdata_range * 0.0001,
value = xlsxdata_rec,
step = xlsxdata_range * 0.0001
)
# updateSliderInput(
# session = session,
# inputId = "selectedbinsize",
# max = xlsxdata_range * 0.1,
# min = xlsxdata_range * 0.001,
# value = xlsxdata_range * 0.01,
# step = xlsxdata_range * 0.001
# )
# updateSliderInput(
# session = session,
# inputId = "selecteddensize",
# max = xlsxdata_range * 0.1,
# min = xlsxdata_range * 0.0001,
# value = xlsxdata_range * 0.01,
# step = xlsxdata_range * 0.0001
# )
})
observe({
req(input$xlsxfile)
req(input$selectedcolumn)
xlsxdata <- xlsxdata()
selectedcolumn <- as.character(input$selectedcolumn)
xlsxdata_max <- range(xlsxdata[, selectedcolumn])[2]
xlsxdata_min <- range(xlsxdata[, selectedcolumn])[1]
xlsxdata_range <- xlsxdata_max - xlsxdata_min
updateNumericInput(
session = session,
inputId = "selectedmax",
max = xlsxdata_max,
min = xlsxdata_min,
value = xlsxdata_max,
step = xlsxdata_range * 0.001
)
updateNumericInput(
session = session,
inputId = "selectedmin",
max = xlsxdata_max,
min = xlsxdata_min,
value = xlsxdata_min,
step = xlsxdata_range * 0.001
)
# updateSliderInput(
# session = session,
# inputId = "selectedmax",
# max = xlsxdata_max,
# min = xlsxdata_min,
# value = xlsxdata_max,
# step = xlsxdata_range * 0.001
# )
# updateSliderInput(
# session = session,
# inputId = "selectedmin",
# max = xlsxdata_max,
# min = xlsxdata_min,
# value = xlsxdata_min,
# step = xlsxdata_range * 0.001
# )
})
}
shinyApp(ui, server)
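The app's recommended bin size (shown by displayrechist above) is simply 1% of the currently selected axis range. As a standalone function for reference:

```r
# 1% of the plotted range, matching the app's displayrechist heuristic
recommended_binsize <- function(xmin, xmax) (xmax - xmin) * 0.01
recommended_binsize(0, 1000)
```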