I recently drilled down into window functions in SQL, so here’s a quick example comparing some of the syntax differences between SQL and R.
## The problem
We’ll start with a sequence of 10 orders, each with an order id and an amount spent ($):
```r
library(tidyverse)
library(dbplyr)  # for simulating a database
library(slider)  # for sliding window functions

sample_orders <- tibble(o_id = 101:110,
                        spent = round(runif(10, 5, 100), digits = 2))
```
o_id | spent |
---|---|
101 | 80.31 |
102 | 5.68 |
103 | 13.87 |
104 | 64.34 |
105 | 45.03 |
106 | 12.04 |
107 | 81.52 |
108 | 32.53 |
109 | 20.14 |
110 | 18.65 |
Now let’s say we want to calculate the following:
- the total amount spent
- the cumulative amount spent at each point in time (we assume the order ids represent the sequence of events)
- the cumulative average amount spent at each point in time
- a moving average of amount spent at each point in time (+/- 2 orders)
- 4 even groups based on amount spent
- percent rank based on amount spent
How would we do it in SQL? In R?
## SQL solution
The SQL code looks like this:
```sql
SELECT o_id,
       spent,
       sum(spent) OVER () AS sum_spent,
       sum(spent) OVER (ORDER BY o_id) AS sum_spent_so_far, -- assumes ROWS UNBOUNDED PRECEDING
       avg(spent) OVER (ORDER BY o_id) AS avg_spent_so_far,
       avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS moving_avg_spent,
       ntile(4) OVER (ORDER BY spent) AS ntile,
       percent_rank() OVER (ORDER BY spent) AS percent_rank
FROM sample_orders
ORDER BY o_id
```
Using the `dbplyr` package, we can run this SQL query in R to inspect the result.
```r
# set up the sample_orders tibble as a table in an in-memory database
sample_orders_db <- tbl_memdb(sample_orders)

# run the query (stored as a string in sql_query) and collect the results back into R
results_db <- tbl(src_memdb(), sql(sql_query)) %>%
  collect() %>%
  mutate_if(is.double, round, 2) # rounding for good measure
```
o_id | spent | sum_spent | sum_spent_so_far | avg_spent_so_far | moving_avg_spent | ntile | percent_rank |
---|---|---|---|---|---|---|---|
101 | 80.31 | 374.11 | 80.31 | 80.31 | 33.29 | 4 | 0.89 |
102 | 5.68 | 374.11 | 85.99 | 43 | 41.05 | 1 | 0 |
103 | 13.87 | 374.11 | 99.86 | 33.29 | 41.85 | 1 | 0.22 |
104 | 64.34 | 374.11 | 164.2 | 41.05 | 28.19 | 3 | 0.78 |
105 | 45.03 | 374.11 | 209.23 | 41.85 | 43.36 | 3 | 0.67 |
106 | 12.04 | 374.11 | 221.27 | 36.88 | 47.09 | 1 | 0.11 |
107 | 81.52 | 374.11 | 302.79 | 43.26 | 38.25 | 4 | 1 |
108 | 32.53 | 374.11 | 335.32 | 41.92 | 32.98 | 2 | 0.56 |
109 | 20.14 | 374.11 | 355.46 | 39.5 | 38.21 | 2 | 0.44 |
110 | 18.65 | 374.11 | 374.11 | 37.41 | 23.77 | 2 | 0.33 |
Looking at the range of values in each column helps us see the patterns in the numbers:
- `sum_spent` is the same value across all rows – this is normally condensed into a single row when summarized
- `sum_spent_so_far` is a cumulative sum, so it increases with time. The last value is equivalent to the total `sum_spent`
- `avg_spent_so_far` is a cumulative mean. The first value is equivalent to the `spent` of the first order, and the last value is equivalent to the `sum_spent` divided by the number of orders (374.11 / 10; see the quick check below)
- `moving_avg_spent` smooths out the pattern of the order `spent`
- `ntile` divides our orders into 4 groups based on the amount spent
- `percent_rank` is also based on amount spent but gives us percents
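As a quick sanity check of that last relationship, here is a minimal sketch (not part of the original analysis) using the sample data from above:

```r
# the last cumulative mean should equal the total spent divided by the number of orders,
# so this comparison should return TRUE
near(last(cummean(sample_orders$spent)),
     sum(sample_orders$spent) / nrow(sample_orders))
```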
## R (tidyverse) solution
In R, we write things a bit differently. I’ve used a mix of base (`sum()`, `cumsum()`) and `dplyr` functions (`cummean()`, `ntile()`, `percent_rank()`), and brought in `slider` to calculate the moving average using `purrr`-like syntax.
```r
results_r <- sample_orders %>%
  arrange(o_id) %>% # use this if order_by() is the same for all calculations
  mutate(sum_spent = sum(spent),
         sum_spent_so_far = cumsum(spent),
         avg_spent_so_far = cummean(spent),
         moving_avg_spent = slide_dbl(spent, mean,
                                      .before = 2, .after = 2,
                                      .complete = FALSE),
         ntile = ntile(spent, 4),
         percent_rank = percent_rank(spent)) %>%
  mutate_if(is.double, round, 2) # rounding for good measure
```
It looks pretty different – does this really give us the same result? Let’s compare the resulting data frames.
I would normally use `dplyr::all_equal()`, but the `diffdf` package gives us a more detailed report on the differences, so we’ll go with that this time. (Props to Sharla’s post on comparing dataframes for pointing me to it.)
```r
diffdf::diffdf(results_db, results_r)
## Warning in diffdf::diffdf(results_db, results_r):
## Not all Values Compared Equal
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## =============================
##  Variable  No of Differences
## -----------------------------
##   ntile            1
## -----------------------------
##
##
## All rows are shown in table below
##
## ========================================
##  VARIABLE  ..ROWNUMBER..  BASE  COMPARE
## ----------------------------------------
##   ntile          8         2       3
## ----------------------------------------
```
We see that there actually is a tiny difference in the `ntile` column! In the process of working on this post, I stumbled into a slight difference in the way `ntile()` distributes remainders across buckets in dplyr versus in a database. (An issue has been filed and already resolved in the dev version of dplyr!)
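To make the remainder question concrete, here is a minimal sketch (not from the original post) of the SQL-standard behaviour: with 10 rows and 4 buckets, the two leftover rows go into the first buckets, giving sizes 3, 3, 2, 2, which is what the patched `ntile()` should also produce.

```r
# SQL's NTILE() splits 10 ranked rows into buckets of sizes 3, 3, 2, 2;
# with the fix, dplyr's ntile() should assign buckets the same way
ntile(1:10, 4)
# expected: 1 1 1 2 2 2 3 3 4 4
```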
So, besides the small hiccup with `ntile()`, we see that it is indeed the same!
## Syntax comparison
For the sake of being extra thorough, let’s examine the syntax differences directly with the table below.
variable | R | SQL |
---|---|---|
sum_spent | sum(spent) | sum(spent) OVER () |
sum_spent_so_far | cumsum(spent) | sum(spent) OVER (ORDER BY o_id) |
avg_spent_so_far | cummean(spent) | avg(spent) OVER (ORDER BY o_id) |
moving_avg_spent | slide_dbl(spent, mean, .before = 2, .after = 2, .complete = FALSE) | avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) |
ntile | ntile(spent, 4) | ntile(4) OVER (ORDER BY spent) |
percent_rank | percent_rank(spent) | percent_rank() OVER (ORDER BY spent) |
We see that R is generally a bit more concise, while SQL’s syntax is more repetitive but also more flexible. With the two keywords `OVER()` and `ORDER BY`, we can do almost everything.
SQL’s flexibility shines the most in the `avg_spent_so_far` versus `moving_avg_spent` example. The first calculation assumes `ROWS UNBOUNDED PRECEDING` as the window. To move from a cumulative mean to a moving average, we just tweak the window – the rest stays the same.
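The same tweak is available from R when working against a database. As a hedged sketch (not part of the original post), `dbplyr` exposes the window directly: `window_order()` supplies the `ORDER BY` and `window_frame()` supplies the `ROWS BETWEEN`, so switching from a cumulative mean to a moving average is again just a change of window.

```r
# a sketch using the in-memory table set up earlier
sample_orders_db %>%
  window_order(o_id) %>%
  window_frame(-2, 2) %>% # ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
  mutate(moving_avg_spent = mean(spent, na.rm = TRUE)) %>%
  show_query()
```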
You may have noticed that the table above is actually missing R’s equivalents of `OVER()` and `ORDER BY`. Because they did not need to be specified for each variable separately, I was able to extract all of the `ORDER BY o_id`s out into a single `arrange(o_id)` before the `mutate()`. For the ranking functions, I expressed the `ORDER BY spent` directly in the arguments. If I had needed to, I could have specified the ordering explicitly with `order_by()`, as in the sketch below.
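For completeness, here is a minimal sketch (not from the original post) of what that explicit version could look like; `order_by()` wraps a single windowed calculation, much like `ORDER BY` inside `OVER()`:

```r
# assumption for illustration: the rows are not already arranged by o_id
sample_orders %>%
  mutate(sum_spent_so_far = order_by(o_id, cumsum(spent)))
```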
In my book, the ability to extract out common actions makes the tidyverse solution superior for this example. I imagine that with more complex windows with unique order-by’s, SQL may have a chance of winning me over.
Have an alternative solution? Feel free to share in the comments!