I recently drilled down into window functions in SQL, so here’s a quick example comparing some of the syntax differences between SQL and R.
We’ll start with sequence of 10 orders with an order id and amount spent ($):
library(tidyverse) library(dbplyr) #for simulating a database library(slider) #for sliding window functions sample_orders <- tibble(o_id = 101:110, spent = round(runif(10, 5, 100), digits = 2))
Now let’s say we want to calculate the following:
- the total amount spent
- the cumulative amount spent at each point in time (we assume id’s represent the sequence of events)
- the cumulative average amount spent at each point in time
- a moving average of amount spent at each point in time (+/- 2 orders)
- 4 even groups based on amount spent
- percent rank based on amount spent
How would we do it in SQL? in R?
The SQL code looks like this:
SELECT o_id, spent, sum(spent) OVER () AS sum_spent, sum(spent) OVER (ORDER BY o_id) AS sum_spent_so_far, --assumes ROWS UNBOUNDED PRECEDING avg(spent) OVER (ORDER BY o_id) AS avg_spent_so_far, avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS moving_avg_spent, ntile(4) OVER (ORDER BY spent) AS ntile, percent_rank() OVER (ORDER BY spent) AS percent_rank FROM sample_orders ORDER BY o_id
dbplyr package, we can run this SQL query in R to inspect the result.
# set up the sample_orders tibble as a table in a database in memory sample_orders_db <- tbl_memdb(sample_orders) # run the query and collect the results back into R results_db <- tbl(src_memdb(), sql(sql_query)) %>% collect() %>% mutate_if(is.double, round, 2) # rounding for good measure
I’ve colored the values based on the range of each column to better see the patterns in the numbers:
sum_spentis the same value across all rows – this is normally condensed into a single row when summarized
sum_spent_so_faris a cumulative sum, so it increases with time. The last value is equivalent to the total
avg_spent_so_faris a cumulative mean. The first value is equivalent to the
spentof the first order, and the last value is equivalent to the
sum_spentdivided by the number of orders (374.11/10)
moving_avg_spentsmooths out the pattern of the order
ntiledivides our orders into 4 groups based on the amount spent
percent_rankis also based on amount spent but gives us percents
R (tidyverse) solution
In R, we write things a bit differently. I’ve used a mix of base (
dplyr functions (
percent_rank()), and brought in
slider to calculate the moving average using
results_r <- sample_orders %>% arrange(o_id) %>% # use this if order_by() is the same for all calculations mutate(sum_spent = sum(spent), sum_spent_so_far = cumsum(spent), avg_spent_so_far = cummean(spent), moving_avg_spent = slide_dbl(spent, mean, .before = 2, .after = 2, .complete = FALSE), ntile = ntile(spent, 4), percent_rank = percent_rank(spent)) %>% mutate_if(is.double, round, 2) # rounding for good measure
It looks pretty different – does this really give us the same result? Let’s compare the resulting data frames.
I would normally use
dplyr::all_equal() but the
diffdf package gives us a more detailed report on the differences so we’ll go with that this time. (Props to Sharla’s post on comparing dataframes for pointing me to it.)
## Warning in diffdf::diffdf(results_db, results_r): ## Not all Values Compared Equal
## Differences found between the objects! ## ## A summary is given below. ## ## Not all Values Compared Equal ## All rows are shown in table below ## ## ============================= ## Variable No of Differences ## ----------------------------- ## ntile 1 ## ----------------------------- ## ## ## All rows are shown in table below ## ## ======================================== ## VARIABLE ..ROWNUMBER.. BASE COMPARE ## ---------------------------------------- ## ntile 8 2 3 ## ----------------------------------------
We see that there actually is a tiny difference in the
ntile column! In the process of working on this post, I’d stumbled into a slight difference in the way
ntile() distributes remainders across buckets in dplyr versus in a database. (Issue has been filed and already resolved in the dev version of dplyr!)
So, besides the small hiccup with
ntile(), we see that it indeed is the same!
For the sake of being extra thorough, let’s examine the syntax differences directly with the table below.
|sum_spent||sum(spent)||sum(spent) OVER ()|
|sum_spent_so_far||cumsum(spent)||sum(spent) OVER (ORDER BY o_id)|
|avg_spent_so_far||cummean(spent)||avg(spent) OVER (ORDER BY o_id)|
|moving_avg_spent||slide_dbl(spent, mean, .before = 2, .after = 2, .complete = FALSE)||avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)|
|ntile||ntile(spent, 4)||ntile(4) OVER (ORDER BY spent)|
|percent_rank||percent_rank(spent)||percent_rank() OVER (ORDER BY spent)|
We see that R in general is a bit more concise, but SQL’s syntax is repetitive but flexible. With the two keywords
ORDER BY, we can do almost everything.
SQL’s flexibility shines the most in the
moving_avg_spent example. The first calculation assumes
ROWS UNBOUNDED PRECEDING as the window. To move from a cumulative mean to a moving average, we just tweak the window – the rest stays the same.
You may have noticed that the table above is actually missing R’s equivalents of
ORDER BY. Because they did not need to be specified for each variable separately, I was able to extract out all the
ORDER BY o_ids to a single
arrange(o_id) before the
mutate(). For the ranking functions, I expressed the
ORDER BY spent directly in the arguments. If I had needed to, I could have specified the ordering explicitly with
In my book, the ability to extract out common actions makes the tidyverse solution superior for this example. I imagine that with more complex windows with unique order-by’s, SQL may have a chance of winning me over.
Have an alternative solution? Feel free to share in the comments!