Window functions - a SQL versus R example

The problem
SQL solution
R (tidyverse) solution
Syntax comparison

I recently drilled down into window functions in SQL, so here’s a quick example comparing some of the syntax differences between SQL and R.

The problem

We’ll start with sequence of 10 orders with an order id and amount spent ($):

library(tidyverse)
library(dbplyr) #for simulating a database
library(slider) #for sliding window functions

sample_orders <- tibble(o_id = 101:110,
                        spent = round(runif(10, 5, 100), digits = 2))

o_id	spent
101	80.31
102	5.68
103	13.87
104	64.34
105	45.03
106	12.04
107	81.52
108	32.53
109	20.14
110	18.65

Now let’s say we want to calculate the following:

the total amount spent
the cumulative amount spent at each point in time (we assume id’s represent the sequence of events)
the cumulative average amount spent at each point in time
a moving average of amount spent at each point in time (+/- 2 orders)
4 even groups based on amount spent
percent rank based on amount spent

How would we do it in SQL? in R?

SQL solution

The SQL code looks like this:

SELECT o_id, 
spent,
sum(spent) OVER () AS sum_spent,
sum(spent) OVER (ORDER BY o_id) AS sum_spent_so_far, --assumes ROWS UNBOUNDED PRECEDING
avg(spent) OVER (ORDER BY o_id) AS avg_spent_so_far,
avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS moving_avg_spent,
ntile(4) OVER (ORDER BY spent) AS ntile,
percent_rank() OVER (ORDER BY spent) AS percent_rank
FROM sample_orders
ORDER BY o_id

Using the dbplyr package, we can run this SQL query in R to inspect the result.

# set up the sample_orders tibble as a table in a database in memory
sample_orders_db <- tbl_memdb(sample_orders)

# run the query and collect the results back into R
results_db <- tbl(src_memdb(), sql(sql_query)) %>% 
  collect() %>% 
  mutate_if(is.double, round, 2) # rounding for good measure

o_id	spent	sum_spent	sum_spent_so_far	avg_spent_so_far	moving_avg_spent	ntile	percent_rank
101	80.31	374.11	80.31	80.31	33.29	4	0.89
102	5.68	374.11	85.99	43	41.05	1	0
103	13.87	374.11	99.86	33.29	41.85	1	0.22
104	64.34	374.11	164.2	41.05	28.19	3	0.78
105	45.03	374.11	209.23	41.85	43.36	3	0.67
106	12.04	374.11	221.27	36.88	47.09	1	0.11
107	81.52	374.11	302.79	43.26	38.25	4	1
108	32.53	374.11	335.32	41.92	32.98	2	0.56
109	20.14	374.11	355.46	39.5	38.21	2	0.44
110	18.65	374.11	374.11	37.41	23.77	2	0.33

I’ve colored the values based on the range of each column to better see the patterns in the numbers:

sum_spent is the same value across all rows – this is normally condensed into a single row when summarized
sum_spent_so_far is a cumulative sum, so it increases with time. The last value is equivalent to the total sum_spent
avg_spent_so_far is a cumulative mean. The first value is equivalent to the spent of the first order, and the last value is equivalent to the sum_spent divided by the number of orders (374.11/10)
moving_avg_spent smooths out the pattern of the order spent
ntile divides our orders into 4 groups based on the amount spent
percent_rank is also based on amount spent but gives us percents

R (tidyverse) solution

In R, we write things a bit differently. I’ve used a mix of base (sum(), cumsum()) and dplyr functions (cummean(), ntile(), percent_rank()), and brought in slider to calculate the moving average using purrr-like syntax.

results_r <- sample_orders %>% 
  arrange(o_id) %>% # use this if order_by() is the same for all calculations
  mutate(sum_spent = sum(spent),
         sum_spent_so_far = cumsum(spent),
         avg_spent_so_far = cummean(spent),
         moving_avg_spent = slide_dbl(spent, mean, 
                                      .before = 2, .after = 2,
                                      .complete = FALSE),
         ntile = ntile(spent, 4),
         percent_rank = percent_rank(spent)) %>% 
  mutate_if(is.double, round, 2) # rounding for good measure

It looks pretty different – does this really give us the same result? Let’s compare the resulting data frames.

I would normally use dplyr::all_equal() but the diffdf package gives us a more detailed report on the differences so we’ll go with that this time. (Props to Sharla’s post on comparing dataframes for pointing me to it.)

diffdf::diffdf(results_db, results_r)

## Warning in diffdf::diffdf(results_db, results_r): 
## Not all Values Compared Equal

## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   =============================
##    Variable  No of Differences 
##   -----------------------------
##     ntile            1         
##   -----------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##     ntile          8         2       3    
##   ----------------------------------------

We see that there actually is a tiny difference in the ntile column! In the process of working on this post, I’d stumbled into a slight difference in the way ntile() distributes remainders across buckets in dplyr versus in a database. (Issue has been filed and already resolved in the dev version of dplyr!)

So, besides the small hiccup with ntile(), we see that it indeed is the same!

Syntax comparison

For the sake of being extra thorough, let’s examine the syntax differences directly with the table below.

variable	r	sql
sum_spent	sum(spent)	sum(spent) OVER ()
sum_spent_so_far	cumsum(spent)	sum(spent) OVER (ORDER BY o_id)
avg_spent_so_far	cummean(spent)	avg(spent) OVER (ORDER BY o_id)
moving_avg_spent	slide_dbl(spent, mean, .before = 2, .after = 2, .complete = FALSE)	avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
ntile	ntile(spent, 4)	ntile(4) OVER (ORDER BY spent)
percent_rank	percent_rank(spent)	percent_rank() OVER (ORDER BY spent)

We see that R in general is a bit more concise, but SQL’s syntax is repetitive but flexible. With the two keywords OVER() and ORDER BY, we can do almost everything.

SQL’s flexibility shines the most in the avg_spent_so_far versus moving_avg_spent example. The first calculation assumes ROWS UNBOUNDED PRECEDING as the window. To move from a cumulative mean to a moving average, we just tweak the window – the rest stays the same.

You may have noticed that the table above is actually missing R’s equivalents of OVER() and ORDER BY. Because they did not need to be specified for each variable separately, I was able to extract out all the ORDER BY o_ids to a single arrange(o_id) before the mutate(). For the ranking functions, I expressed the ORDER BY spent directly in the arguments. If I had needed to, I could have specified the ordering explicitly with order_by().

In my book, the ability to extract out common actions makes the tidyverse solution superior for this example. I imagine that with more complex windows with unique order-by’s, SQL may have a chance of winning me over.

Have an alternative solution? Feel free to share in the comments!