I recently drilled down into window functions in SQL, so here’s a quick example comparing some of the syntax differences between SQL and R.

## The problem

We’ll start with sequence of 10 orders with an order id and amount spent ($):

```
library(tidyverse)
library(dbplyr) #for simulating a database
library(slider) #for sliding window functions
sample_orders <- tibble(o_id = 101:110,
spent = round(runif(10, 5, 100), digits = 2))
```

o_id | spent |
---|---|

101 | 80.31 |

102 | 5.68 |

103 | 13.87 |

104 | 64.34 |

105 | 45.03 |

106 | 12.04 |

107 | 81.52 |

108 | 32.53 |

109 | 20.14 |

110 | 18.65 |

Now let’s say we want to calculate the following:

- the total amount spent
- the cumulative amount spent at each point in time (we assume id’s represent the sequence of events)
- the cumulative average amount spent at each point in time
- a moving average of amount spent at each point in time (+/- 2 orders)
- 4 even groups based on amount spent
- percent rank based on amount spent

How would we do it in SQL? in R?

## SQL solution

The SQL code looks like this:

```
SELECT o_id,
spent,
sum(spent) OVER () AS sum_spent,
sum(spent) OVER (ORDER BY o_id) AS sum_spent_so_far, --assumes ROWS UNBOUNDED PRECEDING
avg(spent) OVER (ORDER BY o_id) AS avg_spent_so_far,
avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS moving_avg_spent,
ntile(4) OVER (ORDER BY spent) AS ntile,
percent_rank() OVER (ORDER BY spent) AS percent_rank
FROM sample_orders
ORDER BY o_id
```

Using the `dbplyr`

package, we can run this SQL query in R to inspect the result.

```
# set up the sample_orders tibble as a table in a database in memory
sample_orders_db <- tbl_memdb(sample_orders)
# run the query and collect the results back into R
results_db <- tbl(src_memdb(), sql(sql_query)) %>%
collect() %>%
mutate_if(is.double, round, 2) # rounding for good measure
```

o_id | spent | sum_spent | sum_spent_so_far | avg_spent_so_far | moving_avg_spent | ntile | percent_rank |
---|---|---|---|---|---|---|---|

101 | 80.31 | 374.11 | 80.31 | 80.31 | 33.29 | 4 | 0.89 |

102 | 5.68 | 374.11 | 85.99 | 43 | 41.05 | 1 | 0 |

103 | 13.87 | 374.11 | 99.86 | 33.29 | 41.85 | 1 | 0.22 |

104 | 64.34 | 374.11 | 164.2 | 41.05 | 28.19 | 3 | 0.78 |

105 | 45.03 | 374.11 | 209.23 | 41.85 | 43.36 | 3 | 0.67 |

106 | 12.04 | 374.11 | 221.27 | 36.88 | 47.09 | 1 | 0.11 |

107 | 81.52 | 374.11 | 302.79 | 43.26 | 38.25 | 4 | 1 |

108 | 32.53 | 374.11 | 335.32 | 41.92 | 32.98 | 2 | 0.56 |

109 | 20.14 | 374.11 | 355.46 | 39.5 | 38.21 | 2 | 0.44 |

110 | 18.65 | 374.11 | 374.11 | 37.41 | 23.77 | 2 | 0.33 |

I’ve colored the values based on the range of **each column** to better see the patterns in the numbers:

`sum_spent`

is the same value across all rows – this is normally condensed into a single row when summarized`sum_spent_so_far`

is a cumulative sum, so it increases with time. The last value is equivalent to the total`sum_spent`

`avg_spent_so_far`

is a cumulative mean. The first value is equivalent to the`spent`

of the first order, and the last value is equivalent to the`sum_spent`

divided by the number of orders (374.11/10)`moving_avg_spent`

smooths out the pattern of the order`spent`

`ntile`

divides our orders into 4 groups based on the amount spent`percent_rank`

is also based on amount spent but gives us percents

## R (tidyverse) solution

In R, we write things a bit differently. I’ve used a mix of base (`sum()`

, `cumsum()`

) and `dplyr`

functions (`cummean()`

, `ntile()`

, `percent_rank()`

), and brought in `slider`

to calculate the moving average using `purrr`

-like syntax.

```
results_r <- sample_orders %>%
arrange(o_id) %>% # use this if order_by() is the same for all calculations
mutate(sum_spent = sum(spent),
sum_spent_so_far = cumsum(spent),
avg_spent_so_far = cummean(spent),
moving_avg_spent = slide_dbl(spent, mean,
.before = 2, .after = 2,
.complete = FALSE),
ntile = ntile(spent, 4),
percent_rank = percent_rank(spent)) %>%
mutate_if(is.double, round, 2) # rounding for good measure
```

It looks pretty different – does this really give us the same result? Let’s compare the resulting data frames.

I would normally use `dplyr::all_equal()`

but the `diffdf`

package gives us a more detailed report on the differences so we’ll go with that this time. (Props to Sharla’s post on comparing dataframes for pointing me to it.)

`diffdf::diffdf(results_db, results_r)`

```
## Warning in diffdf::diffdf(results_db, results_r):
## Not all Values Compared Equal
```

```
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## =============================
## Variable No of Differences
## -----------------------------
## ntile 1
## -----------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## ntile 8 2 3
## ----------------------------------------
```

We see that there actually *is* a tiny difference in the `ntile`

column! In the process of working on this post, I’d stumbled into a slight difference in the way `ntile()`

distributes remainders across buckets in dplyr versus in a database. (Issue has been filed and already resolved in the dev version of dplyr!)

So, besides the small hiccup with `ntile()`

, we see that **it indeed is the same**!

## Syntax comparison

For the sake of being extra thorough, let’s examine the syntax differences directly with the table below.

variable | r | sql |
---|---|---|

sum_spent | sum(spent) | sum(spent) OVER () |

sum_spent_so_far | cumsum(spent) | sum(spent) OVER (ORDER BY o_id) |

avg_spent_so_far | cummean(spent) | avg(spent) OVER (ORDER BY o_id) |

moving_avg_spent | slide_dbl(spent, mean, .before = 2, .after = 2, .complete = FALSE) | avg(spent) OVER (ORDER BY o_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) |

ntile | ntile(spent, 4) | ntile(4) OVER (ORDER BY spent) |

percent_rank | percent_rank(spent) | percent_rank() OVER (ORDER BY spent) |

We see that R in general is a bit more concise, but SQL’s syntax is repetitive but flexible. With the two keywords `OVER()`

and `ORDER BY`

, we can do almost everything.

SQL’s flexibility shines the most in the `avg_spent_so_far`

versus `moving_avg_spent`

example. The first calculation assumes `ROWS UNBOUNDED PRECEDING`

as the window. To move from a cumulative mean to a moving average, we just tweak the window – the rest stays the same.

You may have noticed that the table above is actually missing R’s equivalents of `OVER()`

and `ORDER BY`

. Because they did not need to be specified for each variable separately, I was able to extract out all the `ORDER BY o_id`

s to a single `arrange(o_id)`

before the `mutate()`

. For the ranking functions, I expressed the `ORDER BY spent`

directly in the arguments. If I had needed to, I could have specified the ordering explicitly with `order_by()`

.

In my book, the ability to extract out common actions makes the tidyverse solution superior for this example. I imagine that with more complex windows with unique order-by’s, SQL may have a chance of winning me over.

Have an alternative solution? Feel free to share in the comments!