In my recent exploration of window functions, I realized didn’t really know the differences between rank functions. The dplyr documentation lists out six functions, of which I pretty much only use one (row_number()):

• row_number()
• ntile()
• min_rank()
• dense_rank()
• percent_rank()
• cume_dist()

Though the documentation description is relatively clear, it was still hard to grasp exactly how they differed. I found it easier to do the comparison visually.

## Trivia score ranks

Given a toy dataset of trivia scores from two teams, let’s see how the scores rank using the functions above.

library(tidyverse)
theme_set(theme_minimal())

set.seed(20200417) #so I get the same teams each time
trivia_scores <- tibble(team = sample(c("cheetah", "ibex"), 10, replace = TRUE),
score = c(3, 6, 6, 18, 39, 40, 40, 40, 42, 99))

window_rank_example <- trivia_scores %>%
mutate(ntile3 = ntile(score, 3),
min_rank = min_rank(score),
dense_rank = dense_rank(score),
percent_rank = percent_rank(score) %>% round(3),
cume_dist = cume_dist(score)) 
team score ntile3 min_rank dense_rank percent_rank cume_dist
ibex 3 1 1 1 0 0.1
ibex 6 1 2 2 0.111 0.3
ibex 6 1 2 2 0.111 0.3
cheetah 18 1 4 3 0.333 0.4
ibex 39 2 5 4 0.444 0.5
cheetah 40 2 6 5 0.556 0.8
ibex 40 2 6 5 0.556 0.8
ibex 40 3 6 5 0.556 0.8
cheetah 42 3 9 6 0.889 0.9
cheetah 99 3 10 7 1 1

From the patterns above we see that:

• the distances between scores don’t matter (39 to 40 is treated the same way as 42 to 99)
• all except for ntile() clump the three 40’s together into one team
• min_rank() reaches 10 (the total number of scores) whereas dense_rank() reaches 7 (the total number of distinct scores)
• percent_rank() is similar to min_rank() except that (a) it ranges from 0 to 1 and (b) each step is 0.111 instead of 1
• each step of cume_dist() is 1 / # scores (1/10 = 0.1 in this case) and is similar to min_rank() except that all ties get the maximum score

To dig into it even more, here’s another way of visualizing the ranks. Since ranks are only dependent on the order of values and not the value itself (except with repeats), I’ve used row number for the x-axis and added it as a light gray background as well.

We see that with the exception of ntile, all ranking functions group repeated numbers together (I colored the two 6’s in teal and the three 40’s in orange).

Since percent_rank() and cume_dist() reach a maximum of 1, I decided to plot them separately. The light gray background indicates equal increments of 0.1 across all 10 values.

## Ranking per group

Splitting by team requires just one extra line (and one more to make the resulting table look nicer):

window_rank_example_teams <- trivia_scores %>%
group_by(team) %>% # only change needed!
arrange(team, score) %>% # not required but easier to digest visually
mutate(ntile3 = ntile(score, 3),
min_rank = min_rank(score),
dense_rank = dense_rank(score),
percent_rank = percent_rank(score) %>% round(3),
cume_dist = cume_dist(score)) 
team score ntile3 min_rank dense_rank percent_rank cume_dist
cheetah 18 1 1 1 0 0.25
cheetah 40 1 2 2 0.333 0.5
cheetah 42 2 3 3 0.667 0.75
cheetah 99 3 4 4 1 1
ibex 3 1 1 1 0 0.166666666666667
ibex 6 1 2 2 0.2 0.5
ibex 6 2 2 2 0.2 0.5
ibex 39 2 4 3 0.6 0.666666666666667
ibex 40 3 5 4 0.8 1
ibex 40 3 5 4 0.8 1

Now, instead of overall ranks, the scores for each team are ranked separately.

## Reversing ranks

Perhaps you think it’s weird that the lowest trivia scores get the highest rank (1). To reverse the ranking order, we can use either desc() or just add a -.

window_rank_example_reversed <- trivia_scores %>%
mutate(ntile3 = ntile(-score, 3),
min_rank = min_rank(-score),
dense_rank = dense_rank(-score),
percent_rank = percent_rank(-score) %>% round(3),
cume_dist = cume_dist(-score)) 
team score ntile3 min_rank dense_rank percent_rank cume_dist
ibex 3 3 10 7 1 1
ibex 6 3 8 6 0.778 0.9
ibex 6 3 8 6 0.778 0.9
cheetah 18 2 7 5 0.667 0.7
ibex 39 2 6 4 0.556 0.6
cheetah 40 1 3 3 0.222 0.5
ibex 40 1 3 3 0.222 0.5
ibex 40 2 3 3 0.222 0.5
cheetah 42 1 2 2 0.111 0.2
cheetah 99 1 1 1 0 0.1

## Conclusion

And that’s all there is to it! Hope the visual aids help to reinforce the differences in the options.