In my recent exploration of window functions, I realized didn’t really know the differences between rank functions. The dplyr documentation lists out six functions, of which I pretty much only use one (`row_number()`

):

`row_number()`

`ntile()`

`min_rank()`

`dense_rank()`

`percent_rank()`

`cume_dist()`

Though the documentation description is relatively clear, it was still hard to grasp exactly how they differed. I found it easier to do the comparison visually.

## Trivia score ranks

Given a toy dataset of trivia scores from two teams, let’s see how the scores rank using the functions above.

```
library(tidyverse)
theme_set(theme_minimal())
set.seed(20200417) #so I get the same teams each time
trivia_scores <- tibble(team = sample(c("cheetah", "ibex"), 10, replace = TRUE),
score = c(3, 6, 6, 18, 39, 40, 40, 40, 42, 99))
window_rank_example <- trivia_scores %>%
mutate(ntile3 = ntile(score, 3),
min_rank = min_rank(score),
dense_rank = dense_rank(score),
percent_rank = percent_rank(score) %>% round(3),
cume_dist = cume_dist(score))
```

team | score | ntile3 | min_rank | dense_rank | percent_rank | cume_dist |
---|---|---|---|---|---|---|

ibex | 3 | 1 | 1 | 1 | 0 | 0.1 |

ibex | 6 | 1 | 2 | 2 | 0.111 | 0.3 |

ibex | 6 | 1 | 2 | 2 | 0.111 | 0.3 |

cheetah | 18 | 1 | 4 | 3 | 0.333 | 0.4 |

ibex | 39 | 2 | 5 | 4 | 0.444 | 0.5 |

cheetah | 40 | 2 | 6 | 5 | 0.556 | 0.8 |

ibex | 40 | 2 | 6 | 5 | 0.556 | 0.8 |

ibex | 40 | 3 | 6 | 5 | 0.556 | 0.8 |

cheetah | 42 | 3 | 9 | 6 | 0.889 | 0.9 |

cheetah | 99 | 3 | 10 | 7 | 1 | 1 |

From the patterns above we see that:

- the distances between scores don’t matter (39 to 40 is treated the same way as 42 to 99)
- all except for
`ntile()`

clump the three 40’s together into one team `min_rank()`

reaches 10 (the total number of scores) whereas`dense_rank()`

reaches 7 (the total number of*distinct*scores)`percent_rank()`

is similar to`min_rank()`

except that (a) it ranges from 0 to 1 and (b) each step is 0.111 instead of 1- each step of
`cume_dist()`

is 1 / # scores (1/10 = 0.1 in this case) and is similar to`min_rank()`

except that all ties get the*maximum*score

To dig into it even more, here’s another way of visualizing the ranks. Since ranks are only dependent on the *order* of values and not the value itself (except with repeats), I’ve used row number for the x-axis and added it as a light gray background as well.

We see that with the exception of `ntile`

, all ranking functions group repeated numbers together (I colored the two 6’s in teal and the three 40’s in orange).

Since `percent_rank()`

and `cume_dist()`

reach a maximum of 1, I decided to plot them separately. The light gray background indicates equal increments of 0.1 across all 10 values.

## Ranking per group

Splitting by team requires just one extra line (and one more to make the resulting table look nicer):

```
window_rank_example_teams <- trivia_scores %>%
group_by(team) %>% # only change needed!
arrange(team, score) %>% # not required but easier to digest visually
mutate(ntile3 = ntile(score, 3),
min_rank = min_rank(score),
dense_rank = dense_rank(score),
percent_rank = percent_rank(score) %>% round(3),
cume_dist = cume_dist(score))
```

team | score | ntile3 | min_rank | dense_rank | percent_rank | cume_dist |
---|---|---|---|---|---|---|

cheetah | 18 | 1 | 1 | 1 | 0 | 0.25 |

cheetah | 40 | 1 | 2 | 2 | 0.333 | 0.5 |

cheetah | 42 | 2 | 3 | 3 | 0.667 | 0.75 |

cheetah | 99 | 3 | 4 | 4 | 1 | 1 |

ibex | 3 | 1 | 1 | 1 | 0 | 0.166666666666667 |

ibex | 6 | 1 | 2 | 2 | 0.2 | 0.5 |

ibex | 6 | 2 | 2 | 2 | 0.2 | 0.5 |

ibex | 39 | 2 | 4 | 3 | 0.6 | 0.666666666666667 |

ibex | 40 | 3 | 5 | 4 | 0.8 | 1 |

ibex | 40 | 3 | 5 | 4 | 0.8 | 1 |

Now, instead of overall ranks, the scores for each team are ranked separately.

## Reversing ranks

Perhaps you think it’s weird that the lowest trivia scores get the highest rank (1). To reverse the ranking order, we can use either `desc()`

or just add a `-`

.

```
window_rank_example_reversed <- trivia_scores %>%
mutate(ntile3 = ntile(-score, 3),
min_rank = min_rank(-score),
dense_rank = dense_rank(-score),
percent_rank = percent_rank(-score) %>% round(3),
cume_dist = cume_dist(-score))
```

team | score | ntile3 | min_rank | dense_rank | percent_rank | cume_dist |
---|---|---|---|---|---|---|

ibex | 3 | 3 | 10 | 7 | 1 | 1 |

ibex | 6 | 3 | 8 | 6 | 0.778 | 0.9 |

ibex | 6 | 3 | 8 | 6 | 0.778 | 0.9 |

cheetah | 18 | 2 | 7 | 5 | 0.667 | 0.7 |

ibex | 39 | 2 | 6 | 4 | 0.556 | 0.6 |

cheetah | 40 | 1 | 3 | 3 | 0.222 | 0.5 |

ibex | 40 | 1 | 3 | 3 | 0.222 | 0.5 |

ibex | 40 | 2 | 3 | 3 | 0.222 | 0.5 |

cheetah | 42 | 1 | 2 | 2 | 0.111 | 0.2 |

cheetah | 99 | 1 | 1 | 1 | 0 | 0.1 |

## Conclusion

And that’s all there is to it! Hope the visual aids help to reinforce the differences in the options.