Every once in a while I complain on Twitter when I try to mix non-English letters with R. I am certainly not the first person to be frustrated by encoding issues, though I am (maybe too) hopeful that the problems won’t last for much longer. We live in the age of vacuum bots and 3D-printing, so what makes multi-language support so complicated?
Trying to mix Hebrew with #rstats is a bit of a nightmare, but at least it led me to this amazing "String encoding and R" blogpost by @kevin_ushey. It clarifies a lot! https://t.co/uNfFdTynGm (Irene Steves, @i_steves, October 24, 2018)
Warning: I have no solutions in this blogpost. I’ve simply amassed my encoding knowledge (mostly from GitHub issues and from explanations/demos at the Tidyverse Developer Day) into a single blogpost.
A history lesson
Once upon a time, computer scientists needed a way to store characters as bits (1’s and 0’s). So, they came up with several systems. In the early 90’s, some developers proposed UTF-8, a system that struck a balance between storage and support for many character sets (alphabets/characters in different languages). Unfortunately, the rise of UTF-8 occurred only after the establishment of core Windows systems, which were based on a different Unicode system.[1] To this day, Windows does not yet have full UTF-8 support, although Linux-based and web systems have long since hopped on the UTF-8 train.
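To make the storage trade-off a little more concrete, here is a minimal sketch that counts how many bytes a few characters occupy in UTF-8 and, for comparison, in one other Unicode encoding. The choice of UTF-16LE as the comparison and all variable names are my own illustration.

```r
# A handful of characters, written with escapes so the script itself stays ASCII.
chars <- c(
  latin_a      = "a",
  e_circumflex = "\u00ea",
  hebrew_het   = "\u05d7",
  euro_sign    = "\u20ac",
  grinning     = "\U0001F600"
)

# Bytes per character as stored in UTF-8.
utf8_bytes <- vapply(chars, function(ch) length(charToRaw(ch)), integer(1))

# Bytes per character after converting to UTF-16LE (comparison encoding is an assumption).
utf16_bytes <- lengths(iconv(chars, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))

print(rbind(utf8_bytes, utf16_bytes))
```

Plain ASCII costs one byte in UTF-8, while characters further along in Unicode cost two to four.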
Encodings in R may not have been so bad had the default encoding in base R not been native.enc. Rather than forcing UTF-8 on its users, many base R functions translate inputs into the native encoding, whether you ask them to or not. This means that any characters that cannot be represented in the computer’s native encoding become garbled. Those who use multiple languages (and yes, emojis count) quickly find that encoding bugs are, as Joshua Goldberg put it, “quite annoying and a time sink with little value gained after you make it out alive.”
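Here is a rough sketch of what such a translation does to the underlying bytes. Latin-1 stands in for a hypothetical native encoding, and iconv() is used to make the conversion explicit; base R functions that translate implicitly may behave differently depending on platform and R version.

```r
# A UTF-8 string: escapes keep the source file ASCII-only.
crepe <- "Cr\u00eape"
Encoding(crepe)            # how R has marked the string
charToRaw(crepe)           # the e-circumflex takes two bytes in UTF-8

# Translate to Latin-1, standing in for a (hypothetical) native encoding.
crepe_latin1 <- iconv(crepe, from = "UTF-8", to = "latin1")
Encoding(crepe_latin1)
charToRaw(crepe_latin1)    # the same letter is now a single byte

# Hebrew has no representation in Latin-1, so the conversion cannot succeed.
# With the default sub = NA, iconv() typically returns NA here.
hummus <- "\u05d7\u05d5\u05de\u05d5\u05e1"
hummus_latin1 <- iconv(hummus, from = "UTF-8", to = "latin1")
print(hummus_latin1)
```

The first conversion is lossless because Latin-1 has a slot for the accented letter. The second has nowhere to put the Hebrew letters, which is where garbled output or missing values come from.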
RStudio is HTML
If you right-click almost anywhere in RStudio, you’ll have an Inspect option available to you. Click it, and a Web Inspector window will pop up. Here’s what the beginning of it looks like:
Okay, so if RStudio runs in HTML and can specify UTF-8, why do we still run into problems?
Reading, wRiting, and pRogramming
R and RStudio do not exist in isolation. Much of the time, we use them to read in files, write to new files, or do various programmatic conversions.
Take this simple example. We want to save a dataframe that includes non-English characters. The base R function write.csv() uses the system’s native encoding by default, whereas readr’s write_csv() supports only UTF-8.
library(readr)
df <- data.frame(food = c("Crêpe", "Spätzle", "Smørrebrød", "חומוס"))
write.csv(df, "native.csv", row.names = FALSE)
write_csv(df, "utf8.csv")
Everything looks fine when we use the corresponding read functions on the files, but if we switch them around, we run into problems:
##         food
## 1      Crêpe
## 2    Spätzle
## 3 Smørrebrød
## 4      חומוס

## Parsed with column specification:
## cols(
##   food = col_character()
## )

## # A tibble: 4 x 1
##   food
##   <chr>
## 1 Crêpe
## 2 Spätzle
## 3 Smørrebrød
## 4 חומוס
We can work around these problems using the fileEncoding parameter in read.csv() or the locale parameter in read_csv(). However, some conversion processes rely on base-R commands that translate to/from native encodings, resulting in “forced round-trips.” Often, there is no workaround unless you dig into C. Many rendering functions, such as reprex::reprex(), get stuck because of this combination of base R defaults and lack of Windows UTF-8 support.
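As a minimal sketch of the fileEncoding workaround, the snippet below writes a small Latin-1 file and reads it back while declaring its encoding. It assumes an R session running in a UTF-8 locale. Latin-1 is only a stand-in for whatever the native encoding happens to be, the file is a temporary one, and the Hebrew row is left out because Latin-1 cannot represent it.

```r
foods <- c("Cr\u00eape", "Sp\u00e4tzle", "Sm\u00f8rrebr\u00f8d")
path <- tempfile(fileext = ".csv")

# Write the file byte-for-byte as Latin-1, mimicking what write.csv() might
# produce on a system whose native encoding is Latin-1 (an assumption).
latin1_lines <- iconv(c("food", foods), from = "UTF-8", to = "latin1")
writeLines(latin1_lines, path, useBytes = TRUE)

# Declaring the file's encoding lets read.csv() re-encode the text on the way in.
fixed <- read.csv(path, fileEncoding = "latin1")
print(fixed)

# The readr counterpart (not run here, since it needs the readr package):
# readr::read_csv(path, locale = readr::locale(encoding = "latin1"))
```

The catch is that you need to know, or guess, the encoding the file was written in. Neither function can reliably work that out for you.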
Beyond functions, packages have a few extra sets of restrictions. Special (non-English) characters are not allowed in package names, nor do they always display properly in search results. Take a look at Colin Fay’s proustr package documentation, for example. The help table of contents is garbled, but the help pages themselves are mostly fine.
Enter right-to-left (RTL) languages
Things sometimes get trickier when you work with right-to-left (RTL) languages. By RTL, I mean that most of the language is written from right to left, but numbers, URLs, code, and the like are often still left-to-right (or in English characters). Fortunately, there are standards for bidirectional, or “BiDi”, text. In fact, there are even Unicode control characters for defining text direction that:
- use “isolates” or “embeddings” to set a base direction and let the BiDi algorithm take it from there
- automatically determine text direction according to the first strongly directional character (e.g. “a” or “א” but not “!”)
- force a direction with “override” codes that ignore the BiDi algorithm
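The three kinds of controls in the list above can be sketched in R as follows. The code points are the standard Unicode directional formatting characters; the helper wrap_bidi() and the example labels are hypothetical names of my own. How the result is displayed depends entirely on the program rendering the text.

```r
# Unicode directional formatting characters, by their standard abbreviations.
bidi <- c(
  LRE = "\u202a", RLE = "\u202b", PDF = "\u202c",  # embeddings and their terminator
  LRO = "\u202d", RLO = "\u202e",                  # overrides (also closed by PDF)
  LRI = "\u2066", RLI = "\u2067", FSI = "\u2068",  # isolates
  PDI = "\u2069"                                   # isolate terminator
)

# Hypothetical helper: wrap text in an opening control and its matching closer.
wrap_bidi <- function(text, opener = "FSI") {
  closer <- if (opener %in% c("LRI", "RLI", "FSI")) "PDI" else "PDF"
  paste0(bidi[[opener]], text, bidi[[closer]])
}

hummus <- "\u05d7\u05d5\u05de\u05d5\u05e1"

rtl_label  <- wrap_bidi(paste(hummus, "100%"), opener = "RLI")  # set a base direction
auto_label <- wrap_bidi(paste(hummus, "100%"))                  # first strong character decides
forced_ltr <- wrap_bidi(hummus, opener = "LRO")                 # ignore the BiDi algorithm

# Inspect the code points, since the controls themselves are invisible.
print(sprintf("U+%04X", utf8ToInt(rtl_label)))
```

Because these characters are invisible, printing the code points is often the quickest way to check whether a string picked them up somewhere along the way.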
If you’re like me, these guidelines don’t make much sense on a first read, so let’s take a look at how Excel handles a few different scenarios:
There’s a lot going on there, but it becomes clear pretty quickly that even without using English, the combination of Hebrew letters, numbers, and punctuation requires a set of rules for sensible display.
These rules also bleed into data frames and the like. In some ways, the following bug makes sense. After all, if everything were in Hebrew, you’d want columns to be displayed RTL.
As a non-native RTL-er, these issues are a source of frustration but also great fascination for me. I encourage those of you more fluent in RTL languages than I to weigh in on issues related to the IDE, plotting, and elsewhere. If you need guidance on producing a minimal reproducible example of your problem, check out the reprex package or Yihui Xie’s Minimal Reproducible Example Paradox blogpost.
A New Hope
There is beta UTF-8 support on Windows! Does this solve all my problems? No, not yet.
However, UTF-8 efforts from both the Windows and R world (e.g. the utf8 package) are making progress in this domain. In the meantime, the rest of us can continue to file issues, make PRs, and avoid base R functions that ignore our wish for a UTF-8 world.
I may not have managed any PRs at the Tidyverse Developer Day after rstudio::conf this year, but I had the opportunity to connect with several patient and kind encoding pros, including Colin Fay, Christophe Dervieux, Yoni Sidi (extra thanks to Yoni for reviewing my draft of this post!), and Kirill Müller. They gave me the motivation to read up on encoding yet again and assemble my thoughts and learnings here in this post.
[1] Check out a fuller history of UTF-8/encoding in this blogpost: Unicode, UTF8 & Character Sets: The Ultimate Guide