Every once in a while I complain on Twitter when I try to mix non-English letters with R. I am certainly not the first person to be frustrated by encoding issues, though I am (maybe too) hopeful that the problems won’t last for much longer. We live in the age of vacuum bots and 3D-printing, so what makes multi-language support so complicated?
Trying to mix Hebrew with #rstats is a bit of a nightmare, but at least it led me to this amazing "String encoding and R" blogpost by @kevin_ushey. It clarifies a lot! https://t.co/uNfFdTynGm
— Irene Steves (@i_steves) October 24, 2018
Warning: I have no solutions in this blogpost. I’ve simply amassed my encoding knowledge (mostly from GitHub issues and from explanations/demos at the Tidyverse Developer Day) into a single blogpost.
A history lesson
Once upon a time, computer scientists needed a way to store characters as bits (1’s and 0’s). So, they came up with a system. Several systems, actually. In the early 90’s, some developers proposed UTF-8, a system that struck a balance between storage and support for many character sets (alphabets/characters in different languages). Unfortunately, the rise of UTF-8 occurred only after the establishment of core Windows systems, which were based on a different Unicode encoding (UTF-16).1 To this day, Windows does not yet have full UTF-8 support, although Linux-based and web systems have long since hopped on the UTF-8 train.
Encodings in R may not have been so bad had the default encoding in base R not been native.enc. Rather than forcing UTF-8 on its users, many base R functions translate inputs into the native encoding, whether you ask them to or not. This means that any characters that cannot be represented in the computer’s native encoding become garbled. Those who use multiple languages (and yes, emojis count) quickly find that encoding bugs are, as Joshua Goldberg put it, “quite annoying and a time sink with little value gained after you make it out alive.”
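As a minimal sketch of what that translation looks like (the exact results here are an assumption, since they depend on your OS and locale; on a Windows machine whose native code page cannot represent Hebrew, the last call comes back mangled or full of <U+...> escapes):

x <- "חומוס"                  # a UTF-8 string
Encoding(x)                   # typically "UTF-8" when the source is UTF-8
Sys.getlocale("LC_CTYPE")     # which native encoding R is currently using
enc2native(x)                 # translated to the native encoding; unrepresentable characters get mangled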
RStudio is HTML
If you right-click almost anywhere in RStudio, you’ll have an Inspect option available to you. Click it, and a Web Inspector window will pop up. Here’s what the beginning of it looks like:
This is HTML!! Granted, we have JavaScript and other languages embedded within it, but this explains (in part) how the RStudio Server and RStudio Cloud interfaces are able to mimic your local RStudio so faithfully. Note also that in the highlighted line above, the character set has been specified as UTF-8.
Okay, so if RStudio runs in HTML and can specify UTF-8, why do we still run into problems?
Reading, wRiting, and pRogramming
R and RStudio do not exist in isolation. Much of the time, we use them to read in files, write to new files, or do various programmatic conversions.
Take this simple example. We want to save a dataframe that includes non-English characters. The base function write.csv() uses the system’s native encoding by default, whereas readr’s write_csv() supports only UTF-8.
library(readr)
df <- data.frame(food = c("Crêpe", "Spätzle", "Smørrebrød", "חומוס"))
write.csv(df, "native.csv", row.names = FALSE)  # base R: written in the native encoding
write_csv(df, "utf8.csv")                       # readr: always written as UTF-8
Everything looks fine when we use the corresponding read functions on the files, but if we switch them around, we run into problems:
read.csv("utf8.csv")
## food
## 1 Crêpe
## 2 Spätzle
## 3 Smørrebrød
## 4 חומוס
read_csv("native.csv")
## Parsed with column specification:
## cols(
## food = col_character()
## )
## # A tibble: 4 x 1
## food
## <chr>
## 1 Crêpe
## 2 Spätzle
## 3 Smørrebrød
## 4 חומוס
We can work around these problems using the fileEncoding parameter in read.csv() or the locale parameter in read_csv().2
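For example, a sketch of those two workarounds might look like this (the encoding names are assumptions; swap in whatever encoding your file and locale actually use, e.g. "windows-1255" on a Hebrew Windows code page):

read.csv("utf8.csv", fileEncoding = "UTF-8")                  # tell base R the file is UTF-8
read_csv("native.csv", locale = locale(encoding = "latin1"))  # tell readr the file's real encoding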
However, some conversion processes rely on base R commands that translate to/from the native encoding, resulting in “forced round-trips.” Often, there is no workaround unless you dig into C. Many rendering functions, such as rmarkdown::render() or reprex::reprex(), get stuck because of this combination of base R defaults and the lack of full UTF-8 support on Windows.
In fact, many, many encoding issues ultimately drill down to the same few problematic base R functions, which include sink(), source(), writeLines(),3 and format().
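To make the writeLines() case concrete, here is a minimal sketch of the round-trip, along with the useBytes hack from footnote 3 (whether and how the first file gets garbled is an assumption that depends on your native encoding):

x <- enc2utf8("Smørrebrød חומוס")
writeLines(x, "roundtrip.txt")                 # re-encoded to the native encoding on the way out
writeLines(x, "asbytes.txt", useBytes = TRUE)  # writes the UTF-8 bytes untouched (the footnote 3 hack)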
Beyond functions, packages have a few extra sets of restrictions. Special (non-English) characters are not allowed in package names, nor do they always display properly in search results. Take a look at Colin Fay’s proustr package documentation, for example. The help table of contents is garbled, but the help pages themselves are mostly fine.
Enter right-to-left (RTL) languages
Things sometimes get trickier when you work with right-to-left (RTL) languages. By RTL, I mean that most of the language is written from right to left, but numbers or URLs or code or whatever else are often still left-to-right (or in English characters). Fortunately, there are standards for bidirectional, or “BiDi”, text. In fact, there are even Unicode control characters for defining text direction (sketched in R just after this list) that:
- use “isolates” or “embeddings” to set a base direction and let the BiDi algorithm take it from there
- automatically determine text direction according to the first strongly-typed character (e.g. “a” or “א” but not “!”)
- force a direction with “override” codes that ignore the BiDi algorithm
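Here is a minimal sketch of what those control characters look like from R (how the result actually displays depends on the BiDi support of your console or IDE; the phrase itself is just an illustration):

# Isolate control characters, written as Unicode escapes in R:
#   "\u2066" LEFT-TO-RIGHT ISOLATE, "\u2067" RIGHT-TO-LEFT ISOLATE,
#   "\u2068" FIRST STRONG ISOLATE,  "\u2069" POP DIRECTIONAL ISOLATE
rli <- "\u2067"
pdi <- "\u2069"
cat("I ate", paste0(rli, "חומוס 100%", pdi), "today\n")  # the Hebrew-and-number chunk displays as one RTL run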
If you’re like me, these guidelines don’t make much sense the first time through, so let’s take a look at how Excel handles a few different scenarios:
There’s a lot going on there, but it becomes clear pretty quickly that even without using English, the combination of Hebrew letters, numbers, and punctuation requires a set of rules for sensible display.
These rules also bleed into data frames and the like. In some ways, the following bug makes sense. After all, if everything were in Hebrew, you’d want columns to be displayed RTL.
As a non-native RTL-er, these issues are a source of frustration but also great fascination for me. I encourage those of you more fluent in RTL languages than I to weigh in on issues related to the IDE, plotting, and elsewhere. If you need guidance on producing a minimal reproducible example of your problem, check out the reprex package or Yihui Xie’s Minimal Reproducible Example Paradox blogpost.
A New Hope
There is beta UTF-8 support on Windows! Does this solve all my problems? No, not yet.
However, UTF-8 efforts from both the Windows and R worlds (e.g. the utf8 package) are making progress in this domain. In the meantime, the rest of us can continue to file issues, make PRs, and avoid base R functions that ignore our wish for a UTF-8 world.
Thanks
I may not have managed any PRs at the Tidyverse Developer Day after the rstudio::conf this year, but I had the opportunity to connect with several patient and kind encoding pros, including Colin Fay, Christophe Dervieux, Yoni Sidi (extra thanks to Yoni for reviewing my draft of this post!), and Kirill Müller. They gave me the motivation to read up on encoding yet again and assemble my thoughts and learnings here in this post.
1. Check out a fuller history of UTF-8/encoding in this blogpost: Unicode, UTF8 & Character Sets: The Ultimate Guide
2. For an in-depth explanation of what read/write functions do in R, take a look at Kevin Ushey’s excellent post on String encoding in R.
3. The writeLines() function does, in fact, work if you supply the useBytes = TRUE argument, but it is a hack that xfun::write_utf8() exploits to alleviate your encoding headaches.