Thursday, June 27, 2024

A Sorting Adventure in R

As part of an R application on which I have been working, I need to grab some data from an online spreadsheet keyed by a column of labels, add a column with new values, and print out the labels and new values in a two column array so that it matches the spreadsheet (meaning the order is the same as in the spreadsheet). The construction process involves breaking the original data into chunks, filling in the new values in each chunk separately, and gluing the chunks back together. The spreadsheet is sorted by the label column, so to match the original spreadsheet (allowing a user to view them side by side and see the same labels in the same order), I just needed to sort the output data frame by the label column ... or so I thought.

Since I was using the dplyr library already, and dplyr provides an arrange command to sort rows based on one or more columns, I started out with the following simple code (where df is the two column data frame I created):

df |> dplyr::arrange(Label) |> print()

Unfortunately, the result was not sorted in the same order as the spreadsheet. Some of the labels began with numbers (1000) and in one case a number containing a colon (10:20), and arrange listed them in the opposite order from that of the spreadsheet. Some of the name had funky capitalization (say, "BaNaNa"), and arrange treated capital letters as preceding lower case letters. Interestingly, the base R sort function sorted the labels in the same order that the spreadsheet used. More interestingly, the following hack suggested in a response to a question I posted online also matched the spreadsheet:

df |> dplyr::arrange(-dplyr::desc(Label)) |> print()

The desc function tells arrange to use descending order (the default being ascending order). So -desc tells arrange not to use descending order, meaning use ascending order, which is where we started ... and yet it somehow fixes the ordering problem.

The culprit turns out to be the default locale setting for arrange, which is "C". I'm guessing that means the C programming language. I filed a bug report in the dplyr repository on GitHub and was quickly told that the behavior was correct for locale "C" and that the solution to my problem was to feed arrange the optional argument .locale = "en". That did in fact fix things. The code now produces the expected sort order. Meanwhile, my bug report led to a new one about the difference in sort orders between arrange and desc. Depending on how that report is resolved, the -desc trick may stop working in the future.