Tuesday, April 09, 2019

Hey R, your sorting looks a little off. Why so?

Sorting strings looks simple: as in most other languages, R has a built-in function that takes care of it. It gets complicated when you don't get the result you expect. For example:

> sort(c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9"))

[1] "0.9" "0.9" "1" "1.01" "1.9" "a" "A" "aaaa" "AP" "app" "APP" "Az"

Let's walk through the output. The numbers are in the order we would normally expect (they were coerced to strings along with everything else, which is why "0.9" appears twice).
"a" is smaller than "A", followed by "aaaa". OK, looks good.
"AP" is smaller than "app"? How so?
"app" is smaller than "APP", which is in line with "a" < "A". But "Az" at the end, why?

It all boils down to the collation locale that R is using in your setup. You can easily check it with:

> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

This is on Ubuntu 17.04 with a default RStudio install, version 1.1.463.

The same sort output is seen on Windows 10 with a default RStudio install, version 1.1.453. There, checking the LC_COLLATE value shows:

> Sys.getlocale("LC_COLLATE")

[1] "English_United States.1252"
This is getting interesting :) Both the UTF-8 and the 1252 locales sort according to language rules, whereas we are mostly looking for byte-value ordering, the way the machine sees strings and the way we as developers are used to. Hence, choose "C" or "POSIX" as the LC_COLLATE setting.
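To see what byte-value ordering means for the characters in this example, here is a small sketch that prints the code points of a few ASCII characters using base R's utf8ToInt(). The variable names are just for illustration. For ASCII, the code point equals the byte value.

```r
# Code points of a few ASCII characters (equal to their byte values)
chars <- c("0", "9", "A", "Z", "a", "z")
codes <- vapply(chars, utf8ToInt, integer(1))
print(codes)
# Digits (48-57) come before uppercase (65-90),
# which come before lowercase (97-122).
```

So under byte-value ordering every uppercase letter sorts before every lowercase letter, which is why "Az" ends up ahead of "a".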

You can set it with the command below on either OS:
> Sys.setlocale("LC_COLLATE","C")
[1] "C"

> Sys.setlocale("LC_COLLATE","POSIX")
[1] ""
Warning message:
In Sys.setlocale("LC_COLLATE", "POSIX") :
  OS reports request to set locale to "POSIX" cannot be honored
> Sys.setlocale("LC_COLLATE","C")
[1] "C"

> sort(c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9")) 
[1] "0.9" "0.9" "1" "1.01" "1.9" "A" "AP" "APP" "Az" "a" "aaaa" "app" 
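If changing the session locale is not desirable, sort() also accepts method = "radix", which the R documentation describes as not respecting locale collation for character vectors. A minimal sketch, assuming a reasonably recent R (the method was added to sort() around R 3.3.0):

```r
x <- c("app", "A", "a", "Az", "APP", "AP", "aaaa", 1, 1.01, 1.9, .9, "0.9")

# Radix sort orders character vectors by byte value,
# regardless of the current LC_COLLATE setting.
sorted <- sort(x, method = "radix")
print(sorted)
```

The result should match the "C" output above, without touching LC_COLLATE.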


Putting it all together for reference:
en_US.ISO8859-1 : 0.9 0.9 1 1.01 1.9 a A aaaa AP app APP Az
en_US.UTF-8     : 0.9 0.9 1 1.01 1.9 a A aaaa AP app APP Az
C               : 0.9 0.9 1 1.01 1.9 A AP APP Az a aaaa app



** So, whenever sorting goes haywire or behaves differently from what you expect, check the locale setting. In a distributed setup, you may need to check it on both the server and the client to debug and be sure.
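
When only one sort needs byte-value ordering, it may be safer to switch LC_COLLATE temporarily and restore it afterwards. Below is a rough sketch of such a wrapper. The name with_c_collate is hypothetical (not a base R function), and it relies on R's lazy evaluation so that the expression runs only after the locale has been switched.

```r
# Hypothetical helper: evaluate `expr` under "C" collation,
# then restore the previous LC_COLLATE value on exit.
with_c_collate <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)  # the argument is evaluated here, under the "C" locale
}

res <- with_c_collate(sort(c("app", "A", "a", "Az")))
print(res)
# [1] "A"   "Az"  "a"   "app"
```

Because the old value is restored in on.exit(), the rest of the session keeps its language-aware sorting even if the expression fails.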
