Tuesday, April 23, 2019

Lower probability signifies higher information. How come?

"The more unlikely it is that X equals x, the more informative the message." Does this sound a bit weird? Well, it's not!

The logic is simple: if an outcome makes up only a small share of the possibilities, you need to be more certain to pick it out of the lot. Needing more certainty is the same as needing more information.

For example, in a basket of 12 fruits there are 3 oranges and 9 apples.
Then,
P(orange) = 3/12
P(apple) = 9/12

If I am to pick an orange, I need more information at hand, because I am picking the item that is rarer, i.e. less probable. Picking an apple lets me be a bit more liberal than in the orange case: I have some room for mistakes, so it carries less information than the previous case.
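
To put numbers on the basket example, here is a minimal sketch in R. It assumes the standard definition of self-information, I(x) = -log2(P(x)), measured in bits; the variable names are only for illustration.

```r
# Probabilities from the fruit basket example
p <- c(orange = 3/12, apple = 9/12)

# Self-information in bits: I(x) = -log2(P(x))
info_bits <- -log2(p)

print(round(info_bits, 3))
```

The orange comes out at 2 bits and the apple at roughly 0.415 bits, so the rarer fruit carries more information.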

Tuesday, April 09, 2019

Hey R, your sorting looks a little off. Why so?

Sorting strings looks very simple since, as in most other languages, a built-in function takes care of it. But it gets complex when you don't get what you expect. For example:

> sort(c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9"))

[1] "0.9" "0.9" "1" "1.01" "1.9" "a" "A" "aaaa" "AP" "app" "APP" "Az"

Let's walk through the output. The numbers are in the order we would normally expect.
"a" sorts before "A", followed by "aaaa". OK, looks good.
"AP" sorts before "app"? How come?
"app" sorts before "APP", which is in line with "a" < "A", but why is "Az" at the end?

It all boils down to the locale (collation order) that R is using in your setup. You can easily check it with:

> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

This is Ubuntu 17.04 / RStudio default install / version 1.1.463

The same sort output is seen on Windows 10 / RStudio default install / version 1.1.453. Here, checking the LC_COLLATE value shows:

> Sys.getlocale("LC_COLLATE")

[1] "English_United States.1252"

This is getting interesting :) The UTF-8 and 1252 locales sort by language rules, whereas we are mostly looking for byte-value ordering, the way the machine compares strings and the way we as developers are more familiar with. Hence, choose "C" or "POSIX" as the LC_COLLATE setting.

You can set it on either OS with the command below:
> Sys.setlocale("LC_COLLATE","C")
[1] "C"

If the OS cannot honor "POSIX", the call returns an empty string with a warning. In that case fall back to "C":

> Sys.setlocale("LC_COLLATE","POSIX")
[1] ""
Warning message:
In Sys.setlocale("LC_COLLATE", "POSIX") :
  OS reports request to set locale to "POSIX" cannot be honored
> Sys.setlocale("LC_COLLATE","C")
[1] "C"

> sort(c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9")) 
[1] "0.9" "0.9" "1" "1.01" "1.9" "A" "AP" "APP" "Az" "a" "aaaa" "app" 
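
If you would rather not change the session locale at all, one option is base R's radix sort method, which is documented to collate character vectors in the C locale. A minimal sketch, using the same input vector as above:

```r
x <- c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9")

# method = "radix" compares character vectors in the C locale,
# regardless of the current LC_COLLATE setting
byte_sorted <- sort(x, method = "radix")
print(byte_sorted)
```

This should give the same byte-value ordering as the "C" locale run above, while leaving LC_COLLATE untouched.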


Putting it all together for reference:
en_US.ISO8859-1 : 0.9 0.9 1 1.01 1.9 a A aaaa AP app APP Az
en_US.UTF-8     : 0.9 0.9 1 1.01 1.9 a A aaaa AP app APP Az
C               : 0.9 0.9 1 1.01 1.9 A AP APP Az a aaaa app



** So, whenever sorting goes for a toss or behaves differently from what you expect, check the locale setting. In a distributed setup, you may need to check it on both the server and the client to debug and be sure.
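
When debugging, it can help to try a collation locale for a single sort and then put the old setting back. The helper below is a hypothetical sketch (sort_in_locale is my own name, not a base R function); it assumes the requested locale exists on the machine, which is always the case for "C".

```r
# Hypothetical helper: sort x under a given LC_COLLATE, then restore the old one
sort_in_locale <- function(x, locale = "C") {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collation locale when the function exits
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  sort(x)
}

x <- c("app","A","a","Az","APP","AP","aaaa",1,1.01,1.9,.9,"0.9")
before <- Sys.getlocale("LC_COLLATE")
c_sorted <- sort_in_locale(x, "C")
after <- Sys.getlocale("LC_COLLATE")
print(c_sorted)
```

Running the same helper on the server and on the client with their respective locale names is one way to see whether the two sides disagree.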

Thursday, April 04, 2019

Getting Excel workbook sheet names using the R xlsx package

Load your workbook (Excel file). In my case the file is named "input_4_r.xlsx".

> wb<-loadWorkbook("input_4_r.xlsx")
List the sheets. My example file has 2 sheets: I kept the default name for the first sheet and named the second one "name city", hence the output below.
> getSheets(wb)
$Sheet1
[1] "Java-Object{Name: /xl/worksheets/sheet1.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml}"

$`name city`
[1] "Java-Object{Name: /xl/worksheets/sheet2.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml}"
You can get just the sheet names as below:
> names(getSheets(wb))
[1] "Sheet1"    "name city"
To get the name of the sheet at a specific index, pass that index, e.g. [2] for the 2nd sheet in my case:
> names(getSheets(wb))[2]
[1] "name city"
*** The above assumes the xlsx package is installed and loaded in R, e.g. with library(xlsx).