Saturday, September 21, 2013

Basics of R programming: File management

R is growing rapidly and, according to KDnuggets, is the language most preferred by data analysts. It is replacing not only high-end software like SAS but also MS Excel. Recently I saw a video on using Google Analytics with R, which is pretty cool. Generally we deal with data files that have txt, csv, tsv or rda extensions. The following simple function call can be used to read a CSV file:
 dat <- read.table("Sales.csv",header=TRUE,sep=',')  

  • Here "dat" is the data object into which the data from the file is loaded. 
  • "read.table" is a built-in R function. 
  • "Sales.csv" is the file name, given in quotes. 
  • "header" indicates whether the first row of the file contains column names. Its value can be TRUE, T, FALSE or F, without quotes. 
  • "sep" indicates how columns are separated from each other. In this example we are reading a csv file, so the value of the sep parameter is a comma. Similarly, for a tsv file it is "\t". 
  • Sometimes we might need the "fileEncoding" parameter to read a file, e.g. fileEncoding = "windows-1252". 
  • We can also add the "skip" parameter when the first few lines contain a description of the file rather than actual data, e.g. skip = 10.
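
As a minimal sketch of the sep, header and skip parameters described above, the following writes a small tab-separated file with two description lines to a temporary location and reads it back. The file contents and column names are invented for illustration only.

```r
# Create a small tab-separated file with two description lines on top
# (contents and column names are made up for illustration).
path <- tempfile(fileext = ".tsv")
writeLines(c("Sales export",
             "Generated for testing",
             "region\tamount",
             "North\t100",
             "South\t250"), path)

# skip = 2 ignores the description lines, sep = "\t" splits on tabs,
# header = TRUE takes the column names from the first remaining row.
dat <- read.table(path, header = TRUE, sep = "\t", skip = 2)
str(dat)
```

Without skip = 2, read.table would try to treat the first description line as the header row.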

Similarly, we have "write.table" to write R objects to a file.
 write.table(dat2,file='SalesCleanedQ4.csv', sep=',')  
Another common format is the .rda file. It is simply an R data file used to save R objects. The good thing about these files is that you don't need to assign the result to a variable when loading them.
  • The object is created with the same name as the R object that was saved in the rda file. For example, let's say I wrote an R program that reads some csv into a "movie" object and then saves it as "movieDetails.rda". Whenever I use the load function on this rda file, the data is loaded into memory under the object name "movie". 
  • If you are not sure what the object name is, you can call ls() to see the objects created by loading the rda file. 
  • If you think other R objects in memory might make it hard to find the corresponding object, then you can 
    • Call rm(list=ls()), which clears all previously loaded objects from memory 
    • Or, if you don't wish to delete previously loaded objects, call ls() before and after loading the rda file. 
  • As the file is loaded under its predefined object name, it pays to choose the object name thoughtfully when saving an rda file.

This is how we save (create) an rda file from a given R object.
 save(movie, file="movieDetails.rda")  
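
The save and load behaviour described above can be sketched as a round trip. This is only an illustration: the contents of the movie data frame and the temporary file path are assumptions, not part of the original example.

```r
# A small data frame standing in for the "movie" object from the text.
movie <- data.frame(title = c("A", "B"), year = c(2001, 2010))

# Save it to a temporary rda file (the path is just for this example).
rda_path <- tempfile(fileext = ".rda")
save(movie, file = rda_path)

# Remove the object, then note what is in the workspace before loading.
rm(movie)
before <- ls()

# load() restores the object under its original name and
# invisibly returns the names of everything it loaded.
loaded <- load(rda_path)
print(loaded)

# Comparing ls() before and after gives the newly created objects,
# ignoring the two helper variables introduced by this sketch.
new_objects <- setdiff(ls(), c(before, "before", "loaded"))
print(new_objects)
```

Capturing the return value of load() is often simpler than comparing ls() output, since it names exactly the objects that came from the file.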

If you want to read multiple files from a directory, you can first list their names:
 set <- list.files("/MyDirectory/MovieData/")  

You can do this for files in multiple directories as well:
 set1 <- list.files("/MyDirectory1/MovieData/")  
 set2 <- list.files("/MyDirectory2/MovieData/")   
 masterSet <- c(set1, set2)   

To make it easy to read those files from anywhere, we can build a list with absolute paths:
 # List set 1 files and prefix each name with its directory
 set1 <- list.files("/MyDirectory1/MovieData/")  
 for (i in seq_along(set1))  
   set1[i] <- paste("/MyDirectory1/MovieData/", set1[i], sep='')  

 # List set 2 files and prefix each name with its directory
 set2 <- list.files("/MyDirectory2/MovieData/")  
 for (i in seq_along(set2))  
   set2[i] <- paste("/MyDirectory2/MovieData/", set2[i], sep='')  

 # Combine both sets
 masterSet <- c(set1, set2)  

Here, instead of just making a list of bare file names, we prefix each file name with its absolute path. The "paste" function concatenates its arguments, and sep = '' means that nothing is inserted between them.
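
As an alternative sketch, list.files can return the prefixed paths itself through its full.names argument, and the resulting vector can be passed to a reading function. The temporary directory, the file names and the id column below are assumptions made so the example can run anywhere.

```r
# Build a throwaway directory with two small CSV files
# (directory and file names are illustrative).
dir1 <- tempfile("MovieData")
dir.create(dir1)
write.csv(data.frame(id = 1:2), file.path(dir1, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5), file.path(dir1, "b.csv"), row.names = FALSE)

# full.names = TRUE prepends the directory to each file name,
# so no explicit loop with paste() is needed.
set1 <- list.files(dir1, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list of data frames, then stack them by rows.
tables <- lapply(set1, read.csv)
combined <- do.call(rbind, tables)
print(combined)
```

Stacking with rbind assumes every file has the same columns; files with different layouts would need to be kept as separate list elements.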

If you want to remove the quotes around values in the written file, set quote=FALSE in the write.table call.
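
A minimal sketch of suppressing the quotes, assuming the quote argument of write.table is what the write statement needs. The contents of dat2 are made up, and row.names = FALSE is added here only to keep the output short.

```r
# Hypothetical data standing in for the dat2 object used earlier.
dat2 <- data.frame(product = c("Pen", "Book"), units = c(10, 3))
out_path <- tempfile(fileext = ".csv")

# Default: character values and column names are wrapped in double quotes.
write.table(dat2, file = out_path, sep = ",", row.names = FALSE)
quoted <- readLines(out_path)

# quote = FALSE writes the same values without the quotes.
write.table(dat2, file = out_path, sep = ",", row.names = FALSE, quote = FALSE)
unquoted <- readLines(out_path)

print(quoted)
print(unquoted)
```

Dropping quotes is safe only when the values themselves never contain the separator; otherwise the file can no longer be split into columns reliably.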
