Saturday, September 28, 2013

R programming: Four power-commands to explore data

This is a simple tutorial demonstrating four useful commands for exploring data. It assumes that:

  1. You have R installed
  2. You have the data file, imdb_top_10000.txt

Start the R command-line interface by entering R in a terminal.


You can use the following command to load the data:
movies <- read.delim("imdb_top_10000.txt", header = FALSE)

[The first row of the data file has a delimiter problem, so either delete that row from the file or add "skip = 1" to the command above.]
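
A minimal sketch of the skip = 1 approach, using a small temporary file in place of the real data file. The file contents, including the malformed first line, are made up for illustration and are not taken from the actual data set.

```r
# Write a tiny tab-delimited file whose first line is malformed,
# standing in for the real data file (all rows here are hypothetical).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "malformed first row",
  "id1\tMovie A\t1994\t8.0",
  "id2\tMovie B\t1972\t9.0",
  "id3\tMovie C\t1994\t7.0"
), path)

# skip = 1 drops the first line; header = FALSE makes R name the columns V1, V2, ...
movies <- read.delim(path, header = FALSE, skip = 1)
print(movies)
```

With the real file, the same call would presumably be read.delim("imdb_top_10000.txt", header = FALSE, skip = 1).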

head(data, n): This command returns the first n records of the data object.
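
As a rough illustration, here is head() on a small made-up data frame. The values are hypothetical; the columns are named V1 to V4 only to mimic what read.delim produces when there is no header.

```r
# Hypothetical stand-in for the loaded movie data.
movies <- data.frame(
  V1 = c("id1", "id2", "id3", "id4", "id5"),
  V2 = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
  V3 = c(1994, 1972, 1994, 2008, 1972),
  V4 = c(8.0, 9.0, 7.0, 8.0, 6.0),
  stringsAsFactors = FALSE
)

# First 3 records; when n is omitted, head() returns the first 6.
first_three <- head(movies, 3)
print(first_three)
```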

dim(data): This command gives you the number of rows and columns in the data set.
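
A short sketch of dim() on a made-up data frame (the values are hypothetical). nrow() and ncol() are shown alongside it, since they return the same two numbers individually.

```r
# Hypothetical stand-in for the loaded movie data.
movies <- data.frame(
  V3 = c(1994, 1972, 1994, 2008, 1972),
  V4 = c(8.0, 9.0, 7.0, 8.0, 6.0)
)

# dim() returns a vector: number of rows first, then number of columns.
size <- dim(movies)
print(size)

# The same information, one number at a time.
print(nrow(movies))
print(ncol(movies))
```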

str(data): This command gives you the dimensions of the data set (observations and variables), the column names (Vn if there is no header), each column's data type and the first few values from each column.
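
To see what str() reports, it can be tried on a small made-up data frame. The values below are hypothetical and only mimic the shape of the movie data.

```r
# Hypothetical stand-in for the loaded movie data.
movies <- data.frame(
  V1 = c("id1", "id2", "id3", "id4", "id5"),
  V2 = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
  V3 = c(1994, 1972, 1994, 2008, 1972),
  V4 = c(8.0, 9.0, 7.0, 8.0, 6.0),
  stringsAsFactors = FALSE
)

# Prints one line for the overall size, then one line per column
# with its name, type and first few values.
str(movies)
```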

From the str() output, the 3rd column appears to be the year in which each movie was released. Let's find out how many movies were released in each year according to this data set.

table(data$column) or table(data[ , column-number]): This command gives the unique values in a column and the count of each value.
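
A minimal sketch of counting movies per year, assuming the year sits in column V3 as described above. The data frame is made up for illustration.

```r
# Hypothetical stand-in for the loaded movie data; V3 plays the role of release year.
movies <- data.frame(
  V2 = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
  V3 = c(1994, 1972, 1994, 2008, 1972),
  stringsAsFactors = FALSE
)

# Both forms select the same column, so they give the same counts.
counts <- table(movies$V3)
counts_by_index <- table(movies[, 2])
print(counts)

# Sorting the counts puts the years with the most movies first.
print(sort(counts, decreasing = TRUE))
```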

The str() output also shows that the 4th column is the movie rating on a scale of 10. Let's explore it further in terms of mean, median, max, min and so on. We can use the individual functions or a single function called summary.

summary(data$column) or summary(data[ , column-number]): This command gives the mean, median, max, min, 1st quartile and 3rd quartile.
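
A short sketch comparing summary() with the individual functions, assuming the rating sits in column V4 as described above. The ratings are made-up values.

```r
# Hypothetical stand-in for the loaded movie data; V4 plays the role of rating.
movies <- data.frame(
  V3 = c(1994, 1972, 1994, 2008, 1972),
  V4 = c(8.0, 9.0, 7.0, 8.0, 6.0)
)

# One call for min, 1st quartile, median, mean, 3rd quartile and max.
rating_summary <- summary(movies$V4)
print(rating_summary)

# The same statistics from individual functions.
print(mean(movies$V4))
print(median(movies$V4))
print(min(movies$V4))
print(max(movies$V4))
print(quantile(movies$V4, c(0.25, 0.75)))
```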
