Wednesday, August 27, 2014

Useful Unix commands for exploring data

While dealing with big genetic data sets I often ran into the limitations of programming languages when reading big files. It is also not always convenient to load a data file into Python or R just to perform a few basic checks and some exploratory analysis. Unix commands are pretty handy in these scenarios and often take significantly less time to run.

Let's consider a movie data set from some parallel universe (with random values) for this exercise. There are 8 fields in total:

[sample records of movies.csv shown as an image in the original post]

With a few duplicate records:

[the duplicated records shown as an image in the original post]

Let's start with a few basic examples.

Check column names of file
head -1 movies.csv
 
We could use cat to display the whole file, but we are only interested in the column names, which are in the first row, so we use the head command.
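
If we also want a quick look at a few sample records, head accepts any number of lines (a small sketch):
## Header plus the first 4 records ##
head -5 movies.csv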

Check number of records in the file
wc -l movies.csv

The command wc gives us the number of lines, words and bytes in a file. As we are only interested in the number of lines, we use wc -l.
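
For reference, wc without any flag prints all three counts at once:
wc movies.csv
# output format: <lines> <words> <bytes> movies.csv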

Check 50th record
head -50 movies.csv | tail -1


The | operator is called a pipe and it forwards the output of the first command as the input to the second command. Similar to the pipe we have >, which writes the output to a file, and >>, which appends the output at the end of the file.
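
For example, a small sketch using a hypothetical sample.csv:
## Write the header and first 9 records to a new file (> overwrites it if it exists) ##
head -10 movies.csv > sample.csv
## Append the 50th record to the same file (>> appends) ##
head -50 movies.csv | tail -1 >> sample.csv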



Find and then remove duplicate records
## Check duplicate records ##
uniq -d movies.csv
## Remove duplicate records ##
uniq -u movies.csv > uniqMovie.csv
## Crosscheck the new file for duplicates ##
uniq -d uniqMovie.csv

In uniq, -d prints only the duplicated records and -u prints only the records that occur exactly once, so here we have selected the unique records and redirected them to a new file. Note that -u drops every copy of a duplicated record; to keep one copy of each duplicate instead, use plain uniq (uniq movies.csv > uniqMovie.csv).


*If we don't want a new file, we can overwrite the original file in two steps. [Correction based on HN input by CraigJPerry]

We can use a temporary file for this:
uniq -u movies.csv > temp.csv
mv temp.csv movies.csv

**An important thing to note here is that uniq won't detect duplicate records unless they are adjacent. [Addition based on HN inputs]
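
A sketch that handles non-adjacent duplicates by sorting first (note that this reorders the records, header included):
## sort makes duplicate records adjacent, uniq then keeps one copy of each ##
sort movies.csv | uniq > uniqMovie.csv
## or let sort deduplicate on its own ##
sort -u movies.csv > uniqMovie.csv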


Display 10 movies with lowest ratings
sort -n -t',' -k3 uniqMovie.csv | head -10
# -n tells sort to do numerical sorting instead of lexical sorting
# -t specifies the delimiter, which is a comma in this case
# -k indicates which column to sort on, the 3rd in our case

Numerical sorting gives 1, 2, 15, 17, 130, 140 while lexical sorting gives 1, 130, 140, 15, 17, 2. Note that sort does lexical sorting of its input by default. We can add -r to sort in descending order.
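
For example, to display the 10 movies with the highest ratings (a small sketch; tail -n +2 can be used first if the header line gets in the way of the numeric sort):
sort -n -r -t',' -k3 uniqMovie.csv | head -10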




Check number of columns
awk -F "," '{print NF}' uniqMovie.csv | head -1
# -F "," specifies field separator as comma


We use -F to specify the field separator as a comma. Then, using the pipe |, we redirect the output to head and select just the first value (awk prints NF once per record, so this avoids printing it on multiple lines). We can crosscheck this result against the output of the first command, where we used head to check the column names. awk is preferable for finding the number of columns when there are too many to count by hand, or when we have no idea how many there are.

NF stores the Number of Fields in the current record, based on the field separator we have provided. Similar to NF there is NR, which stores the current record number; records are separated by newlines (\n) by default.
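
For example, NR can be used to count records or to pick out a specific one (a small sketch on the same file):
## Count the number of records (same result as wc -l) ##
awk 'END {print NR}' uniqMovie.csv
## Print only the 50th record ##
awk 'NR==50' uniqMovie.csv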



Filter the data to get only 1st, 2nd, 6th and last column
awk 'BEGIN{FS=",";OFS=",";}{print $1,$2,$6,$NF}' uniqMovie.csv > newMovie.csv 

Here we have used built-in variables FS and OFS, similar to NR and NF mentioned in the last example. FS is nothing but the field separator, equivalent to -F. OFS is the output field separator, which inserts the specified separator between every column while writing to the new file above. We use $n to refer to the n-th column, and $NF to refer to the last one.
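
A rough cut-based equivalent of the same selection, assuming the last column is the 8th (cut cannot refer to the last field the way $NF does):
cut -d',' -f1,2,6,8 uniqMovie.csv > newMovie.csv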



Create a separate file for every month based on the release date
tail -n +2 newMovie.csv | awk -F"," '{Month = $4; print > (Month "_movies.csv")}'

We want a file for every value in the 4th column, but not for the column name, which is in the first row. So we use the tail command to select all records starting from the 2nd one (to exclude the header).

Here we store the value of the 4th column (the release month) in the variable Month and append every record to the csv file for that month. Each csv file is named using the Month value followed by "_movies.csv".
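
We can quickly sanity-check the result by listing the generated files and counting the records in each (the exact file names depend on the values present in the 4th column):
ls *_movies.csv
wc -l *_movies.csv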


What is the average number of reviews for all movies? 
awk -F "," '{colSum+=$2} END { print "Average = ",colSum/NR}' movies.csv

Here colSum is a variable that accumulates the sum of the column of interest, specified by $2, as awk loops over the records. The average is then calculated and displayed in the END block.
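
Note that NR here also counts the header line (whose 2nd field contributes 0 to the sum), so the result is slightly off. A sketch that skips the header:
awk -F "," 'NR>1 {colSum+=$2} END { print "Average = ",colSum/(NR-1)}' movies.csv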


The next example is based on what we have discussed so far. See if you can do it.

Display the total box office sales of the top 20 movies

Hint:
Sort the records by box office sales, then select the 20 records with the highest sales. Then find the sum of that column and display it on screen.
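
One possible solution along those lines, assuming box office sales are in the last (8th) column; adjust the column number to match your data:
# sort by sales (descending numeric), keep the top 20 records, then sum the sales column
sort -t',' -k8 -n -r uniqMovie.csv | head -20 | awk -F "," '{total+=$8} END {print "Total = ", total}'
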
Feel free to suggest corrections or better ways to solve these examples in the comments section.

Follow discussion on: Hacker News

33 comments:

  1. A better way to print a specific line than

    head -50 movies.csv | tail -1

    is

    sed '50q;d' movies.csv

    It's faster for big files, starts fewer processes, and requires less typing.

    Replies
    1. Thank you, useful to know!

    2. You can also use:

      sed -n 50,50p movies.csv

      (with no quotes needed around the 50,50p )

    3. That sed command can be made into a script called body (because it complements head and tail):

      https://news.ycombinator.com/item?id=8234255

    4. @Vasudev: Interesting to know but could not replicate it.
      How did you create the file?

    5. You mean how did I create the text file that I used to test the body command? Just typed it in (using vim, one word per line), then you do a:
      1G
      which moves the cursor to the 1st line of the file,
      then do a:
      !Gnl -ba
      which pipes the entire file through the nl command, which numbers lines, then nl's output replaces the complete text of the file you are editing. Then save it. See "man nl" and the !G command in vim. The !G command of vim is very powerful because it allows you to pipe (part or all of) the contents of your text file (which can also be a file of program code, like a .py / .rb / .c file), through any Unix command, and the transformed text then replaces the text you were editing. So any text transformation available via a built-in Unix command, or a custom script or program that you write, is available to you, to process your text, without having to leave the vim editor. I use it a lot when programming and writing text.

      On the other hand, if you meant - how did I create the body command - then type this sed command:

      sed -n $1,$2p $3

      into a file called just body - no extension - which you create in some folder in your PATH.
      Then do:

      chmod u+x body

      That makes it executable.
      Then you can run it from any directory like this:

      body 25 65 text_filename
      and the 25 replaces the $1, 65 replaces $2, and text_filename replaces $3.
      The -n tells sed not to print all lines from the input file (its default behavior) and the p after the $2 says to Print the lines that match the pattern - which in this case is a range of line numbers - $1,$2 ==> 25,65 - which makes sed print only those lines from the file.

      HTH.

    6. The awk equivalent:

      awk 'NR==50' movies.csv

    7. Or simply

      sed -n 50p movies.csv

    8. Yes, that's simpler - I forgot that you can specify a single line number instead of a range, though I use it with "sed 1d file" or "sed 10q file" which works like head.

  2. I don't think you meant to use `uniq -u`. This removes all copies of each line that is duplicated, meaning that The Dark Knight Rises will not appear at all in the output. If you just use `uniq`, one copy of each line will remain. Also, it is worth noting that `uniq` requires the file to be sorted first, because it only detects duplicate lines that are adjacent.

    Replies
    1. +1. I tend to go with `sort -u` most of the time as it takes care of sorting and deduping in one command.

    2. Yes, I want to repeat a part of Nathan's comment so that it is not lost: You must pass data to `sort` before using `uniq`! This is because uniq only looks at two sequential lines of context at a time, it does not remember state from the rest of the file.

  3. Just a heads up that most of these commands will fail with non-trivial data because CSV files can contain rows that span multiple lines. I wrote csvkit to solve exactly this problem: http://csvkit.readthedocs.org/en/0.8.0/ Similar commands, but it handles CSV format correctly.

    Replies
    1. yes, csvkit is a suite of tools that replace UNIX tools and add csv awareness

      another option is to use csvquote (https://github.com/dbro/csvquote) to temporarily replace the commas and newlines inside quotes, then use the UNIX commands as normal, and lastly restore the commas and newlines. the docs give some examples similar to what's in this blog post.

  4. $$ Check number of records in the file
    $$ wc -l movies.csv

    Will the returned count include the header line?

    Replies
    1. Yes, it will count header as just another record.

      You can exclude header by,
      $$ tail -n +2 movies.csv | wc -l

  5. From the manpage: uniq [-c | -d | -u] [-i] [-f num] [-s chars] [input_file [output_file]]

    `uniq movies.csv > uniqMovie.csv` You can lose the pipe

  6. Just waiting for the perl guys to pipe up to replace the venerable awk :)

  7. Is there any way to find duplicate records, disregarding one of the columns? For example, I have a CSV file with 11 columns and I'd like to find the number of duplicate rows where each column is the same except for the 10th column. Is that possible using a Unix command? I'd love to not have to open the file with Python. Thanks, and nice tutorial!

    Replies
    1. You can do this using `cut` to remove the offending column and passing the result to ` | sort | uniq -d` but be careful, cut does not understand the quoting mechanisms of CSV so if there is any data with quoted commas then it can easily snip the wrong data.

      Better to use something like csvkit if this is anything but quick-and-dirty data parsing.

    2. And here is an example that snips out the second from last column and looks for duplicates, assuming that there are no quoted commas in the last two columns:

      sed -e 's/\(.*\),.*,\(.*\)/\1,\2/' < /tmp/report.csv | sort | uniq -d

      This relies on the regex rule of "longest leftmost" so I didn't need to use any character classes of [^,] or similar. If you'd picked a different column it would necessarily be more complicated.

  8. http://tldp.org/LDP/abs/html/string-manipulation.html

  9. I think you meant to say the n-th column? "Filter the data to get only 1st, 2nd, 6th and last record

    awk 'BEGIN{FS=",";OFS=",";}{print $1,$2,$6,$NF}' uniqMovie.csv > newMovie.csv

  10. Worth noting here that uniq truncates lines (after 2K, 4K or 8K or more depending on platform). This may cause unexpected results at output when the CSV has many fields.

    Replies
    1. That problem is probably not unique to uniq (pun unintended). It's likely the same old well-known problem that many (most?) Unix command line filters have, of using BUFSIZ (possibly taken directly from the Kernighan and Ritchie book) as the buffer (char array) size for the line variable they use to process lines. And BUFSIZ was set to values like one of 512 bytes, 1K, 2K, 4K etc. so those tools fail on lines longer than BUFSIZ characters. I'm not sure of the latest status of this bug - whether it has been fixed in modern versions of *nix or not.

  11. Take a look at the paste command too. I used it for merging datasets in one project.

  12. Some notes regarding AWK:

    Note that in addition to the Action part (i.e., the stuff between the curly braces), there's the Pattern part, that allows you to select lines and other conditions. For example, to find the number of fields (in AWK parlance, a data file contains records, which in turn consist of fields) in your file (assuming no oddities later on in the file), you can just use AWK, no need for 'tail':

    awk -F',' 'NR == 1 { print NF }'

    This also comes up in your example where you want to disregard the header line:

    awk -F',' 'NR>1 { print > ($4 "_movies.csv") }'

    AWK has built-in hash tables, which allows you to avoid the 'sort | uniq' pattern in shell scripting when trying to find unique lines, which do not necessarily need to be sorted. For very large files, sort will start hitting the disk and take O(N*log(N)) time. For AWK, this will be linear time and in-memory, assuming that the number of unique lines doesn't exhaust virtual memory.

    awk -F',' '!a[$0]++' movies.csv

    Lastly, you can define AWK record and field separators as regular expressions, and not just as fixed characters or fixed strings. This allows you to process messy multi-line data files (such as Microsoft-esque CSV).

    Replies
    1. Meant to write, "data files with messy multi-line records".

    2. Nice examples. Can you explain how this one works:

      awk -F',' '!a[$0]++' movies.csv

      Not seen it before. I can see that it uses associative arrays, but what does the ! do here? negate something?

  13. This is also a good anecdote of effective use of AWK (here, the mawk implementation) for processing Big Data, which happened to be faster than even the naive C code:

    http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/

    Replies
    1. Looks cool, will have to explore it. Thank you!

  12. Install sqlite3 (<500kb); .mode csv; .import file.csv tablename; munge to your heart's content with good ol' SQL; .output file.csv; done.
