Wednesday, August 27, 2014

Useful Unix commands for exploring data

While dealing with big genetic data sets I often ran into the limitations of programming languages when reading big files. It is also not always convenient to load a data file into Python or R just to perform a few basic checks and some exploratory analysis. Unix commands are pretty handy in these scenarios and often take significantly less time to run.

Let's consider a movie data set from some parallel universe (with random values) for this exercise. There are 8 fields in total:

[sample records of movies.csv shown as an image in the original post]

With a few duplicate records:

[the duplicated records shown as an image in the original post]

Let's start with a few basic examples.

Check column names of file
head -1 movies.csv
 
We could use cat to display the whole file, but we are only interested in the column names, which are in the first row, so we use the head command.
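
If we also want a quick look at a few sample records, head accepts any number of lines (a small sketch):
## Header plus the first 4 records ##
head -5 movies.csv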

Check number of records in the file
wc -l movies.csv

The command wc gives us the number of lines, words and bytes in a file. As we are only interested in the number of lines, we use wc -l.
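
For reference, wc without any flag prints all three counts at once:
wc movies.csv
# output format: <lines> <words> <bytes> movies.csv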

Check 50th record
head -50 movies.csv | tail -1


The | operator is called a pipe and it forwards the output of the first command as the input to the second command. Similar to the pipe we have >, which writes the output to a file, and >>, which appends the output at the end of the file.
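
For example, a small sketch using a hypothetical sample.csv:
## Write the header and first 9 records to a new file (> overwrites it if it exists) ##
head -10 movies.csv > sample.csv
## Append the 50th record to the same file (>> appends) ##
head -50 movies.csv | tail -1 >> sample.csv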



Find and then remove duplicate records
## Check duplicate records ##
uniq -d movies.csv
## Remove duplicate records ##
uniq -u movies.csv > uniqMovie.csv
## Crosscheck the new file for duplicates ##
uniq -d uniqMovie.csv

In uniq, -d prints only the duplicated records and -u prints only the records that occur exactly once, so here we have selected the unique records and redirected them to a new file. Note that -u drops every copy of a duplicated record; to keep one copy of each duplicate instead, use plain uniq (uniq movies.csv > uniqMovie.csv).


*If we don't want a new file, we can overwrite the original file in two steps. [Correction based on HN input by CraigJPerry]

We can use a temporary file for this:
uniq -u movies.csv > temp.csv
mv temp.csv movies.csv

**An important thing to note here is that uniq won't detect duplicate records unless they are adjacent. [Addition based on HN inputs]
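
A sketch that handles non-adjacent duplicates by sorting first (note that this reorders the records, header included):
## sort makes duplicate records adjacent, uniq then keeps one copy of each ##
sort movies.csv | uniq > uniqMovie.csv
## or let sort deduplicate on its own ##
sort -u movies.csv > uniqMovie.csv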


Display 10 movies with lowest ratings
sort -n -t',' -k3 uniqMovie.csv | head -10
# -n tells sort to do numerical sorting instead of lexical sorting
# -t specifies the delimiter, which is a comma in this case
# -k indicates which column to sort on, the 3rd in our case

Numerical sorting gives 1, 2, 15, 17, 130, 140 while lexical sorting gives 1, 130, 140, 15, 17, 2. Note that sort does lexical sorting of its input by default. We can add -r to sort in descending order.
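
For example, to display the 10 movies with the highest ratings (a small sketch; tail -n +2 can be used first if the header line gets in the way of the numeric sort):
sort -n -r -t',' -k3 uniqMovie.csv | head -10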




Check number of columns
awk -F "," '{print NF}' uniqMovie.csv | head -1
# -F "," specifies field separator as comma


We use -F to specify the field separator as a comma. Then, using the pipe |, we redirect the output to head and select just the first value (awk prints NF once per record, so this avoids printing it on multiple lines). We can crosscheck this result against the output of the first command, where we used head to check the column names. awk is preferable for finding the number of columns when there are too many to count by hand, or when we have no idea how many there are.

NF stores the Number of Fields in the current record, based on the field separator we have provided. Similar to NF there is NR, which stores the current record number; records are separated by newlines (\n) by default.
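
For example, NR can be used to count records or to pick out a specific one (a small sketch on the same file):
## Count the number of records (same result as wc -l) ##
awk 'END {print NR}' uniqMovie.csv
## Print only the 50th record ##
awk 'NR==50' uniqMovie.csv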



Filter the data to get only 1st, 2nd, 6th and last column
awk 'BEGIN{FS=",";OFS=",";}{print $1,$2,$6,$NF}' uniqMovie.csv > newMovie.csv 

Here we have used built-in variables FS and OFS, similar to NR and NF mentioned in the last example. FS is nothing but the field separator, equivalent to -F. OFS is the output field separator, which inserts the specified separator between every column while writing to the new file above. We use $n to refer to the n-th column, and $NF to refer to the last one.
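
A rough cut-based equivalent of the same selection, assuming the last column is the 8th (cut cannot refer to the last field the way $NF does):
cut -d',' -f1,2,6,8 uniqMovie.csv > newMovie.csv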



Create a separate file for every month based on the release date
tail -n +2 newMovie.csv | awk -F"," '{Month = $4; print > (Month "_movies.csv")}'

We want a file for every value in the 4th column, but not for the column name, which is in the first row. So we use the tail command to select all records starting from the 2nd one (to exclude the header).

Here we store the value of the 4th column (the release month) in the variable Month and append every record to the csv file for that month. Each csv file is named using the Month value followed by "_movies.csv".
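
We can quickly sanity-check the result by listing the generated files and counting the records in each (the exact file names depend on the values present in the 4th column):
ls *_movies.csv
wc -l *_movies.csv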


What is the average number of reviews for all movies? 
awk -F "," '{colSum+=$2} END { print "Average = ",colSum/NR}' movies.csv

Here colSum is a variable that accumulates the sum of the column of interest, specified by $2, as awk loops over the records. The average is then calculated and displayed in the END block.
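
Note that NR here also counts the header line (whose 2nd field contributes 0 to the sum), so the result is slightly off. A sketch that skips the header:
awk -F "," 'NR>1 {colSum+=$2} END { print "Average = ",colSum/(NR-1)}' movies.csv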


The next example is based on what we have discussed so far. See if you can do it.

Display the total box office sales of the top 20 movies

Hint:
Sort the records by box office sales, then select the 20 records with the highest sales. Then find the sum of that column and display it on screen.
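
One possible solution along those lines, assuming box office sales are in the last (8th) column; adjust the column number to match your data:
# sort by sales (descending numeric), keep the top 20 records, then sum the sales column
sort -t',' -k8 -n -r uniqMovie.csv | head -20 | awk -F "," '{total+=$8} END {print "Total = ", total}'
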
Feel free to suggest corrections or better ways to solve these examples in the comments section.

Follow discussion on: Hacker News

33 comments:

  1. A better way to print a specific line than

    head -50 movies.csv | tail -1

    is

    sed '50q;d' movies.csv

    It's faster for big files, starts fewer processes, and requires less typing.

    Replies
    1. Thank you, useful to know!

    2. You can also use:

      sed -n 50,50p movies.csv

      (with no quotes needed around the 50,50p )

    3. That sed command can be made into a script called body (because it complements head and tail):

      https://news.ycombinator.com/item?id=8234255

    4. @Vasudev: Interesting to know but could not replicate it.
      How did you create the file?

    5. You mean how did I create the text file that I used to test the body command? Just typed it in (using vim, one word per line), then you do a:
      1G
      which moves the cursor to the 1st line of the file,
      then do a:
      !Gnl -ba
      which pipes the entire file through the nl command, which numbers lines, then nl's output replaces the complete text of the file you are editing. Then save it. See "man nl" and the !G command in vim. The !G command of vim is very powerful because it allows you to pipe (part or all of) the contents of your text file (which can also be a file of program code, like a .py / .rb / .c file), through any Unix command, and the transformed text then replaces the text you were editing. So any text transformation available via a built-in Unix command, or a custom script or program that you write, is available to you, to process your text, without having to leave the vim editor. I use it a lot when programming and writing text.

      On the other hand, if you meant - how did I create the body command - then type this sed command:

      sed -n $1,$2p $3

      into a file called just body - no extension - which you create in some folder in your PATH.
      Then do:

      chmod u+x body

      That makes it executable.
      Then you can run it from any directory like this:

      body 25 65 text_filename
      and the 25 replaces the $1, 65 replaces $2, and text_filename replaces $3.
      The -n tells sed not to print all lines from the input file (its default behavior) and the p after the $2 says to Print the lines that match the pattern - which in this case is a range of line numbers - $1,$2 ==> 25,65 - which makes sed print only those lines from the file.

      HTH.

    6. The awk equivalent:

      awk 'NR==50' movies.csv

    7. Or simply

      sed -n 50p movies.csv

    8. Yes, that's simpler - I forgot that you can specify a single line number instead of a range, though I use it with "sed 1d file" or "sed 10q file" which works like head.

  2. I don't think you meant to use `uniq -u`. This removes all copies of each line that is duplicated, meaning that The Dark Knight Rises will not appear at all in the output. If you just use `uniq`, one copy of each line will remain. Also, it is worth noting that `uniq` requires the file to be sorted first, because it only detects duplicate lines that are adjacent.

    Replies
    1. +1. I tend to go with `sort -u` most of the time as it takes care of sorting and deduping in one command.

    2. Yes, I want to repeat a part of Nathan's comment so that it is not lost: You must pass data to `sort` before using `uniq`! This is because uniq only looks at two sequential lines of context at a time, it does not remember state from the rest of the file.

  3. Just a heads up that most of these commands will fail with non-trivial data because CSV files can contain rows that span multiple lines. I wrote csvkit to solve exactly this problem: http://csvkit.readthedocs.org/en/0.8.0/ Similar commands, but it handles CSV format correctly.

    Replies
    1. yes, csvkit is a suite of tools that replace UNIX tools and add csv awareness

      another option is to use csvquote (https://github.com/dbro/csvquote) to temporarily replace the commas and newlines inside quotes, then use the UNIX commands as normal, and lastly restore the commas and newlines. the docs give some examples similar to what's in this blog post.

  4. $$ Check number of records in the file
    $$ wc -l movies.csv

    Will the returned count include the header line?

    Replies
    1. Yes, it will count header as just another record.

      You can exclude header by,
      $$ tail -n +2 movies.csv | wc -l

  5. From the manpage: uniq [-c | -d | -u] [-i] [-f num] [-s chars] [input_file [output_file]]

    `uniq movies.csv > uniqMovie.csv` You can lose the pipe

  6. Just waiting for the perl guys to pipe up to replace the venerable awk :)

  7. Is there any way to find duplicate records, disregarding one of the columns? For example, I have a CSV file with 11 columns and I'd like to find the number of duplicate rows where each column is the same except for the 10th column. Is that possible using a Unix command? I'd love to not have to open the file with Python. Thanks, and nice tutorial!

    Replies
    1. You can do this using `cut` to remove the offending column and passing the result to ` | sort | uniq -d` but be careful, cut does not understand the quoting mechanisms of CSV so if there is any data with quoted commas then it can easily snip the wrong data.

      Better to use something like csvkit if this is anything but quick-and-dirty data parsing.

    2. And here is an example that snips out the second from last column and looks for duplicates, assuming that there are no quoted commas in the last two columns:

      sed -e 's/\(.*\),.*,\(.*\)/\1,\2/' < /tmp/report.csv | sort | uniq -d

      This relies on the regex rule of "longest leftmost" so I didn't need to use any character classes of [^,] or similar. If you'd picked a different column it would necessarily be more complicated.

  8. http://tldp.org/LDP/abs/html/string-manipulation.html

  9. I think you meant to say the n-th column? "Filter the data to get only 1st, 2nd, 6th and last record

    awk 'BEGIN{FS=",";OFS=",";}{print $1,$2,$6,$NF}' uniqMovie.csv > newMovie.csv

  10. Worth noting here that uniq truncates lines (after 2K, 4K or 8K or more depending on platform). This may cause unexpected results at output when the CSV has many fields.

    Replies
    1. That problem is probably not unique to uniq (pun unintended). It's likely the same old well-known problem that many (most?) Unix command line filters have, of using BUFSIZ (possibly taken directly from the Kernighan and Ritchie book) as the buffer (char array) size for the line variable they use to process lines. And BUFSIZ was set to values like one of 512 bytes, 1K, 2K, 4K etc. so those tools fail on lines longer than BUFSIZ characters. I'm not sure of the latest status of this bug - whether it has been fixed in modern versions of *nix or not.

  11. Take a look at the paste command too. I used it for merging datasets in one project.

  12. Some notes regarding AWK:

    Note that in addition to the Action part (i.e., the stuff between the curly braces), there's the Pattern part, that allows you to select lines and other conditions. For example, to find the number of fields (in AWK parlance, a data file contains records, which in turn consist of fields) in your file (assuming no oddities later on in the file), you can just use AWK, no need for 'tail':

    awk -F',' 'NR == 1 { print NF }'

    This also comes up in your example where you want to disregard the header line:

    awk -F',' 'NR>1 { print > ($4 "_movies.csv") }'

    AWK has built-in hash tables, which allows you to avoid the 'sort | uniq' pattern in shell scripting when trying to find unique lines, which do not necessarily need to be sorted. For very large files, sort will start hitting the disk and take O(N*log(N)) time. For AWK, this will be linear time and in-memory, assuming that the number of unique lines doesn't exhaust virtual memory.

    awk -F',' '!a[$0]++' movies.csv

    Lastly, you can define AWK record and field separators as regular expressions, and not just as fixed characters or fixed strings. This allows you to process messy multi-line data files (such as Microsoft-esque CSV).

    Replies
    1. Meant to write, "data files with messy multi-line records".

    2. Nice examples. Can you explain how this one works:

      awk -F',' '!a[$0]++' movies.csv

      Not seen it before. I can see that it uses associative arrays, but what does the ! do here? negate something?

  13. This is also a good anecdote of effective use of AWK (here, the mawk implementation) for processing Big Data, which happened to be faster than even the naive C code:

    http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/

    Replies
    1. Looks cool, will have to explore it. Thank you!

  12. Install sqlite3 (<500kb); .mode csv; .import file.csv tablename; munge to your heart's content with good ol' SQL; .output file.csv; done.
