Wednesday, February 6, 2013

Python: Building basic crawler with BeautifulSoup



As big data becomes a buzzword, every organization is trying hard to make the best of data science. And the actual implementation starts with collecting the right data. Data can come from internal sources, such as multiple business units, or from external sources needed by marketing and sales teams. There are many firms in the market that collect large amounts of data from open sources on the web and then sell specific, cleaned datasets to the companies that need them. With the many programming languages and libraries available for this, collecting such data is not rocket science. So let's discuss how we can build a simple crawler.

There are big frameworks like Scrapy and handy libraries like BeautifulSoup. For any crawler we must consider:

  • How can we get the source (url) 
  • How can we scale it 
  • How can we parse it
  • How can we get data of interest from parsed text


BeautifulSoup serves the last two points: it scans the big chunk of data from the source web pages and helps identify exactly what we need. For the first point, we can build the source URLs ourselves if the naming convention is predictable, like:
  • www.abc.com/a-data.php // Contains data about product a 
  • www.abc.com/b-data.php // Contains data about product b
So, if we need data for product p, we can build the URL in the program, as sketched below:
  • url = "http://www.abc.com/" + product + "-data.php"
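As a quick illustration (the product names here are made up), the same pattern can generate a whole batch of source URLs at once:

 # Hypothetical product names; the URL pattern follows the example above
 products = ["a", "b", "p"]
 urls = ["http://www.abc.com/" + p + "-data.php" for p in products]
 print urls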
Another way to get source URLs is to use search engine APIs with keywords of interest as input. For example, with the Google search API we can build a URL like the following, which returns search results associated with the query. Each result URL can then be used as a source URL; a rough sketch of this follows the example below.
  • url = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=8&start=0&q=" + query
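As a rough sketch (this AJAX search API has since been deprecated, and the responseData/results layout assumed below is from memory, not from this post), the result URLs could be pulled out of the JSON response like this:

 import urllib
 import requests

 query = urllib.quote_plus("keywords of interest")
 url = ("https://ajax.googleapis.com/ajax/services/search/web"
        "?v=1.0&rsz=8&start=0&q=" + query)

 # Fetch the search results; each hit is assumed to sit in
 # responseData.results with a "url" field
 r = requests.get(url)
 data = r.json().get("responseData") or {}
 source_urls = [item["url"] for item in data.get("results", [])]
 print source_urls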
Now that we are clear on a few concepts, let's consider a problem statement. I would like to get financial details of movies released in 2013 from thenumbers.com. The first thing is to get the source URLs. We can get the list of all 2013 movies here. Just copy the table into Excel, use this to get the URL from the linked text, and store the URLs in the Excel file.
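If copying the table by hand feels tedious, the same step can be scripted. Here is a minimal sketch that pulls the links out of a listing page with BeautifulSoup; listing_url is just a placeholder for the 2013 movie list page linked above, and the assumption that each movie link sits inside a table cell is mine:

 from bs4 import BeautifulSoup
 import requests

 # Placeholder for the 2013 movie listing page mentioned above
 listing_url = "http://www.example.com/2013-movie-list"

 r = requests.get(listing_url)
 soup = BeautifulSoup(r.text)

 # Collect the href of every link that appears inside a table cell
 movie_urls = []
 for td in soup.findAll("td"):
     for a in td.findAll("a", href=True):
         movie_urls.append(a["href"])
 print movie_urls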

At this point we have a list of source URLs stored in an Excel file. Python has a library called xlrd that can read Excel files. If we don't have these packages installed already, we can simply use:


 pip install beautifulsoup4    
 # Or    
 easy_install beautifulsoup4  

Perform the same operation for "xlrd" and "requests". At this point we have the source URLs and all the required Python packages installed. Now the program will look something like the following:

 # Import all packages we need
 from bs4 import BeautifulSoup
 import requests
 import xlrd
 import csv

 # Open the excel file and the sheet containing the data
 workbook = xlrd.open_workbook('movies.xlsx')
 sh = workbook.sheet_by_index(0)

 # Open the output file once, in append mode
 f1 = open("movie_table", 'a')
 writer1 = csv.writer(f1)

 # Loop over every row in the sheet; each row holds a url
 # The first row is headers, so start from the second row (index 1)
 for rownum in xrange(1, sh.nrows):
     # Load the url from the excel sheet (second column)
     url = sh.row_values(rownum)[1].encode('utf-8', 'ignore')
     print url
     fact = list()
     # Request the page at the url
     r = requests.get(url)
     # Parse the HTTP response with BeautifulSoup
     soup = BeautifulSoup(r.text)
     table1 = soup.findAll("div", {"id": "moviechart"})[0]
     table1_rows = table1.findAll("tr")
     for t in table1_rows:
         cells = t.findAll("td")
         fact.append((cells[0].text, cells[1].text))
         # Write the source url plus the (label, value) pair to the csv
         writer1.writerow((url,
                           cells[0].text.encode('utf-8', 'ignore'),
                           cells[1].text.encode('utf-8', 'ignore')))
     print fact

 f1.close()
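The loop above assumes every page has the div with id "moviechart" and that every row in it has at least two cells; if a page doesn't, the indexing will raise an exception and stop the crawl. A slightly more defensive version of the extraction step, as a sketch, would be:

 from bs4 import BeautifulSoup
 import requests

 def get_movie_facts(url):
     # Fetch one movie page and return (label, value) pairs from its
     # financials table, or an empty list if the table is missing
     r = requests.get(url)
     soup = BeautifulSoup(r.text)
     divs = soup.findAll("div", {"id": "moviechart"})
     if not divs:
         return []
     facts = []
     for row in divs[0].findAll("tr"):
         cells = row.findAll("td")
         if len(cells) >= 2:
             facts.append((cells[0].text, cells[1].text))
     return facts

Each URL from the Excel sheet can then be passed through get_movie_facts and the non-empty results written to the CSV as before.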
