Friday, July 15, 2011

I scrape as a timepass, and it's fun

'Scraping' is a name that itself sounds negative, even though in my view scraping is a real boon to developers when there's no published API (REST, XML-based, etc.) already available. REST APIs are very widely adopted, so let's discuss REST a little first:
REST is basically this: there are resources (objects) on a server, and these resources can be accessed via the same old HTTP verbs, GET, POST, DELETE and so on (maybe after authentication, which these days is mostly via OAuth). In response to the HTTP request the server sends back a response, typically a JSON string. The good thing is that JSON is quite readable and can be parsed very easily. So clients can talk to the resources/services on a server via HTTP requests and JSON responses.
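To make that concrete, here is a minimal sketch of a REST-style call in Python; the endpoint URL and the "name" field are made up for illustration:

import json
import urllib.request

# Hypothetical REST endpoint: GET a "user" resource, get JSON back
url = "http://api.example.com/users/42"
with urllib.request.urlopen(url) as resp:
    user = json.loads(resp.read())  # the JSON maps straight onto dicts/lists
print(user["name"])  # "name" is an assumed field in this made-up response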
Let's compare that with scraping:
  • When we want to scrape, there's again an HTTP request to a server.
  • The server sends back some HTML, CSS and JS.
Now there are issues with both the HTML and the JS content that come back.
HTML issues:
  • Browsers seem to handle any hopeless HTML sent their way.
  • So the content becomes tough to parse, because the people writing the HTML tend to be sloppy.
If every content writer followed the HTML standard properly, parsing the content would be very easy. But when the HTML is a gone case, we rely on libraries which heuristically parse it and hand you nice searchable objects to play with programmatically. The logic behind why such libraries can exist is fairly simple:

If browsers can parse hopeless HTML well enough to still render it (good on them), it implies there is robust parsing code inside the browser. Mozilla Firefox's source code is open, and there's lots to learn there.
But to get a quick scrape done we really don't need to dig into a browser's HTML parsing code ourselves. Instead we've got people who've built great libraries, probably after studying exactly that code.
For my scraping needs I've enjoyed BeautifulSoup when I code in Python and Jsoup when I code in Java. Playing with these libraries is sheer fun. To get a feel for how good BeautifulSoup is, check this code that I wrote a long time back: it gets all Indian Railways train information (schedules and stoppages of different trains) from http://www.indianrail.gov.in/ and makes SQL queries to persist the same data.
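Here's a tiny sketch of the idea, assuming the bs4 package; the markup is deliberately messy (unquoted attributes, missing closing tags) and the train rows are just sample data:

from bs4 import BeautifulSoup

# Deliberately sloppy markup: unquoted attributes, missing closing tags
html = """<table>
<tr><td class=name>12951 Mumbai Rajdhani</td><td class=dep>16:35</td></tr>
<tr><td class=name>12019 Shatabdi</td><td class=dep>06:05</td>
</table>"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr"):
    print(row.find("td", class_="name").get_text(),
          row.find("td", class_="dep").get_text())

Despite the broken markup, it still hands back a clean, searchable tree.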

Ok, so now that we've established that HTML parsing is a mostly solved problem, let's talk about the JavaScript part of the HTTP response:
  • Scraping static pages is easy, as explained above.
  • But today's pages are dynamic, and AJAX calls really make the user experience better.
So if we are scraping a dynamic page, we need to understand and render the JavaScript code too. Whoa. But just think of it in terms of the browser:
"The browser understands JS and Ajax and renders the content received from AJAX calls."
But this case is very different from the HTML issue. The JS code interacts with the HTML and CSS, so now we need a whole browser. That is when COM (Component Object Model) objects enter the picture: we basically have to drive a full browser programmatically, and COM helps in doing that. Check out how to use the Internet Explorer COM object from Python:
from win32com.client import Dispatch  # needs the pywin32 package

ie = Dispatch("InternetExplorer.Application")  # launch a real IE instance via COM
# ...do stuff with the ie object: ie.Navigate(url), ie.Document, and so on
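A slightly fuller sketch of the same idea, assuming pywin32 on Windows (the URL is just a placeholder): navigate to a page, wait until IE has finished loading and running its JS, then read the rendered DOM back out.

import time
from win32com.client import Dispatch

def fetch_rendered_html(url):
    ie = Dispatch("InternetExplorer.Application")
    ie.Visible = False                    # keep the IE window hidden
    ie.Navigate(url)
    while ie.Busy or ie.ReadyState != 4:  # 4 == READYSTATE_COMPLETE
        time.sleep(0.5)
    html = ie.Document.documentElement.outerHTML  # the DOM *after* JS has run
    ie.Quit()
    return html

print(fetch_rendered_html("http://www.example.com"))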
Ok, so scraping dynamic pages is tougher than scraping static ones, but it can be done.
Also, many a time one can get at the right data by finding the URL the AJAX call goes to and then using the response received directly.
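Something along these lines, say; the endpoint here is hypothetical, and the X-Requested-With header is one that some servers check before answering AJAX-style requests:

import json
import urllib.request

# Hypothetical XHR endpoint spotted in the browser's network traffic
req = urllib.request.Request(
    "http://example.com/ajax/price?symbol=SBIN",
    headers={"X-Requested-With": "XMLHttpRequest"},  # mimic the browser's AJAX call
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))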
There's still a lot of functionality that can be built from the many static pages out there.
For example:
Visualizing realtime stock data is crucial for traders. On many of the stock-tracking websites out there, the stock data you see is some 2 minutes behind the actual time. There are also websites where an HTTP request to a server for a stock returns its realtime price, but the problem there is that the page has to be refreshed again and again to check the current price, and in one web page in the browser you can check only one stock at a time.
So a friend of mine (he is a big market lover :) ) and I tried to solve the above two issues, and this was the minimum functionality that had to be built:
  1. Choose the companies from the NSE that the user wants to track.
  2. Aggregate all the stock prices of those companies on one page.
  3. Auto-refresh the prices via AJAX every 10 seconds or so.
So we made a simple, basic realtime stock monitoring application which achieves the above. It uses GWT, with Google App Engine for hosting, both of which are interesting things to talk about.. probably in some other post. Even though the user interface isn't good, it still serves the purpose.
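Our app uses GWT and App Engine, but the core loop behind those three points is simple enough to sketch in Python; the quote URL below is a made-up placeholder for whatever price source you discover:

import time
import urllib.request

SYMBOLS = ["INFY", "TCS", "RELIANCE"]  # NSE tickers the user chose to track
QUOTE_URL = "http://example.com/quote?symbol={0}"  # hypothetical price source

def fetch_price(symbol):
    with urllib.request.urlopen(QUOTE_URL.format(symbol)) as resp:
        return resp.read().decode().strip()  # assume a plain-text price

while True:
    for sym in SYMBOLS:   # aggregate all tracked prices in one view
        print(sym, fetch_price(sym))
    time.sleep(10)        # refresh every 10 seconds or so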

Ok, so a scraped static page can be thought of as an API in itself, albeit one whose content will change with time.. and the chance that the scraping code breaks when the content changes is always there. One small tip to avoid breakage to some extent: use CSS class names, if they exist, while scraping. That way, if the page author wants to restyle the page, he/she will hopefully change the CSS properties of that class instead of the class name itself.
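To illustrate with BeautifulSoup (the class name "stock-price" here is hypothetical):

from bs4 import BeautifulSoup

html = '<tr><td>SBIN</td><td><span class="stock-price">2405.10</span></td></tr>'
soup = BeautifulSoup(html, "html.parser")

# Brittle: a positional lookup breaks the moment the layout shifts
price = soup.find_all("td")[1].find("span").get_text()

# Sturdier: hook onto the class name, which a pure restyle usually leaves alone
price = soup.find("span", class_="stock-price").get_text()
print(price)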

Finally, scraping is fun for developers like me because it helps solve many interesting problems. But think in terms of how the search-engine giants like Google work: they rely big time on scraping and crawling. Their bots literally have to crawl the whole web and scrape data based on their algos to help their servers index and search. So scraping itself is a huge revenue generator too, but for now I will enjoy the fun part of scraping.. hopefully some time in the future I'll scrape to create revenue. :)


P.S. I scrape just for timepass.. please don't sue me. ;)