So how do we go about fixing the problem?
Crawl AJAX Like A Human Would
- DOM event handling and dispatching
- Dynamic DOM content extraction
The Necessary Tools
Watir is a library that enables browser automation using Ruby. It was originally built for IE, but it’s been ported to both Firefox and Safari as well. The Watir API allows you to launch a browser process and then, directly from your Ruby application, extract anchor links from the page and click on them. This use case alone makes me want to get more familiar with Ruby.
Crowbar is another interesting tool, which uses a headless version of Firefox to render and parse web content. What’s cool is that it provides a web server interface to the browser, so you can issue simple GET or POST requests from any language and then scrape the results as needed. This lets you interact with the browser even from simple command-line scripts, using curl or wget.
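To make the "simple GET request" idea concrete, here is a minimal Python sketch of what talking to Crowbar might look like. It assumes a Crowbar instance is listening locally on port 10000 and that it accepts `url` and `delay` query parameters; both the address and the parameter names are assumptions to check against your own install, and the helper names `build_crowbar_url` and `fetch_rendered_html` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: address of a locally running Crowbar instance. Adjust as needed.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_url(target_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build the GET URL that asks Crowbar to render target_url.

    The `url` and `delay` parameter names are assumptions about Crowbar's
    interface; delay_ms is how long the browser should wait for scripts
    to finish before the page is serialized.
    """
    query = urlencode({"url": target_url, "delay": delay_ms})
    return f"{endpoint}?{query}"


def fetch_rendered_html(target_url, delay_ms=3000, timeout=30):
    """Fetch the rendered DOM of target_url through Crowbar.

    Requires a running Crowbar server; raises URLError if none is reachable.
    """
    request_url = build_crowbar_url(target_url, delay_ms)
    with urlopen(request_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print the request URL only; call fetch_rendered_html() once Crowbar is up.
    print(build_crowbar_url("http://example.com/page?id=1"))
```

The URL building is kept separate from the network call so you can inspect exactly what would be sent before pointing it at a live server. The same request should work from curl or wget by pasting the printed URL.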