Watir: a good alternative for web scraping with Ruby
Cross-posted from Medium.
I’ve been a fan of web scraping and Ruby for a while. I’ve used mechanize in the past to scrape static websites. It’s a great gem with decent documentation, and there are plenty of tutorials and blog posts around the internet on how to use it.
But there is a big problem with mechanize. It uses nokogiri to transform scraped HTML into a page object. These days, more and more websites ship JavaScript-heavy content, and some are complete single-page apps that load only minimal HTML initially and fetch the rest with JavaScript. When we point mechanize at those sites, we get mostly empty page objects, because the content is never loaded.
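Here’s a minimal sketch of the problem (the URL and the .article selector are placeholders, not a real site):

```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/js-heavy-page')

# mechanize only parses the initial HTML response; it never executes the
# page's JavaScript, so selectors for JS-rendered nodes match nothing.
puts page.search('.article').size # => 0 on a JS-rendered site
```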
The solution to that problem is to run a separate browser process that simulates a true browser environment, using tools like Selenium or PhantomJS. The Capybara gem does exactly that when we write acceptance tests with it, and I’ve been using Capybara for years now.
Only today did I realize there is another alternative called Watir. At first I thought it was a new gem, but its blog page shows archives dating back to 2009, so it must have been around for a long time under my radar.
There is a good discussion about the differences between Watir and Capybara. There are philosophical differences, like an object-oriented API versus a DSL, but they mostly provide the same functionality; the sketch below gives a flavor of the contrast. My personal experience with Capybara has not been great when it comes to AJAX-loaded content, because I needed workarounds, so I was hoping for a better experience.
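Roughly, the two styles look like this (a rough sketch, assuming both gems and Chrome are installed; the URL and the .price selector are made up):

```ruby
require 'capybara'
require 'capybara/dsl'
require 'watir'
require 'webdrivers'

# Capybara's DSL style: helper methods mixed into the current context.
Capybara.default_driver = :selenium_chrome
include Capybara::DSL
visit 'https://example.com'
puts find('.price').text

# Watir's object-oriented style: everything hangs off a Browser instance.
browser = Watir::Browser.new
browser.goto 'https://example.com'
puts browser.element(css: '.price').text
browser.close
```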
To get started with Watir, I needed to install both the watir gem and the webdrivers gem that it uses underneath:
```
gem install watir webdrivers
```
Then, I had to require them both; otherwise, it didn’t work:

```ruby
require 'watir'
require 'webdrivers'

# Watir launches Chrome by default; webdrivers fetches the matching driver.
browser = Watir::Browser.new
```
If I want to find an element by CSS:

```ruby
# The selector is just an example; any CSS selector works here.
browser.element(css: '.article .title')
```
The sweet thing is explicitly waiting for JavaScript to render the content:

```ruby
content = browser.div(id: 'content') # placeholder id for a JS-rendered node
content.wait_until_present
puts content.text
```
In the above example, I used Watir’s waiting API, calling wait_until_present to wait for JavaScript to render that content. Once it returns, I can be sure the element is on the page when the next line of code executes.
For my current scraping task, that was all I needed, because the site I want to scrape uses JS and AJAX to load the content only after the initial page load is done. All I have to do is wait until JS renders the content and scrape from there.
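Put together, a quick-and-dirty script looks something like this (a sketch; the URL and selectors are made up for illustration):

```ruby
require 'watir'
require 'webdrivers'

browser = Watir::Browser.new
browser.goto 'https://example.com/articles' # placeholder URL

# Block until the AJAX-loaded list shows up before touching it.
browser.div(css: '.articles').wait_until_present

# Grab the text of every title node; the selector is made up.
titles = browser.elements(css: '.articles .title').map(&:text)
puts titles

browser.close
```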
The thing that made me fall in love with it at first sight is how simple it is to set up. I don’t want to do a lot of configuration when writing quick and dirty scripts, and I don’t have to write annoying wait_for_ajax helpers either. The selector API is simple and intuitive, and I love the explicit waiting methods it provides.
I think I’ll reach for it the next time I need to scrape something.