Dissecting the web with ruby and hpricot

Let me introduce you to an old friend of mine: Hpricot

Hpricot has saved my life many times when I had to parse (x)html documents and extract information programmatically. It features a very nice ruby-ish syntax and a blazingly fast xpath-based parser. Combine it with open-uri, and you’re in for a fun ride creating a data-extracting web spider.

Now to the basics of hpricot. Of course hpricot is available as a gem, just use the gem util, Luke:

gem install hpricot

Then, in your sources:

require 'rubygems'  # not necessary in ruby 1.9+
require 'hpricot'

Hpricot uses a Hpricot::Doc object to represent the parsed document. It can be initialized with a string or with an IO object such as a file or some url opened by using open-uri:

doc = Hpricot('<p>A simple <b>html</b> doc.</p>')
doc = Hpricot(open('myfile.html').read)
require 'open-uri'
doc = Hpricot(open('http://mydomain.tld/what-a-cool-site.html').read)

So now we have a document opened and a Hpricot::Doc object. Let’s do some searching in it. We’ll be using XPath expressions; for reference, see the W3C XPath page.

Suppose we have this html chunk loaded:

 <table class="product_list">
     <tr class="product_row">
         <td class="name">My shiny product</td>
         <td class="price">&euro; 1234</td>
     </tr>
     <tr class="product_row">
         <td class="name">My other shiny product</td>
         <td class="price">&euro; 234</td>
     </tr>
 </table>

We’d like to convert this to a hash containing the product data in the form {"name" => "price"}.

Using traditional methods, you’d probably split by table rows and traverse the results splitting/gsubbing/regexping your way through the html code.

With hpricot though, it’s as easy as (highly unoptimized code):

products = {}
doc.search("//table[@class='product_list']/tr").each { |tr_item|
    cur_name = ''
    cur_price = ''
    tr_item.search("//td").each { |td_item|
        cur_name = td_item.innerHTML if td_item.attributes['class'] == 'name'
        cur_price = td_item.innerHTML.split(' ')[1] if td_item.attributes['class'] == 'price'
    }
    products[cur_name] = cur_price
}

Two things to note: a Hpricot node (element) has an innerHTML method, which returns the raw string between its opening and closing tags, and an attributes hash through which you can access all tag attributes, such as href, title, alt, target, src, class, style, etc.
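The split(' ')[1] trick in the loop above only works when the currency entity and the amount are separated by exactly one space. A slightly more tolerant way to get the amount out of a price cell could look like the sketch below. It is plain ruby with no Hpricot involved, and parse_price is a made-up helper name, not something the library provides.

```ruby
# Hypothetical helper (not part of Hpricot): pull the numeric amount out of
# a price cell's inner HTML, e.g. "&euro; 1234". Returns a string, or nil
# when no number is found.
def parse_price(cell_html)
  # Drop HTML entities such as "&euro;" first, so the digits inside a
  # numeric entity like "&#8364;" are not mistaken for the amount.
  text = cell_html.gsub(/&#?\w+;/, '')
  match = text.match(/\d+(?:[.,]\d+)?/)
  match && match[0]
end

puts parse_price('&euro; 1234')   # prints 1234
puts parse_price('&euro;234')     # prints 234
p parse_price('n/a')              # prints nil
```

Inside the loop you would then write cur_price = parse_price(td_item.innerHTML) instead of the split call.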

Now let’s see a more concrete example: gathering page titles and links from a Google search results page.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'cgi'

USERAGENT = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_2; cs-cz) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13'
GOOGLE_SEARCH_URL_TEMPLATE = 'http://www.google.com/search?q=%SEARCH_STRING%'
Hpricot.buffer_size = 262144

def fetch_search_results(search_string)
  results = []
  doc = Hpricot(open(GOOGLE_SEARCH_URL_TEMPLATE.gsub('%SEARCH_STRING%', CGI::escape(search_string)), 'User-Agent' => USERAGENT).read)
  doc.search('//h3[@class="r"]/a').each { |result|
     results << {:title => result.innerHTML, :url => result.attributes['href']}
  }
  results
end

require 'pp'
pp fetch_search_results('ruby')

For me, it produces:

[{:title=>"<em>Ruby</em> Programming Language",
  :url=>"http://www.ruby-lang.org/"},
 {:title=>"Download <em>Ruby</em>",
  :url=>"http://www.ruby-lang.org/en/downloads/"},
 {:title=>
   "<em>Ruby</em> - The Inspirational Weight Loss Journey on the Style Network <b>...</b>",
  :url=>"http://www.mystyle.com/mystyle/shows/ruby/index.jsp"},
 {:title=>
   "<em>Ruby</em> (programming language) - Wikipedia, the free encyclopedia",
  :url=>"http://en.wikipedia.org/wiki/Ruby_(programming_language)"},
 {:title=>"<em>Ruby</em> - Wikipedia, the free encyclopedia",
  :url=>"http://en.wikipedia.org/wiki/Ruby"},
 {:title=>"<em>Ruby</em> on Rails", :url=>"http://rubyonrails.org/"},
 {:title=>"<em>Ruby&#39;s</em> Diner - rubys.com",
  :url=>"http://www.rubys.com/"},
 {:title=>"<em>Ruby</em> Boutique", :url=>"http://www.rubynz.com/"},
 {:title=>"<em>Ruby</em> Annotation", :url=>"http://www.w3.org/TR/ruby/"},
 {:title=>"[<em>Ruby</em>-Doc.org: Documenting the <em>Ruby</em>  Language]",
  :url=>"http://ruby-doc.org/"}]
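As the output shows, innerHTML gives you the titles with their inline markup (<em>, <b>) and entities (&#39;) still in place. If you want plain text, a rough cleanup could look like the sketch below. It uses only the cgi library from the standard library, plain_title is a made-up helper name, and stripping tags with a regexp is a shortcut that is fine for short snippets like these but not a general HTML sanitizer.

```ruby
require 'cgi'

# Hypothetical helper: turn a result title such as
# "<em>Ruby&#39;s</em> Diner - rubys.com" into plain text.
def plain_title(html)
  without_tags = html.gsub(/<[^>]+>/, '')  # strip inline tags like <em> and <b>
  CGI.unescapeHTML(without_tags).strip     # decode entities such as &#39;
end

puts plain_title('<em>Ruby&#39;s</em> Diner - rubys.com')
puts plain_title('Download <em>Ruby</em>')
```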

Now some explanations for the new things here.

  • USERAGENT : many sites won’t serve content unless they see a valid user agent. Rule #1 for web scraping. Find your favorite useragent string on google :)
  • Hpricot.buffer_size : hpricot may run out of buffer space while parsing larger documents, use this for prevention
  • CGI::escape : I probably don’t have to explain that strings have to be url-encoded before using them in – well, urls :)
  • '//h3[@class="r"]/a' : how did I get this? I fired up a browser and looked at the source of a google search results page.
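To see what CGI::escape actually does to the search string, here is a small sketch that builds the request url on its own, without fetching anything. The template is the one from the script above, and search_url is a made-up helper name that mirrors the gsub call inside fetch_search_results.

```ruby
require 'cgi'

# Same template as in the script above.
GOOGLE_SEARCH_URL_TEMPLATE = 'http://www.google.com/search?q=%SEARCH_STRING%'

# Hypothetical helper: fill the url-encoded search string into the template.
def search_url(search_string)
  GOOGLE_SEARCH_URL_TEMPLATE.gsub('%SEARCH_STRING%', CGI.escape(search_string))
end

puts search_url('ruby')
# prints http://www.google.com/search?q=ruby
puts search_url('ruby & hpricot')
# prints http://www.google.com/search?q=ruby+%26+hpricot
```

Without the escaping, the & in the second query would be read as a parameter separator and everything after it would be dropped from the search.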

I hope you’ve found this quick intro useful. I’d really like to hear your opinions and thoughts on it. Oh yes, and hpricot use cases, too!


 
