Mr eel
internets.search("[text()*=Hpricot Kicks Arse]")
There are far too many Ruby libraries I hear about and never use… until the one day I actually start playing with them and think “ZOMG this is awesome, why haven’t I used it before?”. I seem to do this a lot. Today, let’s talk about Hpricot.
It is essentially a library for navigating through HTML documents and extracting the contents. It’s the Rolls-Royce of screen-scrapers. It makes it extremely easy to select elements using CSS syntax, grabbing either an array of elements or the first match. Not only that, it’ll let you modify the contents of elements and output your updated document. As you might guess, there are many wicked things we can do with this!
But let’s start out by scraping some results from Google. The first step is to require Hpricot and Open-URI. We need Open-URI to actually retrieve the HTML from the server.
require 'rubygems'
require 'hpricot'
require 'open-uri'
Now let’s actually open a URL and generate a Hpricot document. Let’s search for some information on Behold… The Arctopus.
url = "http://www.google.com.au/search?hl=en&q=behold+the+arctopus&btnG=Search&meta="
doc = open(url) {|f| Hpricot(f)}
We call the open method in open-uri and pass in our URL. You can see I’ve stuck ‘behold+the+arctopus’ inside the query string. We also pass a block to open which gives us access to the HTML document returned. We pass this off to Hpricot, which then returns a Hpricot document. Now the fun begins!
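The query string above is typed out by hand, which is fine for one search. If you want to plug in arbitrary search terms, here is a minimal sketch that builds the same kind of URL with Ruby’s standard URI library. The method name google_search_url is my own invention, and I’ve only kept the hl and q parameters from the original URL on the assumption that the others are optional.

```ruby
require 'uri'

# Hypothetical helper (not part of Hpricot or Open-URI): builds a search
# URL like the hand-written one above, escaping the terms for us.
def google_search_url(terms)
  # encode_www_form turns spaces into '+' and escapes anything unsafe.
  query = URI.encode_www_form(hl: 'en', q: terms)
  "http://www.google.com.au/search?#{query}"
end

puts google_search_url('behold the arctopus')
```

The resulting string can be handed to open exactly like the hard-coded URL.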
Our Hpricot document has lots of nice methods in it for grabbing elements, their contents and manipulating them. So, let’s say I was looking for YouTube videos of our beloved Prog-Metal-Avant-Totally-Mental instrumentalists. We can do this by looking for an anchor which contains the word ‘YouTube’.
vids = doc.search("a[text()*=YouTube]")
We call the search method on the document and pass in our CSS selector. This returns a collection of matching elements. If we only wanted the first match, we could replace search with at. The selector is a little bit clever. It’s saying give us the anchors with text nodes that contain ‘YouTube’ in them somewhere. Unlike most browsers, Hpricot supports a huge range of selectors. A bitter, bitter reminder of what we’re missing, my friends.
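To make the difference between search and at concrete, here is a toy model in plain Ruby. It is emphatically not how Hpricot works inside: it uses a fragile regex on some sample markup I made up to mimic the results, and it checks the inner HTML rather than real text nodes. The names anchors_in, search_anchors and first_anchor are hypothetical.

```ruby
# Sample markup invented for illustration only.
SAMPLE_HTML = <<~HTML
  <a href="http://www.youtube.com/watch?v=wniXxeTJlyM">YouTube - <b>Behold... The Arctopus</b> in Guitar one magazine</a>
  <a href="http://example.com/other">Some other result</a>
  <a href="http://www.youtube.com/watch?v=t80_eFghMdk">YouTube - <b>Behold... The Arctopus</b> - Transient Exuberance</a>
HTML

# Pull every simple anchor out as a { href:, inner_html: } hash.
# A regex like this breaks on real-world HTML; that is why we use Hpricot.
def anchors_in(html)
  html.scan(%r{<a href="([^"]*)">(.*?)</a>}m).map do |href, inner|
    { href: href, inner_html: inner }
  end
end

# Behaves like search: every anchor whose contents include the needle.
def search_anchors(html, needle)
  anchors_in(html).select { |a| a[:inner_html].include?(needle) }
end

# Behaves like at: only the first match, or nil when nothing matches.
def first_anchor(html, needle)
  search_anchors(html, needle).first
end

puts search_anchors(SAMPLE_HTML, 'YouTube').map { |a| a[:href] }
puts first_anchor(SAMPLE_HTML, 'YouTube')[:href]
```

The thing to notice is the nil: a "first match" lookup can come back empty, so it pays to check before calling anything on the result.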
Now, let’s actually list out the links. We can loop through the collection and write out attributes for each. We can access the tag attributes for each element using the [] method, passing it a symbol for the relevant attribute.
vids.each do |vid|
puts vid[:href]
end
The code above would output:
http://www.youtube.com/watch?v=wniXxeTJlyM
http://www.youtube.com/watch?v=t80_eFghMdk
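Once you have the hrefs, you may want just the video ID rather than the whole link. Here is a small sketch using only the standard URI library. The method name youtube_video_id is made up for this example, and it assumes the ID always lives in the v query parameter, as it does in the two links above.

```ruby
require 'uri'

# Hypothetical helper: returns the value of the "v" query parameter,
# or nil if the link has no query string or no "v" in it.
def youtube_video_id(href)
  query = URI.parse(href).query
  return nil if query.nil?

  # decode_www_form gives [["v", "..."], ...]; to_h makes lookup easy.
  URI.decode_www_form(query).to_h['v']
end

puts youtube_video_id('http://www.youtube.com/watch?v=wniXxeTJlyM')
puts youtube_video_id('http://www.youtube.com/watch?v=t80_eFghMdk')
```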
We also have access to the innerHTML of the elements:
vids.each do |vid|
puts vid.inner_html
end
# Outputs
YouTube - <b>Behold... The Arctopus</b> in Guitar one magazine
YouTube - <b>Behold... The Arctopus</b> - Transient Exuberance
Now you can see that our link elements actually have bold tags nested inside of them. Since all the methods we have access to in the doc are available in the elements, we can select child elements with search or at. For no other reason than serving as a contrived example, let’s go, YAAAAAAY!
vids.each do |vid|
puts vid.at("b").inner_html
end
# Outputs
Behold... The Arctopus
Behold... The Arctopus
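If what you’re really after is the whole link title minus the markup, rather than just the bold bit, a crude option is to strip the tags from the inner HTML string yourself. This is a rough sketch in plain Ruby, with a method name (strip_tags) I picked for the example. It assumes flat snippets like the ones above and will not cope with arbitrary HTML.

```ruby
# Hypothetical helper: removes anything that looks like a tag.
# Good enough for simple snippets; not a real HTML parser.
def strip_tags(inner_html)
  inner_html.gsub(/<[^>]+>/, '')
end

puts strip_tags('YouTube - <b>Behold... The Arctopus</b> in Guitar one magazine')
puts strip_tags('YouTube - <b>Behold... The Arctopus</b> - Transient Exuberance')
```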
OK, so not that interesting, but I’m sure you can see how potentially useful Hpricot is. Want to programmatically access a site without an API? OH HO HO HO! Hpricot for the motherbitchin’ win.