Friday, August 13, 2010

Getting Started with Nokogiri

The XPath language was written to easily traverse an XML tree structure, but we can use it with HTML trees as well. Here’s a sample program for extracting search result links from a Google search. We’ll use XPath to find the data we want, and then pick apart the XPath syntax:
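require 'open-uri'
require 'nokogiri'

# Fetch and parse the results page. URI.open comes from open-uri;
# a bare open() no longer accepts URLs in Ruby 3.0 and later.
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))
# Print the text of every link that sits directly inside an h3.
doc.xpath('//h3/a').each do |node|
  puts node.text
end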
The XPath used in this program is:
//h3/a
In English, this XPath says:
Find all “a” tags with a parent tag whose name is “h3”
Thus, our program finds all “a” tags with “h3” parents, loops over them, and prints out the text content.
XPath works like a directory structure where the leading “/” indicates the root of the tree. Slashes separate the tag matching information. When there’s nothing between slashes (“//”), it acts as a sort of wildcard for depth: the next tag may sit at any level below that point, so “//h3” matches “h3” tags anywhere in the document. The “h3” and “a” are tag name matchers, and only match when the tag name matches.


Anonymous said...

Ok, can you, please, suggest how to find all content separated by a tag with nokogiri?

For example

I need to get an array of the pieces separated by br tags.

byungjin said...

I'm so sorry for my late response. You can get the array of pieces with the code below:

doc.xpath('//h3/a').each do |node|
  puts node.text
end

I think you could change the 'h3' tag to a 'br' tag.
Thank you.
