Getting Started with Nokogiri ~ TwinrainbO

Friday, August 13, 2010

6:42 PM

byungjin

The XPath language was designed to traverse an XML tree structure easily, but we can use it with HTML trees as well. Here’s a sample program that extracts search result links from a Google search. We’ll use XPath to find the data we want, and then pick apart the XPath syntax:

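require 'open-uri'
require 'nokogiri'

# Fetch the search results page and parse it as HTML
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))

# Print the text of every link that sits directly inside an <h3>
doc.xpath('//h3/a').each do |node|
  puts node.text
end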

The XPath used in this program is:

//h3/a

In English, this XPath says:

Find all “a” tags with a parent tag whose name is “h3”

Thus, our program finds all “a” tags with “h3” parents, loops over them, and prints out the text content.

XPath works like a directory path: a leading “/” indicates the root of the tree, and slashes separate the tag matching steps. When there’s nothing between two slashes (“//”), it’s a sort of wildcard for depth, meaning the next step may match at any level below that point. The “h3” and “a” are tag name matchers, and match only tags with exactly those names.

Posted in: Programming, Rails

2 comments:

Anonymous said...: OK, can you please suggest how to find all content separated by a <br> tag with Nokogiri?

For example:
"text<br>
text"

I need to get an array of the pieces separated by the br tag.; August 17, 2010 at 1:59 AM
byungjin said...: I'm so sorry for my late response. You can get the array of pieces with the code below:

doc.xpath('//h3/a').each do |node|
  puts node.text
end

I think you might change the 'h3' tag to the 'br' tag.
Thank you.; November 17, 2010 at 5:57 PM
