Ruby XML, XSLT and XPath Tutorial
What is XML?
XML stands for eXtensible Markup Language.
XML is a subset of the Standard Generalized Markup Language (SGML) and is used to give electronic documents an explicit structure.
It can be used to tag data and define data types, and it lets users define their own markup languages. It is well suited to transmission over the web, providing a uniform, application- and vendor-independent way to describe and exchange structured data.
For more details, please refer to our XML Tutorial.
XML Parser Structure and APIs
The two main approaches to parsing XML are DOM and SAX.
A SAX parser is event-driven: it scans the XML document from start to finish, and whenever it encounters a syntactic construct (such as a start tag, an end tag, or text), it calls the event handler for that construct, sending an event to the application.
DOM (Document Object Model) parsing builds a hierarchical tree of the document in memory, with each node of the tree represented as an object. Once parsing is complete, the entire DOM tree is held in memory.
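To make the in-memory tree concrete, here is a minimal sketch using REXML (the library introduced in the next section). The inline document and the constant name TREE_SAMPLE_XML are made up for illustration; the point is only that, once the tree is built, it can be walked in any direction.

```ruby
require 'rexml/document'

# Hypothetical two-movie document, inlined so the sketch is self-contained.
TREE_SAMPLE_XML = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><format>DVD</format></movie>
    <movie title="Ishtar"><format>VHS</format></movie>
  </collection>
XML

doc  = REXML::Document.new(TREE_SAMPLE_XML)  # the whole tree is built here
root = doc.root

# Because the tree stays in memory, we can move down, sideways, and back up.
first_movie = root.elements[1]          # element indexes are 1-based
sibling     = first_movie.next_element  # next sibling that is an element
parent      = sibling.parent            # back up to <collection>

puts root.name
puts first_movie.attributes["title"]
puts sibling.attributes["title"]
puts parent.equal?(root)
```

A SAX-style parser, by contrast, never holds this tree; the application sees each construct once, in document order.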
Parsing and Creating XML in Ruby
In Ruby, parsing XML documents can be done using the REXML library.
REXML is an XML toolkit written in pure Ruby and compliant with the XML 1.0 specification.
REXML has been part of the Ruby standard library since Ruby 1.8. Since Ruby 3.0 it ships as a bundled gem, so projects that use Bundler may need to list rexml in their Gemfile.
To load the library, require rexml/document.
All methods and classes are encapsulated within the REXML module.
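As a small illustration of that encapsulation, the sketch below reaches every class through the REXML:: prefix instead of including the module. The one-element document is a made-up example.

```ruby
require 'rexml/document'

# Without `include REXML`, every class is reached through the module name.
doc = REXML::Document.new('<collection shelf="New Arrivals"/>')

puts doc.class                     # REXML::Document
puts doc.root.class                # REXML::Element
puts doc.root.attributes["shelf"]  # New Arrivals
```

Using the prefix keeps names such as Document and Element out of the top-level namespace, which can matter in larger programs.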
The REXML parser has the following advantages:
- 100% written in Ruby.
- Supports both SAX-style (stream) and DOM-style (tree) parsing.
- Lightweight, with fewer than 2000 lines of code.
- Easy-to-understand methods and classes.
- Based on the SAX2 API, with full XPath support.
- Comes with Ruby installation, no separate installation required.
Below is a sample XML document, saved as movies.xml:
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
<movie title="Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
DOM Parser
Let's parse the XML data first. We start by requiring the rexml/document library. For convenience, we also include the REXML module in the top-level namespace so that its classes can be used without the REXML:: prefix:
Example
#!/usr/bin/ruby -w
require 'rexml/document'
include REXML
xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)
# Get the root element
root = xmldoc.root
puts "Root element : " + root.attributes["shelf"]
# Output movie titles
xmldoc.elements.each("collection/movie") do |e|
  puts "Movie Title : " + e.attributes["title"]
end
# Output all movie types
xmldoc.elements.each("collection/movie/type") do |e|
  puts "Movie Type : " + e.text
end
# Output all movie descriptions
xmldoc.elements.each("collection/movie/description") do |e|
  puts "Movie Description : " + e.text
end
The above example outputs:
Root element : New Arrivals
Movie Title : Enemy Behind
Movie Title : Transformers
Movie Title : Trigun
Movie Title : Ishtar
Movie Type : War, Thriller
Movie Type : Anime, Science Fiction
Movie Type : Anime, Action
Movie Type : Comedy
Movie Description : Talk about a US-Japan war
Movie Description : A schientific fiction
Movie Description : Vash the Stampede!
Movie Description : Viewable boredom
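The section title also mentions creating XML. A minimal sketch of building a document with REXML and writing it out might look like this; the helper name build_collection is hypothetical, and the content simply reuses part of movies.xml.

```ruby
require 'rexml/document'

# Hypothetical helper: builds a small document from scratch.
def build_collection
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new("1.0", "UTF-8")

  # add_element takes the tag name and an optional hash of attributes.
  root  = doc.add_element("collection", "shelf" => "New Arrivals")
  movie = root.add_element("movie", "title" => "Enemy Behind")
  movie.add_element("type").text   = "War, Thriller"
  movie.add_element("format").text = "DVD"
  doc
end

doc = build_collection

# Serialize into a string with two-space indentation.
output = String.new
doc.write(output, 2)
puts output
```

Passing an indent to write pretty-prints the tree, which also adds whitespace around text nodes; omit the indent if the exact text content matters.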
SAX-like Parser
The following example processes the same data file, movies.xml. SAX-style parsing is generally not worth the extra complexity for a file this small; it is shown here only as a simple demonstration:
Example
#!/usr/bin/ruby -w
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
class MyListener
  include REXML::StreamListener
  # Called for every start tag with the tag name and a hash of attributes
  def tag_start(*args)
    puts "tag_start: #{args.map { |x| x.inspect }.join(', ')}"
  end
  # Called for every run of character data
  def text(data)
    return if data.strip.empty? # skip whitespace-only text
    abbrev = data[0, 40] + (data.length > 40 ? "..." : "")
    puts " text : #{abbrev.inspect}"
  end
end
list = MyListener.new
xmlfile = File.new("movies.xml")
Document.parse_stream(xmlfile, list)
The above outputs (Ruby 3.4 and later print the hashes with spaces around =>):
tag_start: "collection", {"shelf"=>"New Arrivals"}
tag_start: "movie", {"title"=>"Enemy Behind"}
tag_start: "type", {}
 text : "War, Thriller"
tag_start: "format", {}
 text : "DVD"
tag_start: "year", {}
 text : "2003"
tag_start: "rating", {}
 text : "PG"
tag_start: "stars", {}
 text : "10"
tag_start: "description", {}
 text : "Talk about a US-Japan war"
tag_start: "movie", {"title"=>"Transformers"}
tag_start: "type", {}
 text : "Anime, Science Fiction"
tag_start: "format", {}
 text : "DVD"
tag_start: "year", {}
 text : "1989"
tag_start: "rating", {}
 text : "R"
tag_start: "stars", {}
 text : "8"
tag_start: "description", {}
 text : "A schientific fiction"
tag_start: "movie", {"title"=>"Trigun"}
tag_start: "type", {}
 text : "Anime, Action"
tag_start: "format", {}
 text : "DVD"
tag_start: "episodes", {}
 text : "4"
tag_start: "rating", {}
 text : "PG"
tag_start: "stars", {}
 text : "10"
tag_start: "description", {}
 text : "Vash the Stampede!"
tag_start: "movie", {"title"=>"Ishtar"}
tag_start: "type", {}
 text : "Comedy"
tag_start: "format", {}
 text : "VHS"
tag_start: "rating", {}
 text : "PG"
tag_start: "stars", {}
 text : "2"
tag_start: "description", {}
 text : "Viewable boredom"
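Printing every event is useful for seeing how the stream works, but a listener usually keeps a little state so it can extract just the data it needs. The sketch below is one possible way to do that; the class name FormatCollector, the constant STREAM_SAMPLE_XML, and the shortened inline document are all made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: records each movie's format without building a tree.
class FormatCollector
  include REXML::StreamListener

  attr_reader :formats

  def initialize
    @formats = {}          # title => format
    @current_title = nil   # title of the <movie> we are currently inside
    @in_format = false     # true while inside a <format> element
  end

  def tag_start(name, attrs)
    @current_title = attrs["title"] if name == "movie"
    @in_format = (name == "format")
  end

  def text(data)
    @formats[@current_title] = data.strip if @in_format && @current_title
  end

  def tag_end(name)
    @in_format = false if name == "format"
    @current_title = nil if name == "movie"
  end
end

# Shortened stand-in for movies.xml so the sketch is self-contained.
STREAM_SAMPLE_XML = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><type>War, Thriller</type><format>DVD</format></movie>
    <movie title="Ishtar"><type>Comedy</type><format>VHS</format></movie>
  </collection>
XML

collector = FormatCollector.new
REXML::Document.parse_stream(STREAM_SAMPLE_XML, collector)
collector.formats.each { |title, format| puts "#{title}: #{format}" }
```

Only the small formats hash is kept in memory, which is the main reason to prefer stream parsing for very large documents.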
XPath and Ruby
We can use XPath to query XML. XPath, the XML Path Language, is a language for addressing parts of an XML document (see: XPath Tutorial).
XPath is based on the tree structure of XML and provides the ability to navigate through the nodes of that tree.
Ruby supports XPath through REXML's XPath class, which operates on a parsed tree (Document Object Model).
Example
#!/usr/bin/ruby -w
require 'rexml/document'
include REXML
xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)
# Information for the first movie
movie = XPath.first(xmldoc, "//movie")
p movie
# Print all movie types
XPath.each(xmldoc, "//type") { |e| puts e.text }
# Get all movie formats as an array
formats = XPath.match(xmldoc, "//format").map { |x| x.text }
p formats
The above example outputs:
<movie title='Enemy Behind'> ... </>
War, Thriller
Anime, Science Fiction
Anime, Action
Comedy
["DVD", "DVD", "DVD", "VHS"]
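XPath expressions can also filter nodes with predicates and select attributes directly. The following sketch assumes a shortened inline copy of movies.xml (the constant XPATH_SAMPLE_XML is made up for illustration) and uses the same XPath.first and XPath.match calls as above.

```ruby
require 'rexml/document'

# Shortened stand-in for movies.xml so the sketch is self-contained.
XPATH_SAMPLE_XML = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><type>War, Thriller</type><format>DVD</format></movie>
    <movie title="Trigun"><type>Anime, Action</type><format>DVD</format></movie>
    <movie title="Ishtar"><type>Comedy</type><format>VHS</format></movie>
  </collection>
XML

doc = REXML::Document.new(XPATH_SAMPLE_XML)

# Attribute predicate: the <type> of the movie whose title is "Trigun".
trigun_type = REXML::XPath.first(doc, "//movie[@title='Trigun']/type").text
puts trigun_type

# Child-element predicate: titles of all movies whose <format> is DVD.
dvd_titles = REXML::XPath.match(doc, "//movie[format='DVD']")
                         .map { |movie| movie.attributes["title"] }
p dvd_titles

# Attribute axis: select the title attributes themselves.
all_titles = REXML::XPath.match(doc, "//movie/@title").map(&:value)
p all_titles
```

Predicates keep the filtering inside the query, so the Ruby code only has to handle the nodes it actually needs.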
XSLT and Ruby
Two XSLT processors are available for Ruby; both are briefly described below:
Ruby-Sablotron
This processor is written and maintained by Masayoshi Takahashi. It was written primarily for the Linux operating system and requires the following libraries:
- Sablot
- Iconv
- Expat
You can find these libraries at Ruby-Sablotron.
XSLT4R
XSLT4R needs XMLScan to operate. XMLScan is included in the XSLT4R archive and is also a 100% Ruby module. Both modules can be installed using the standard Ruby installation method (i.e., ruby install.rb).
The command-line syntax for XSLT4R is as follows:
ruby xslt.rb stylesheet.xsl document.xml [arguments]
If you want to use XSLT4R from within your application, you can require xslt and pass in the parameters you need. Here is an example:
Example
require "xslt"
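# File.read returns the whole file as one string
# (File.readlines(...).to_s would return the inspect form of an array of lines)
stylesheet = File.read("stylesheet.xsl")
xml_doc = File.read("document.xml")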
arguments = { 'image_dir' => '/....' }
sheet = XSLT::Stylesheet.new( stylesheet, arguments )
# output to StdOut
sheet.apply( xml_doc )
# output to 'str'
str = ""
sheet.output = [ str ]
sheet.apply( xml_doc )
Additional Resources
For complete details on the REXML parser, please refer to the REXML Parser Documentation.
You can download XSLT4R from the RAA Repository.