Easy Tutorial
❮ Python File Io Python Command Line Arguments ❯

Python3 XML Parsing


What is XML?

XML stands for eXtensible Markup Language, a subset of Standard Generalized Markup Language, designed to markup electronic documents to give them structure. You can learn more about XML through our XML Tutorial.

XML was designed to transport and store data.

XML is a set of rules defining semantic markup, which divides documents into many parts and labels these parts.

It is also a meta-markup language, defining the syntax for other domain-specific, semantic, and structured markup languages.


XML Parsing in Python

Common XML programming interfaces include DOM and SAX, which handle XML files differently and are used in different scenarios.

Python has three methods for parsing XML: SAX, DOM, and ElementTree:

1. SAX (Simple API for XML)

Python's standard library includes a SAX parser. SAX uses an event-driven model, triggering events and calling user-defined callback functions during XML parsing.

2. DOM (Document Object Model)

Parses XML data into a tree in memory, allowing manipulation of the XML through tree operations.

The XML example file movies.xml used in this chapter is as follows:

Example

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A scientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Using SAX to Parse XML in Python

SAX is an event-driven API.

Using SAX to parse XML documents involves two parts: parser and event handler.

The parser reads the XML document and sends events to the event handler, such as element start and element end events.

The event handler responds to these events, processing the XML data passed.

To use the SAX method in Python, you need to import the parse function from xml.sax and ContentHandler from xml.sax.handler.

ContentHandler Class Methods

characters(content) method

Invoked when:

The tag can be a start tag or an end tag.

startDocument() method

Called when the document starts.

endDocument() method

Called when the parser reaches the end of the document. startElement(name, attrs) Method

Called when an XML start tag is encountered. The name parameter is the tag's name, and attrs is a dictionary of the tag's attributes.

endElement(name) Method

Called when an XML end tag is encountered.


make_parser Method

The following method creates a new parser object and returns it:

xml.sax.make_parser([parser_list])

Parameter Description:


parse Method

The following method creates a SAX parser and parses an XML document:

xml.sax.parse(xmlfile, contenthandler[, errorhandler])

Parameter Description:


parseString Method

The parseString method creates an XML parser and parses an XML string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Parameter Description:


Python XML Parsing Example

Example

#!/usr/bin/python3

import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Element start event
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Element end event
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Character reading event
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content
self.year = content
elif self.CurrentData == "rating":
   self.rating = content
elif self.CurrentData == "stars":
   self.stars = content
elif self.CurrentData == "description":
   self.description = content

if __name__ == "__main__":

   # Create an XMLReader
   parser = xml.sax.make_parser()
   # Turn off name spaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # Overwrite ContextHandler
   Handler = MovieHandler()
   parser.setContentHandler(Handler)

   parser.parse("movies.xml")

The above code produces the following output:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A scientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom

For the complete SAX API documentation, please refer to Python SAX APIs.


Parsing XML with xml.dom

The Document Object Model (DOM) is a programming interface recommended by the W3C for processing extensible markup language standards.

A DOM parser, when parsing an XML document, reads the entire document at once and stores all elements in a tree structure in memory. You can then use various functions provided by the DOM to read or modify the content and structure of the document, and also write the modified content back to the XML file.

In Python, xml.dom.minidom is used to parse XML files. Here is an example:

Example

#!/usr/bin/python3

from xml.dom.minidom import parse
import xml.dom.minidom

# Use the minidom parser to open the XML document
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print("Root element : %s" % collection.getAttribute("shelf"))

# Get all movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detailed information for each movie
for movie in movies:
   print("*****Movie*****")
   if movie.hasAttribute("title"):
      print("Title: %s" % movie.getAttribute("title"))

   type = movie.getElementsByTagName('type')[0]
   print("Type: %s" % type.childNodes[0].data)
   format = movie.getElementsByTagName('format')[0]
print("Format: %s" % format.childNodes[0].data)
rating = movie.getElementsByTagName('rating')[0]
print("Rating: %s" % rating.childNodes[0].data)
description = movie.getElementsByTagName('description')[0]
print("Description: %s" % description.childNodes[0].data)

The program execution results are as follows:

Root element: New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A scientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

For the complete DOM API documentation, please refer to the Python DOM APIs.

❮ Python File Io Python Command Line Arguments ❯