the why of lxml

the why of lxml

Today I read an article about libxslt on O'Reilly's xml.com. It demonstrates the power of libxslt; it's a cool library. It also demonstrates why I wrote lxml: writing Python code that correctly uses libxml2/libxslt's bindings directly is difficult.

The example in the article goes like this:

# xsltprocs.py: send an XML source document through a
# pipeline of multiple XSLT stylesheets.

import sys
import libxml2
import libxslt

args = len(sys.argv)

if args <  3:
    print "Pipeline an XML document through a series "
    print "of XSLT stylesheets. Usage:\n"
    print "  xsltprocs.py infile.xml stylesheet1.xsl   [stylesheet2.xsl...]"
    sys.exit(0)

sourceXMLFile = sys.argv[1]
sourceDoc = libxml2.parseFile(sourceXMLFile)

for xsl in range (2,args):
    # Read in stylesheet.
    styleDoc = libxml2.parseFile(sys.argv[xsl])
    style = libxslt.parseStylesheetDoc(styleDoc)
    # Apply stylesheet to sourceDoc, save in result.
    result = style.applyStylesheet(sourceDoc, None)
    # Result becomes new sourceDoc in case we send it
    sourceDoc = result   # through another stylesheet.

print result

style.freeStylesheet()
sourceDoc.freeDoc()

What it does is pipe a single XML document through multiple phases of XSLT transformation. It works, though with my version of libxml2 think the last line should say:

print result.serialize()

as otherwise you don't get the proper XML output as expected. Better yet, it should be serialized through the last XSLT sheet's serialization functionality as it may have things to say about the serialization process.

It however has a memory bug. It doesn't matter in this context, as it's just a script, but it might start to matter quickly in a long-running process. What happens is that at the end of the script, the document and the XSLT sheet are cleaned up manually, but the intermediate results or stylesheets never are.

It's an easy mistake to make. Python programmers aren't supposed to have to worry about manual memory management. I rewrote the script to use lxml:

# xsltprocs.py: send an XML source document through a
# pipeline of multiple XSLT stylesheets.

import sys
from lxml import etree

args = len(sys.argv)

if args <  3:
    print "Pipeline an XML document through a series "
    print "of XSLT stylesheets. Usage:\n"
    print "  xsltprocs.py infile.xml stylesheet1.xsl [stylesheet2.xsl...]"
    sys.exit(0)

sourceXMLFile = sys.argv[1]
sourceDoc = etree.parse(sourceXMLFile)

for xsl in range (2,args):
    # Read in stylesheet.
    styleDoc = etree.parse(sys.argv[xsl])
    style = etree.XSLT(styleDoc)
    # Apply stylesheet to sourceDoc, save in result.
    result = style.apply(sourceDoc)
    # Result becomes new sourceDoc in case we send it
    sourceDoc = result   # through another stylesheet.

print style.tostring(result)

This doesn't look much simpler than the pure libxml2/libxslt example (more involved examples would), but as you see the memory management logic is gone, as lxml takes care of this automatically. Moreover, the memory management logic is correct, or that's a bug in lxml.

Comments

Comments powered by Disqus