
About the disdain for XML among Python programmers

Last December Phillip Eby (PJE) posted a nice rant. It was widely quoted in other Python-oriented weblogs; people especially liked the rant against XML. It was indeed a very nice rant. It still rankled a bit with me, though, even though I've seen similar things before. This disdain for XML technologies is very common among Python programmers. I posted my own rant in response in a comment on another weblog, hardly a place where it will be seen. So, I'll post a new, edited version of my rant in my shiny new weblog, where it has at least a bit more chance of being read. What's the good of ranting if nobody hears you, after all?

I think a lot of the disdain for XML from Python programmers is nothing more than elitism. Misguided elitism. Another part that I suspect plays a role for some people is finding an excuse not to learn more about it: "See, XML sucks, Python is superior, so me not knowing a lot about it is fine." Finally, it ignores that there is more than one way to do the same thing, and each way has its own benefits and drawbacks. There are drawbacks to using XML, and there are drawbacks to using pure Python. Yes, there are misapplications of XML; there are probably very many. There is a lot of chaff. That doesn't mean it isn't applied successfully in very many cases.

The only reasons to use XML according to PJE are:

"If you're not implementing an existing XML standard for interoperability reasons, creating some kind of import/export format, or creating some kind of XML editor or processing tool, then Just Don't Do It."

and

"The only exception to this is if your target audience really really needs XML for some strange reason. Like, they refuse to learn Python and will only pay you if you use XML, or if you plan to give them a nice GUI for editing the XML, and the GUI in question is something that somebody else wrote for editing XML and you get to use it for free. There are also other, very rare, architectural reasons to need XML."

The tone of the text gives the impression that the set of problems listed is a small class of problems that can safely be ignored by smug and superior Python programmers. Nothing could be further from the truth -- the class of problems described here is in fact huge.

A large fraction of IT problems has to do with interoperability; solutions very frequently involve tying a number of systems together. Like it or not, XML is used in a lot of cases where interoperability between disparate systems is required. There are strategic investment concerns (is my data future-proof? Can I hire experts?) and decoupling concerns (I want this Java system to work with this data, and this Python system too, oh, and there's this browser-based Javascript application too). XML can be helpful in these areas, and deciding to use it is in many cases an eminently sensible decision, not a "strange reason".

XML is standardized, cross-platform, cross-framework, and cross-programming language. Many of the surrounding specifications are well-implemented in a diversity of programming languages. Using XML, like using the architecture of the web, is a way to manufacture serendipity, to borrow Jon Udell's phrase. If your data is in XML, suddenly you can apply query languages, validation frameworks and transformation tools that you couldn't have applied if you hadn't thought about XML representations. Suddenly you can use your data in ways you couldn't before.

I can list a few other architectural places where the use of XML can make sense that seem to fall outside of PJE's description, though he intelligently covered himself by saying there are other rare cases where it might make sense. Here are some possible reasons:

Having a standard, neutral content representation like XML can make sense when you need to decouple parts of your application from each other, because you have different teams implementing the different parts, or you're implementing the different parts on very different platforms, or both. In addition, making the content representation explicit in the form of XML may aid the comprehensibility of the system, as the data flow suddenly becomes a lot clearer and can be treated as something separate.

An example: visual template designers work with the XML output of some processing component of the application, developed by quite different programmers. They do not need to learn any APIs that may call deep into the backend application and could be used the wrong way; they only need to worry about extracting the information they need for presentation from an XML structure.

Another place where XML can make sense is if you need a domain-specific language. These often make sense -- they can force people to think declaratively about a problem instead of finding dirty ad-hoc ways out all the time. Using XML can make your life easier, as the parser is already available, along with many other possible tools. Of course sometimes designing your own grammar from scratch may be worth doing, but in many cases it's not necessary and XML is good enough. It's a useful tool in the toolbox in that case.
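To make this concrete, here's a minimal sketch of such an XML-based DSL, parsed with the standard library's ElementTree; the vocabulary here (pipeline, step, retries) is invented for the example -- the point is simply that the parser comes for free:

```python
# A tiny declarative "build pipeline" DSL in XML.  The element and
# attribute names are made up for illustration; no grammar or parser
# had to be designed, since any XML library can read this.
import xml.etree.ElementTree as ET

pipeline_xml = """
<pipeline>
  <step name="fetch" retries="3"/>
  <step name="transform"/>
  <step name="publish"/>
</pipeline>
"""

root = ET.fromstring(pipeline_xml)
steps = [(step.get("name"), int(step.get("retries", "0")))
         for step in root.findall("step")]
print(steps)  # [('fetch', 3), ('transform', 0), ('publish', 0)]
```

The same structure could of course also be queried, validated or transformed with generic XML tools, which is exactly the point of the paragraph above.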

Of course you could do anything you can do with XML with Python data structures, and anything you do with XSLT with custom Python transformation code, and so on. You could even build a framework to do so. It would also be a lot of work, and you'd lose out on all the interoperability advantages. You'd be betting your own mind against that of a lot of experts who have spent years thinking about this set of problems, and your software against a massive installed base. In certain cases going your own way is still worthwhile (someone will have to improve the world), but in many cases it's not.

So, Python people, resistance is futile. Be assimilated by the XML collective. Feel free to be smarter about it than many: stay sensible and stay critical about where XML makes sense and where it doesn't, but throw out the disdain that makes you unable to see valid opportunities for its use. Learn from XML technologies.

PJE is very smart. I am sure his view of it all is actually a bit more nuanced than his tone suggests. Looking at Chandler source code (as I suspect he was doing) explains the outburst pretty well. But don't let it fool any Python programmer into thinking they can safely ignore XML because it is not worth considering by someone with a superior programming language like Python. Most likely, they cannot ignore it, and they shouldn't.

lxml relax NG tweaks

The Relax NG support seemed to be working for lxml, until I tried it with a complicated case: a modularized XHTML Relax NG schema.

Turns out the approach I was taking of turning an ElementTree tree into a Relax NG schema is only of limited use. Relax NG schemas often use include to load other schemas from the filesystem or from URLs, and that wouldn't happen, as by that time any information about where the original XML document came from is lost. I could find no way in the libxml2 APIs to retroactively supply this information -- perhaps I should lobby for its inclusion.

To make it work now, I use a different libxml2 API to load Relax NG from the filesystem directly. You can now supply a file object or path to the RelaxNG constructor.

I suspect the same problem will arise with loading modularized XSLT. I haven't gotten around to investigating that yet.

Update: After some discussions with Daniel Veillard, it turns out my assumptions were wrong, which is good. libxml2 documents do retain the context information as a URL attribute, so it should be able to include the Relax NG modules. It doesn't always work, however: it works when I start the program in the same directory as the modularized RNG files, but it fails if I start it a directory higher. This may indicate a bug in libxml2 or a further lack of comprehension on my side; I'll try to write some sample code and take it up with the libxml2 developers.

Update (05-01-27): I've now tracked this down to a bug in the libxml2 library. My bug report.

Another update, 5 minutes later: Daniel Veillard has already fixed the bug in libxml2 CVS! It turned out that xmlCopyDoc was indeed not behaving as it should.

benchmarks and lxml

The recent cElementTree release is causing some waves in the Python/XML community. It started when Uche Ogbuji posted The Python Community has too many deceptive XML benchmarks to his blog.

The effbot was not amused, as could be witnessed by his comment on it, and the blog entries:

http://online.effbot.org/2005_01_01_archive.htm#sigh
http://online.effbot.org/2005_01_01_archive.htm#faking-it
http://online.effbot.org/2005_01_01_archive.htm#faking-it-2
http://online.effbot.org/2005_01_01_archive.htm#faking-it-3

The problem is that Uche unwittingly introduced a benchmark that is rather... deceptive. He has been measuring the time taken by the whole program, including startup and shutdown of the Python interpreter, module importing, and the like, instead of just the part where the XML processing takes place. Unless you're writing command line scripts or classic CGI web applications, Python startup time is hardly relevant, and shouldn't be part of the measurement.

A while back while developing lxml.etree I was curious what benchmark Fredrik was using. I couldn't find the information on the web, but he told me when I mailed him about it. He was using the simple, obvious strategy which I myself had already been using:

import time

# .. other imports ..

start = time.time()  # time.clock() on Windows
# .. do the actual work ..
end = time.time()
print end - start

To measure approximate memory usage, he puts in a pause in the program before and after the processing, and checks the process overview on his machine manually.
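For what it's worth, a rough programmatic stand-in for that manual check (my own sketch, not Fredrik's method) is to ask the OS for the process's peak memory before and after the work; on Unix, the standard library's resource module exposes this:

```python
# Sketch: approximate the manual "look at the process overview" check
# by reading the process's peak resident set size around the work.
# This is my own stand-in, not the method described above.
import resource

def peak_memory_kb():
    # ru_maxrss is reported in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_memory_kb()
data = ["x" * 1000 for i in range(10000)]  # stand-in for XML processing
after = peak_memory_kb()
print(before, after)  # peak rss only ever grows, so after >= before
```

Since ru_maxrss is a high-water mark, this only tells you about peaks, not about memory freed afterwards, which is why eyeballing the process overview during a pause remains a reasonable technique.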

I've replicated his results with cElementTree and ElementTree fairly well, though my machine is a bit different in its performance characteristics due to platform differences. See other blog entries for more info on this.

For fun, I thought I'd try Uche's benchmark against lxml.etree on this machine. I've also tested it against cElementTree (an older version; I can't keep up with Fredrik's releases, and there's no __version__ string I can find, so I don't know which 0.9.x version it is -- which reminds me to add one to lxml when the time comes for a release).

Here's Uche's program adjusted for etree. As you can see, only the import statement needs to change:

import lxml.etree as ElementTree

tree = ElementTree.parse("ot.xml")
for v in tree.findall("//v"):
    text = v.text
    if text.find(u'begat') != -1:
        print text

I've also rewritten it to use xpath instead:

from lxml import etree as ElementTree

tree = ElementTree.parse("ot.xml")
for text in tree.xpath("//v[contains(., 'begat')]/text()"):
    print text

Since this program is printing stuff, and printing overhead can be large, I've tried a number of tests:

  A. Unix 'time' command, print to stdout on Gnome terminal
  B. Unix 'time' command, redirect output to file
  C. time.time(), print to stdout on Gnome terminal
  D. time.time(), redirect output to file

Here are the results:

                  A      B      C      D
                  --------------------------
cElementTree      1.06s  0.32s  0.9s   0.23s
lxml.etree        1.2s   0.43s  1.1s   0.36s
lxml.etree xpath  0.53s  0.25s  0.42s  0.17s

As you can see from the results, the type of terminal you're printing to matters a lot. In case of the xpath tests, almost half of the time is spent printing to the terminal, and for the other tests the overhead seems to be even more.
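To see how much the output sink alone can matter, here's a small sketch timing identical writes to an in-memory buffer and to the null device; it doesn't reproduce terminal rendering overhead (which is what dominates the figures above), but it shows the sink is part of what you're measuring:

```python
# Sketch: time the same writes against two different output sinks.
import io
import os
import time

lines = ["And Seth begat Enos\n"] * 10000

def timed_write(out):
    start = time.time()
    for line in lines:
        out.write(line)
    return time.time() - start

buffer_time = timed_write(io.StringIO())
null_file = open(os.devnull, "w")
null_time = timed_write(null_file)
null_file.close()
print(buffer_time, null_time)
```

A real terminal would add rendering and scrolling costs on top of either of these, which is why redirecting output to a file changes the benchmark results so much.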

Also note that at last I can claim a minor victory over cElementTree on my machine on this particular test! lxml.etree, when using xpath to do the task set, is faster than this version of cElementTree. Of course, most of the credit goes to libxml2's blazingly fast xpath implementation.

All this shows that benchmarks are nice: there are so many to choose from.

Relax NG support, C14N

Some progress over the last few days:

I've added basic Relax NG support to lxml.

lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can be given an ElementTree object to construct a Relax NG validator:

>>> f = StringIO('''\
... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
...   <zeroOrMore>
...     <element name="b">
...       <text />
...     </element>
...   </zeroOrMore>
... </element>
... ''')

>>> relaxng_doc = lxml.etree.parse(f)
>>> relaxng = lxml.etree.RelaxNG(relaxng_doc)

You can then validate some ElementTree document with this. You'll get back true if the document is valid against the Relax NG schema, and false if not:

>>> valid = StringIO('<a><b></b></a>')
>>> doc = lxml.etree.parse(valid)
>>> relaxng.validate(doc)
1

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = lxml.etree.parse(invalid)
>>> relaxng.validate(doc2)
0

In addition, I've improved the c14n support, so you can produce canonical XML output for any tree:

>>> f = StringIO('<a><b/></a>')
>>> tree = lxml.etree.parse(f)
>>> f2 = StringIO()
>>> tree.write_c14n(f2)
>>> f2.getvalue()
'<a><b></b></a>'

The most awesome development so far is that there's a contributor!

He wrote a patch to support XPath extension functions. I still have to review this, which I will try to do soon.

a little bit more lxml performance tweaking

Today I merged the backpointer branch into the lxml trunk, and have been cleaning up a bit more. In particular I've removed some useless extra subclasses that were only necessary to introduce weak reference support to the various classes; now that the weak reference approach is gone, they served no purpose anymore.

This also resulted in some performance gains! Not very spectacular ones, but still noticeable:

                    nodereg  backptr  cleanups
                    --------------------------

findall('//v')      1.9 s    0.25 s   0.21 s
xpath('//v')        0.76 s   0.21 s   0.19 s
xpath('//v/text()') 0.34 s   0.25 s   0.25 s

Apparently element proxy creation was sped up a notch. This explains why the //v/text() operation was not affected, as only strings are created as a result of that, not element proxies.

Of course cElementTree is still at 0.13 s for the findall('//v') operation, but as you can see the lxml.etree xpath version is not that far off anymore.

lxml performance progress

Such progress a few days can bring. Just last week the lxml.etree performance figures on ElementTree operations like findall lost out badly to pure Python code. So badly, it was pretty embarrassing:

findall('//v') on ot.xml

ElementTree: 0.13 s
cElementTree: 0.11 s
lxml.etree: 1.9 s

All three are using the same findall implementation (in Python), by the way, and they are throughout these tests. The dismal performance shows the slowness of some aspects of the lxml.etree implementation as of last week.

After a refactoring of the way node proxies are maintained, and dumping the whole weak reference idea in favor of a libxml2-to-Python backpointer approach, things are looking a lot better:

lxml.etree: 0.25 s

This is actually following an idea by Jim Fulton in a real life conversation in Vienna a few months back. It'd be depressing to know all these smarter people if it wasn't so much fun. :)

My figure is still not as good as (c)ElementTree, but it shows the overall API has sped up by quite a bit.

So, I just managed to speed up the lxml.etree find operation by over a factor of 7. I suspect the remaining factor of 2 or so will be a lot harder, but it's at least reasonable now.

As a side effect, xpath overhead has also gone down quite dramatically. Recall that the other day it was this:

xpath('//v')

lxml.etree: 0.76 s

Not bad, but could be a lot better. After the work of the last few days, this is the new figure:

xpath('//v')

lxml.etree: 0.21 s

Still not as good as even non-C ElementTree on this operation, but the full power of XPath is available.

Somehow my general work today also sped up other things. I'm still figuring out why this is faster, as wrapper overhead is hardly involved at all:

xpath('//v/text()')

lxml.etree: 0.34 s

And now it's 0.25 seconds!

Finally, on to the parse + xpath overhead combined:

>>> t = parse('ot.xml')
>>> t.xpath('(//v)[5]/text()')
[u'And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\n']

This used to take about 0.25 seconds, 3+ meg parse included. Now it's 0.21 seconds. :)

So, while I'm sure things can be improved somewhat more, lxml.etree doesn't need to be embarrassed about performance anymore. Perhaps we can embarrass Uche Ogbuji into happily eating this statement:

> I know that folks are working on better libxml2
> wrappers, but familiar as I am with the C code,
> I honestly don't believe they can produce
> anything truly Pythonesque without losing all
> the performance gains.

Found on his weblog here: http://www.oreillynet.com/pub/wlg/6224

lxml progress

Since some people seem to be actually reading this and some progress has been made, I thought I'd give a report of what's been happening with lxml.

  • Since last week, I've added a lot more of the ElementTree API, such as the .find() function and friends, by directly using the code from ElementTree.

  • I actually am running the ElementTree and cElementTree test suites now. I still need to disable some tests, but a significant fraction is indeed running.

  • I've improved the way libxml2's parser functionality gets used, in order to implement ElementTree's top-level parse() function.

  • I've added XPath support to lxml.etree! An example of what you can do:

    >>> from lxml import etree
    >>> tree = etree.parse('ot.xml')
    >>> tree.xpath('(//v)[5]/text()')
    [u'And God called the light Day, and the darkness he called Night.
     And the evening and the morning were the first day.\n']
    

    or, say, this, modifying the elements returned:

     >>> result = tree.xpath('(//v)[5]')
     >>> result[0].text = 'The day and night verse.'
     >>> tree.xpath('(//v)[5]/text()')
    [u'The day and night verse.']
    
  • I've added the start of XSLT support to lxml.etree. An example:

    test.xslt
    
    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="*" />
      <xsl:template match="/">
        <day><xsl:value-of select="(//v)[5]" /></day>
      </xsl:template>
    </xsl:stylesheet>
    
    >>> from lxml import etree
    >>> style_xml = etree.parse('test.xslt')
    >>> style = etree.XSLT(style_xml)
    >>> ot = etree.parse('ot.xml')
    >>> result = style.apply(ot)
    >>> style.tostring(result)
    u'<?xml version="1.0"?>\n<day>And God called the light Day, and the
    darkness he called Night. And the evening and the morning were the
    first day.\n</day>\n'
    >>> result.getroot().tag
    u'day'
    

A note about performance. I've been in a mail discussion with Fredrik Lundh, the originator of ElementTree, over the past week, doing some performance comparisons.

libxml2 is fast, but a bit of my thunder was stolen away by Fredrik when he released cElementTree. cElementTree is certainly no slacker either, and in some cases even beats the snot out of lxml. Fredrik deserves plenty of kudos for that. A bit of a bummer for me though. :)

In my measurements (your mileage may vary), cElementTree is about as fast as lxml.etree at parsing jobs. This is on the same benchmark Fredrik has been using. cElementTree is more memory efficient, though lxml.etree is still better in that respect than ElementTree and many other Python/XML tools.

Somewhat to my disappointment, cElementTree and even ElementTree are right now a lot faster at .find() and friends than lxml.etree. Since they all use the same Python implementation, this means that lxml.etree's implementation of the ElementTree API is in some ways quite a bit slower than Fredrik's Python implementation! Thinking about it more, this is not a big surprise, as lxml.etree does a lot of heavy lifting to make sure the underlying libxml2 tree is exposed with an ElementTree API, and in addition has to worry about doing memory management with these structures.

All is not lost, however; lxml has xpath! libxml2's xpath is pretty fast; while slower than (c)ElementTree's .findall() in some cases, it's a lot more powerful as well, being a full xpath implementation.

Finally, XSLT seems pretty fast. In a simple test program, I can do 1000 XSLT transformations in a few seconds, including a reparse of the XSLT stylesheet and of the document to transform, although granted this was done with a small document.

lxml findall and xpath performance

Update: lxml got quite a bit faster since this entry, see here.

I've been testing findall() performance on lxml.etree versus ElementTree/cElementTree. cElementTree and even plain ElementTree are quite a bit faster than lxml.etree at this stage. Possible causes of the performance loss:

  • lxml.etree has to maintain proxy objects over the underlying libxml2 C tree.
  • lxml.etree uses a weak value dictionary to maintain weak references to all proxies in use. This seems particularly slow.
  • There's also a lot of UTF-8 to unicode conversion involved, as libxml2 uses UTF-8 strings throughout, and Python uses double-byte unicode strings.
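That last conversion is easy to see in isolation; here's a trivial sketch (illustration only, not actual lxml wrapper code) of what has to happen on every text access:

```python
# Sketch of the conversion step itself: libxml2 stores all text as
# UTF-8 encoded bytes, which the wrapper must decode into Python
# unicode strings each time text is handed back to the user.
utf8_bytes = b"b\xc3\xa9gat"       # "begat" with an accent, as UTF-8 bytes
text = utf8_bytes.decode("utf-8")  # the per-access decoding work
print(text)
```

Done once this is cheap, but done for every string in a large result set it adds up, which is part of why the string-only //v/text() query below still carries some overhead.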

Unfortunately the Python profilers don't profile C functions called in the extension module, which makes my measuring job somewhat harder.

Anyway:

findall('//v') on ot.xml

ElementTree: 0.13 s
cElementTree: 0.11 s
lxml.etree: 1.9 s

Whoah, not good for lxml.etree to lose out to pure Python that badly!

I also tested libxml2 xpath, which I added to lxml.etree today, and even this is quite a bit slower at simple operations like //v, somewhat to my surprise:

xpath('//v')

lxml.etree: 0.76 s

I think in part the large result set slows it down, as Element proxies have to be created for all elements in it.

As an example of that, this is actually faster (as it only makes strings):

xpath('//v/text()')
lxml.etree: 0.34 s

Of course, xpath is not only about raw performance, but also about features, like this:

>>> t = parse('ot.xml')
>>> t.xpath('(//v)[5]/text()')
[u'And God called the light Day, and the darkness he called Night.
   And the evening and the morning were the first day.\n']

This happens in about 0.25 seconds, and is not something cElementTree can do with its findall(), though I expect the Python equivalent for cElementTree would be quite a bit faster.

Oh well, it was a bit of a bummer that Fredrik released something insanely much faster just as I was finally getting somewhere with lxml.etree.. :)

lxml parser performance

I've been in a discussion with Fredrik Lundh about running his (c)ElementTree parser performance benchmarks against the lxml.etree implementation.

On my work linux/athlon box, with Python 2.3, I get the following figures:

library              time       space
-------------------------------------
ElementTree 1.2.4    1.3 s      14000k
cElementTree 0.8     0.12 s     5500k
etree (trunk)        0.12 s     11200k
readlines            0.08 s     4300k

The memory usage of cElementTree and ElementTree on my box is in the same range as in Fredrik's benchmarks. lxml.etree obviously runs behind quite a bit, and there's little I can do about it, as it's mostly libxml2 memory usage.

Note that these only measure parser performance, not anything else. One benefit that cElementTree gets here is that it constructs Python objects right away, while lxml.etree only does this later. This of course makes Fredrik's figures even more impressive. lxml.etree will have to compete in areas other than parser performance...
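The shape of this benchmark is easy to sketch with the stdlib ElementTree and a synthetic document (ot.xml itself is not reproduced here); the readlines pass serves as the raw-I/O baseline, as in the table above:

```python
# Sketch of the parse-vs-readlines comparison on a synthetic file;
# the real benchmark used ot.xml, which isn't included here.
import time
import tempfile
import xml.etree.ElementTree as ET

# Build a throwaway document with many small <v> elements.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write("<ot>" + "<v>begat</v>" * 10000 + "</ot>")
    path = f.name

start = time.time()
tree = ET.parse(path)            # full tree construction
parse_time = time.time() - start

start = time.time()
with open(path) as src:          # raw I/O baseline
    raw_lines = src.readlines()
readlines_time = time.time() - start

print(parse_time, readlines_time)
```

The gap between the two times is the cost of actually building a tree, which is the part the various libraries compete on.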

lxml.etree is getting there

The lxml.etree implementation of ElementTree, on top of libxml2, is getting there now. It features automatic memory management and quite a bit of ElementTree compatibility. Not all of the ElementTree API has been implemented yet, but enough for many use cases.

I did discover in the process of debugging that you need a recent version of libxml2 to make it all work without memory errors; apparently earlier ones, like the version in my debian unstable (2.6.11), contain some bugs still.

I'm testing with a more recent version of libxml2 myself, so you may want to upgrade too if you want to play with this code. You'll have to modify setup.py to make it use your installation of libxml2 -- the variables to modify are at the top.

So, check it out (svn co http://codespeak.net/lxml/trunk lxml), compile it, and do a 'make test'. And tell me whether the tests pass on your machine!