lxml 0.9.1 released!

A bit late, but still worth a blog entry: lxml 0.9.1 was released last week! This is a bugfix release following up on the great 0.9 release done a bit before.

With the 0.9 release, lxml is really shaping up. It has a ton of features resulting from the input from many people, and most importantly includes a lot of work done by the extraordinarily cool Stefan Behnel who has been driving a lot of lxml development for a while.

lxml 0.9 contains a lot of new stuff: performance improvements, integration with libxml2's error reporting functionality, custom Element classes, SAX support, and XSLT extension functions. And that's only part of the list, most of which has been done by Stefan.

lxml 0.9 was also the first release where we worked with eggs. If you have the right versions of libxml2 and libxslt installed on your system, you can now use easy_install lxml to install the latest version of lxml (0.9.1 right now) and start working with it. Thanks to eggs contributed by a range of people, we've got eggs for Linux, Mac OS X and Windows.

So, check out lxml!

And here's the lxml cheeseshop entry with the various eggs, source code, and a windows installer as well.

lxml and (c)ElementTree

I saw a blog entry by Julien Anguenot praising the ElementTree (and cElementTree in particular) XML processing library and contrasting it with lxml, as in "why didn't I use lxml?". Since I created lxml, I thought I'd chip in and give my perspective on how it relates to ElementTree, and also give some context around Julien's statements about lxml in his blog entry.

First I'll agree fully that ElementTree and cElementTree are great! I'll encourage anyone to use them. What's unclear from the blog entry by Julien is that lxml is actually implementing the ElementTree API as well. While there are differences, this essentially means that you can write code against ElementTree and then later on move it to lxml if you need the added features.
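As a minimal sketch of what "write against ElementTree, move to lxml later" can look like: the code below only uses calls that both libraries share (Element, SubElement, tostring, fromstring, findall), so the import line is the only thing that changes. The catalog data and the function names are made up for illustration, and the fallback import uses the location ElementTree has in the standard library since Python 2.5.

```python
try:
    # Preferred when installed: same core API, more features.
    from lxml import etree
except ImportError:
    # Standard-library ElementTree (bundled with Python 2.5 and later).
    import xml.etree.ElementTree as etree


def build_catalog(titles):
    """Build a small tree using only the shared ElementTree API."""
    root = etree.Element("catalog")
    for title in titles:
        book = etree.SubElement(root, "book")
        book.set("lang", "en")
        book.text = title
    return root


def titles_in(root):
    """Read the titles back with findall, which both libraries provide."""
    return [book.text for book in root.findall("book")]


catalog = build_catalog(["Dune", "Emma"])
serialized = etree.tostring(catalog)    # bytes in both libraries
reparsed = etree.fromstring(serialized)
print(titles_in(reparsed))
```

Code that sticks to this shared subset should not care which of the two libraries ends up being imported.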

lxml has quite a few features beyond plain (c)ElementTree:

  • full XPath 1.0 support.
  • XSLT support.
  • Relax NG support.
  • XML Schema support.
  • parsing and serialization retains namespace prefixes.

and much more. Many XML applications don't need these features. Many do. Julien's application evidently doesn't, which is why it makes sense for him to use ElementTree.
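To make the XPath item concrete, here is a hedged sketch of one query that needs a numeric predicate. With lxml it is a single XPath 1.0 expression passed to the element's xpath() method (the $limit variable is bound through a keyword argument); with plain ElementTree the same filter has to be written by hand in Python. The catalog data and the cheap_titles helper are invented for this example.

```python
try:
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

# Made-up sample document, for illustration only.
CATALOG = b"""<catalog>
  <book><title>Dune</title><price>8.50</price></book>
  <book><title>Emma</title><price>12.00</price></book>
  <book><title>Ulysses</title><price>9.25</price></book>
</catalog>"""


def cheap_titles(root, limit):
    """Return the titles of all books priced below `limit`."""
    if HAVE_LXML:
        # Full XPath 1.0: numeric comparison inside the predicate.
        return root.xpath(
            "book[number(price) < $limit]/title/text()", limit=limit)
    # Plain ElementTree: select the books, then filter in Python.
    return [book.findtext("title")
            for book in root.findall("book")
            if float(book.findtext("price")) < limit]


root = etree.fromstring(CATALOG)
print(list(cheap_titles(root, 10)))
```

Both branches return the same titles; the difference is where the filtering logic lives.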

Now as to Julien's reasons for not using lxml. lxml is indeed somewhat less mature than ElementTree, though it is stable enough to use in production applications. I'd also claim that it's more stable than the Python libxml2 bindings that it replaces, as those make it extremely easy to shoot yourself in the foot. So, if you're using those and you need the featureset of libxml2, you might want to consider switching to lxml.

The dependency of lxml on libxml2 and libxslt is indeed a problem, though of course it's also lxml's greatest strength - that's where the features come from. ElementTree and cElementTree beat it hands-down on being easier to get started with, especially when they get bundled with Python 2.5. Since lxml only benefits from an adoption of those XML processing APIs I'm very happy they do get bundled - people will find lxml when they need it.

While libxml2 and libxslt are very widespread on systems and lxml is starting to be packaged by linux distributions (it's in Ubuntu), it's still a huge dependency that is problematic in some circumstances. It's not that hard to install a newer version of libxml2 and libxslt somewhere if you want to deploy a server application, but it's certainly not making life easier.

Concerning Julien's Zope 3 note: I think that there has been no decision that Zope 3 won't make lxml a dependency; in fact, last I knew, Zope 3 has been looking into adopting lxml as a dependency. It hasn't done so yet, and the libxml2 requirements are an important reason why not. Since Zope 3 is going to turn into a component cloud that's separately installable, all this may become less relevant in a few more Zope 3 releases. The nature of what's core and what's not is going to become more fuzzy.

I don't think it's correct to state that lxml relies on bleeding edge revisions of libxml2 - the 0.8 release relies on libxml2 2.6.16, which was released in late 2004. Newer releases will switch to newer releases of libxml2 eventually, but I wouldn't call our dependency in 2006 on something released in 2004 bleeding edge. Of course, newer versions of libxml2 may have bugfixes (especially in the XML schema validation implementation) and installing them will make lxml benefit from them too.

"Not memory leak free" surprises me a bit as an argument against lxml. We've recently had a few mails on memory leaks - these are the first reports on this that we've received so far. The leaks are plugged in SVN. What we do need is a newer lxml release soon that has these fixes. It's true that lxml has bugs, and it's not surprising there are some bugs with memory as that's the hardest part of wrapping libxml2 properly for Python users, but it's also true that any piece of software has bugs. I hadn't considered that these recent reports were considered to be show stoppers for people like Julien.

iterparse is indeed a very nice feature of ElementTree that I'd love to have added to lxml. No debate about this. Contributors are welcome. :)
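For readers who haven't used it, a minimal sketch of what iterparse does, using ElementTree as bundled in the standard library: it hands you elements while the document is still being parsed, so each one can be processed and cleared instead of keeping the whole tree in memory. The count_items helper and the sample data are invented for this example.

```python
import io
import xml.etree.ElementTree as ET


def count_items(xml_bytes):
    """Count <item> elements incrementally instead of building a full tree."""
    count = 0
    # "end" events fire once an element has been completely parsed.
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "item":
            count += 1
            elem.clear()  # drop the element's text and children once handled
    return count


DATA = b"<root><item>a</item><item>b</item><item>c</item></root>"
print(count_items(DATA))
```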

Considering performance, I'll be honest. Parsing speed is about equivalent between the two. In the pure API of accessing the tree, cElementTree is faster than lxml. That's probably not going to be noticeable in most applications, but it will matter in some. Then again, lxml offers access to libxml2's blazing fast XPath implementation, so if you're going to search into a tree lxml can keep up with cElementTree just fine, and lxml will occasionally be faster. In general, the features such as XSLT, directly implemented in libxml2/libxslt, are going to be fast.

So, pick lxml if you need the features it adds above cElementTree, and if you need performance in those features. Don't pick it because it's faster than cElementTree at core ElementTree operations, as it's not (though it's certainly not a slowpoke).

Another problem lxml has is that it has a lot of new features and fixes in SVN that should be in a release. We need to get working on a release!

Guido and XML

I think Guido's post on XML is a good occasion to point again to my rant about the disdain for XML among Python programmers, posted almost exactly a year ago on this blog.

I don't think Guido is right on the no XML in my template issue. I don't think he's wrong either. There are about as many ideas about what a templating language should be like as there are Python programmers, and Guido is, obviously, being a python programmer in this. Guido's rejection of XML for his own templating needs should not lead python programmers into believing that they can look at XML technologies in general with disdain.

Yes, XML is often misapplied. It has problems. That doesn't mean XML technologies are always useless, that they should never be considered in any software project, or that XML technologies can be safely ignored.

Empowering the power users

In my last article I talked about the problems that occur on the borderland between content and software, but didn't give enough examples. I figured I'd add some more text about this very important topic.

One response to the problem I describe is to treat this as an either/or choice. You could say: either give someone all the power in the world, and deal with the maintenance risk, or not give it at all. If those are the only choices, then I'd give all the power to programmers only, and no programming powers to end users at all.

But it's not an either/or choice. The problem occurs when you do want to give non-programmers access to software development facilities, so they can tweak and customize a system. You want to, as these power users, customizers, can be very successful creating the things they need. But you don't want to, because what they create tends to be hard to maintain and develop further.

There's a kind of user that's not a programmer and not an end-user. This class of user has traditionally been very important to the Zope community, and has attracted many people to Zope. This class of user wants to do more than an end-user, but they can create trouble, for themselves, for the system, and for those maintaining the system. They're application scripters, site designers, site maintainers, power users.

I think the current Zope 2 ZMI gives this class of user too much raw programming power. Sure, there are security restrictions, but people can build whole web applications with the ZMI as well. Very often, these people are not building a new application at all: they just want to tweak a bit on top of an existing system. But then those tweaks tend to grow in hard to maintain ways rather easily.

How to make the scripter's life easier while not making life too hard on the programmers? There might be focused systems that can make scripters effective. I promised examples.

Formulator. It's a simple form design system. People can develop forms in a web interface. Empowered lots of people. Also gave me (and others at Infrae) lots of problems as Formulator solves only half of what people really want to do -- creating and validating the forms, but not actually doing something with the form input. What people really want to do is send an email, create a database record, or create a new content object. If they want to do that, they suddenly have to grapple with rather a lot of detail, and I'm among those telling them that really, if they write code, they should do it on the filesystem and check it into a version control system. But what if there was a system that let them do some of these tasks without exposure to software development?

Another example is site layout and navigation tweaks. Often, when a site is being built, or is small, it's maintained by someone who can tweak page templates, get quite far that way, but isn't a professional programmer. They use simple tools and tend to work through the ZMI.

Perhaps there is a better way to let power users tweak without driving them to copy large page templates, or large amounts of page templates. One answer to this is a limited, domain specific language: CSS. I don't think CSS is all of the answer -- sometimes server side tweaking is necessary. There should be a natural way for scripters to tweak things, empowering them to do what they want, but also without giving programmers a lot of pain if they need to maintain this eventually.

We know of other domain specific languages with less power than the full Python but very focused to particular tasks, such as workflow description languages.

I don't think the full answer is "just put a graphical user interface on it and it'll be easy for the non-programmers". For a while when people complained about the complexity of ZCML in Zope 3, it was sometimes said, oh, we'll build a through the web UI and your life will be easy then. But that's too easy an answer. Python Scripts have a UI in the ZMI, but they can contain a lot of complexity. Philipp, with his article ZCML needs to do less, is thinking about the real problem.

You first need to think about what the UI needs to express before you can design the right UI, graphical or anything. I suspect that for tasks on the borderland between software and content, often a domain specific language is the best UI, or at least the place to start thinking about a graphical UI. I suspect it's often easier to design a UI for a domain specific language than to extract a domain specific language from an existing UI afterward. I did the latter with Formulator XML; and the resulting language is imperfect. Domain specific languages breed conceptual integrity better than UIs.

So what do I want to do? I want to allow people to tweak just a single page template without introducing fragility and maintainability issues to the whole system, without forcing them to have to learn about the system's complexities. I want to look for ways to make ZCML easier. I want to look for ways to express ZCML-like things through the web. I want to try applying domain specific languages more. I want to investigate simpler scripting APIs.

I can't do all of this, or even any of it, on my own, so what I'd like most of all is for the Zope community to take a step back for a bit and consider all this from a fresh perspective, so we can avoid falling into patterns only because we're familiar with them, not because they're better. And of course, I'm not preaching revolution, just a reconsideration of the ideals we'll try to evolve towards, in small steps.

The borderland between content and software

Paul Everitt writes:

In the early days of Zope, you could design content "TTW" (through the web). You could answer questions about structure and suddenly, you had new kinds of content -- YOUR content -- that could be added to folders in the system. No programmers were involved, no special login permissions on the server, no database schemas to update.

First, a somewhat snarky question: Paul, if the olden days were so great, why did the Zope community move on from them? I'll go into what I see as some reasons why.

Anyway, I agree that this is a scenario we should support. I also am a software developer with many years of experience developing with Zope, so I know the trouble that this scenario can cause.

Paul is talking about the interesting dividing line between content and software, and between content authors and programmers. Let's keep it clear and call this the borderland, as calling it just software or just content warps our perspective -- we need to see it as both.

Paul also writes:

Alas, later in the history of Zope, the component folks decided that TTW was grotty and should be banished. There were good reasons for this...from their perspective.

I think these reasons are important to make explicit, and should not be swept under the rug, so let me describe some reasons from my personal perspective.

Paul attributes this development to the "component folks". I could be considered one of them, and they're Paul's favorite group of programmers that make his life complicated, but I think Paul will agree that the trend away from TTW was far broader than that. I think that this trend away from TTW to Product development is clear in Zope 2 as a whole, component folks or not, for the last five years or so. The question I asked in the beginning applies again. Why did this happen? Why'd we move away from the apparently paradisiacal state Paul describes?

At Infrae, we have a lot of experience with people, end users and scripters writing software/content through the web, using facilities in Zope, Silva, and Formulator. While this is indeed very empowering for the end-user or scripter, it can also lead to an enormous increase in maintenance burden.

This gets noticed by the software developers and system administrators who will have to do the occasional maintenance (such as upgrades). They will have to deal not only with Silva and Zope (or Plone and Zope), but also the software/content that grew around it. This increases the complexity enormously. We can't do away with this extra code that grew around it; it's necessary for deployment of the software in that particular production setting. But we'd sure as hell wish it'd been more maintainable.

Programmers have tools and patterns to manage the complexity of software development and deployment. Editors and IDEs, test driven development, version control systems, release numbering, deployment tools, the works. These things help manage the complexity if an experienced programmer is using them, but a non-programmer or scripter doing development will not use them, and the programmer does often end up having to maintain code created that way.

As a side-discussion but exemplifying the pain: these tools work with the filesystem, but typically don't work with code developed through the web, such as in the ZODB. In the Zope community we've seen lots of attempts to make through the web work together with filesystem tools, with varying success, but I think nobody denies that this is a major pain. A general trend was to develop this stuff as code on the filesystem, as that made life a lot easier. Through the web development helps a class of users, but it can also be seen as a hindrance to the adoption of Zope as a software development platform by software developers. This is also an important audience to us.

So, letting users, customers, develop software as content is an extremely powerful concept. It's one we should support. It's also very dangerous.

The challenge is in reconciling the two. How do we empower users to develop things on the borderland of software and content, while not creating maintainability nightmares? People administrating large systems, as well as professional software developers, as many of us are in the Zope community, need a solution that answers that question.

I don't want to have to deal with solutions anymore that don't, as they're not complete solutions to me. They tend to shift or even increase the complexity on the longer term, something my company may end up paying for, as how do you explain this to a customer? The website works, right? Why is maintaining it so costly?

Let's try to figure this out. It'll likely take some smaller, careful, steps. Ideas are welcome.

Five-based i18n in Silva checked in (PTS Delenda Est)

Last summer the Five project pulled Zope 3's i18n architecture into the Zope 2 world, thanks to work done by Philipp von Weitershausen, Lennart Regebro and others (please forgive me if I forget someone!).

And yesterday, Philipp posted a useful article on his site describing how to use it.

He probably didn't expect that Silva was going to switch over to this codebase the next day! I ran into yet another unicode error in Silva today when I was hacking on it. It was the last straw. PlacelessTranslationService's wild hackery had been screwing up Silva's unicode support long enough... Like Carthago of old to the ancient Romans, PlacelessTranslationService Delenda Est: it must be destroyed.

Thanks to the great work done by the Zope 3 developers, the Five developers, and the nice article Philipp wrote about it, a few hours later PlacelessTranslationService had been purged from Silva. Lots of custom cruft has been removed from Silva, and removing code while retaining functionality is always good.

This is part of a general trend in the Zope world; people are throwing away their custom code and can start to rely on cleaner solutions developed lower down the framework. This allows for smooth evolution and convergence between frameworks. It increases the size of communities using the same codebase, which makes it more likely people will work on it to improve it. I believe that the Zope community, with Five and other efforts, is only beginning to tap into the full potential of this pattern.

Tramline source code now available

At the Plone conference 2005 I gave a lightning talk about tramline, a lightweight upload and download accelerator for web applications. Now at last I've found some time to put the source code online. This is not a proper release yet, but it's there for interested people to take a look at it.

What is tramline about? From the readme:

Tramline is an upload and download accelerator that plugs into Apache, using mod_python. Its aim is to make downloading and uploading large media to an application server easy and fast, without overloading the application server with large amounts of binary data.

Tramline integrates into Apache using mod_python. The application server is assumed to sit behind Apache, for instance hooked up using mod_proxy or mod_rewrite.

Tramline takes over uploading and downloading files, handling these within Apache. Only a small configuration change in Apache should be necessary to enable tramline.

The application server remains in complete control over security, page and form rendering, and everything else. Minimal changes are necessary to any application to enable it to work with tramline; in fact it's just setting two response headers in a few places in the code.

Tramline is generic code, but is particularly useful for Zope applications. Zope's object database, the ZODB, has one drawback: it doesn't scale very well when large binary files are put into it. In addition, many appservers have only limited resources available to handle large upload or download processes. Tramline works around both issues by letting Apache and the filesystem handle both.

Pay careful attention to the installation instructions. Tramline currently needs a one-line patch in mod_python (a Python file so that's easy, thankfully). It also needs the latest version of Apache 2.0 (2.0.55), which was released last month.

Here's where to find the source:

http://codespeak.net/svn/rr/tramline/trunk

If you want to give feedback or help out, please do! We have a mailing list, here:

http://codespeak.net/mailman/listinfo/tramline-dev

Update: credit where credit's due: this is of course a project I built for Infrae, like almost all of the code I write. Tramline was conceived by myself and Jan-Wijbrand Kolman.

Zope and scaling down

Ian Bicking posts about what he perceives as a focus of Zope 3 on modeling up-front:

Good development in the beginning means deferring choices as much as possible and focusing on results instead of abstractions. Abstractions should emerge from your functional goals, and if you spend a lot of time modeling in the beginning then you've made premature choices and designed code that you don't yet understand. You haven't just wasted time, you've introduced a liability.

I agree completely with this view of software development. This is how I try to develop software, learned through quite a bit of experience, just like Ian, I'm sure. And luckily enough, it's perfectly possible to do such a style of iterative development on top of Zope 3. I'm not sure what gave Ian the impression that you can't.

At Infrae we've been doing this for a few months now though, and the application we've worked on definitely evolved, sometimes quite drastically, in the face of customer feedback and a coming into focus of sometimes vague requirements. Since Zope 3 tries to stay out of the way of your Python code, you can refactor like you'd do with any piece of Python software. In fact, I talked about just this in an older post:

Jeff Shell already mentioned that Zope 3 makes it easier to build an extensible framework while actually building something useful for a customer; Zope 3 gives a lot of flexibility and extensibility right out of the box without much effort for the application developer. This I think is great news for the long term maintainability and extensibility of Zope 3 applications.

In addition, I can say that extraction of reusable code from Zope 3 projects into reusable libraries is much, much nicer than doing it in Zope 2. That doesn't mean it's actually easy; writing reusable code is always hard, but it's now much more doable. This is one of the coolest things about Zope 3.

[snip]

Framework extraction from practical applications is often the best way to build truly useful reusable components, so Zope 3's vastly improved extractability of reusable components is great news.

See also Jeff Shell's post.

Perhaps Ian gets the wrong idea because of Zope 3's focus on the concept of interfaces. While interfaces are common in the Zope 3 framework (which needs to be pluggable and flexible), Zope 3's component system allows you to register adapters and views for classes just fine. There's no need to start designing lots of interfaces right from the start. It's convenient to define (and evolve) your data layout of some content object in a schema, but I'm sure Ian as the creator of SQLObject wouldn't object too strenuously to this.

Or perhaps Ian gets the idea indeed from my comment on hello world. I used the word "small", but I meant really small; the context was "hello world". Zope 3 doesn't scale down well enough to tiny web applications of the "hello world" scale; the overhead of ZCML and such feels too big then. We need to do work there.

But as soon as there's enough code for your web app to be useful to do anything, the overhead of ZCML and interfaces quickly shrinks. I'm not saying it couldn't be improved further, but to say that Zope 3 does big-design up-front because "hello world" is slightly more difficult than it should be is a bit of a stretch. We're still programming in Python, after all!

Towards a common structure of Zope 3 extensions

Quite a few Zope 3 extensions are starting to appear. This is great. There is all the great work done within the Z3ECM svn repository. There's Infrae's hurry library of little Zope 3 odds and ends. Then there are various Zope 3 extensions written by Zope corporation, such as zc.catalog and zope.formlib. There's also various work done in the Zope 3 base svn repository.

Various patterns are emerging in the way these extensions are structured. I want to suggest a common pattern we all adhere to, and the reasons why. My aim is to suggest a common Pythonic structure, so that we don't do our homegrown Zope thing.

Warning

The word "package" in this text means what you check out from SVN. It's also what you can distribute to others in a tarball. It's what linux distributors use to create their distribution packages. It's also what you can use to create a Python egg (see more later).

The word "python package" in this text means what is importable in Python. It has the __init__.py. It's like a Python module, but bigger.

These packages are not necessarily identical, and in fact I'll argue they shouldn't be identical. When I say "package" I mean the former distribution package, when I say "python package" I'll mean the latter.

Why is a common package layout important?

Developers know where to look when they start using a new package. Distributors and packagers (such as linux distributors) know where to look and what to do. System administrators need to know only a single trick to install Python packages into Zope, not a different one for each package. Eventual metadistributions like what Zope 3 ECM may become will be easier to build.

Furthermore, distutils is now the standard for python packages. This involves a setup.py script in the package root which can be used to build and install Python packages. It can also be used to create distributions of Python packages. Distutils presumes having a place where the setup.py can live.

Recently Phillip Eby has been doing a lot of great work with Python eggs, setuptools and easy_install. Briefly:

  • eggs make it easy to distribute Python packages to be installed. They also handle dependencies.
  • setuptools makes it easy to create eggs. It also makes it easy to upload our package into cheeseshop.python where other Python developers can find it.
  • easy_install makes it trivial to install packages and the dependencies by typing a one liner.

We need to structure our Zope 3 packages so they're easy to use with eggs. I expect Zope 3 core will start using eggs pretty soon, so let's prepare our extensions.

Package namespaces

Some packages create their own Python package namespace (hurry, zc) by utilizing a namespace package with an empty __init__.py. Others expect to live within the zope namespace of Zope 3, probably in the hope that this package will one day be core. Some packages just sit in the top level, creating a new namespace all for themselves, other packages cohere under a common namespace.

Recommendations:

  • Being in core is not so important with Zope 3. It's a flexible system. We'll distribute collections of packages, probably using eggs to handle dependencies. Don't use zope as a top level package name unless you're really developing the package inside svn.zope.org/Zope3. zope makes it harder to install as a normal Python package as you need to hack the zope hierarchy and mess about with symlinks. python setup.py install becomes impossible. If your package enters core, you'll probably do more changes anyway that break compatibility than just changing the package namespace.
  • Use a top level namespace package. So, I didn't call my query package query, as I imagine there are other python modules called that way. Instead I used a top level namespace called hurry and put it there. There's probably nothing else that's imported as hurry.query in the Python world.
  • Try to cohere multiple related packages under a shared toplevel namespace package. I've tried to do this with hurry, which has hurry.file, hurry.workflow and hurry.query. This is also to prevent namespace pollution.
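A rough sketch of how a shared top level namespace can work when the related packages are distributed separately. Two throwaway distribution directories are created in a temporary location, each holding its own slice of the hurry namespace. Note the assumption: instead of the empty __init__.py mentioned above, each hurry/__init__.py here uses the standard library's pkgutil.extend_path, which is one way to let several directories on the path contribute to the same Python package. The directory names, the helper function and the NAME attribute are made up for illustration.

```python
import importlib
import os
import sys
import tempfile

# Contents of each hurry/__init__.py: merge every "hurry" directory
# found on sys.path into one package search path.
NAMESPACE_INIT = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_distribution(root, dist_name, namespace, subpackage, body):
    """Create <root>/<dist_name>/<namespace>/<subpackage>/__init__.py."""
    dist_dir = os.path.join(root, dist_name)
    pkg_dir = os.path.join(dist_dir, namespace, subpackage)
    os.makedirs(pkg_dir)
    with open(os.path.join(dist_dir, namespace, "__init__.py"), "w") as f:
        f.write(NAMESPACE_INIT)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(body)
    return dist_dir


root = tempfile.mkdtemp()
for dist, sub in [("hurry.query", "query"), ("hurry.workflow", "workflow")]:
    # Placeholder module body; a real package would hold actual code.
    dist_dir = make_distribution(root, dist, "hurry", sub, "NAME = %r\n" % sub)
    sys.path.insert(0, dist_dir)
importlib.invalidate_caches()

import hurry.query
import hurry.workflow

print(hurry.query.NAME, hurry.workflow.NAME)
```

Both subpackages import under hurry even though they live in different directories, which is what keeps related packages together without polluting the top level.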

Structure of a package

Some packages conflate the concept of distribution package and Python package. Thus, the Python modules are just in the top level of the distribution package, which has an __init__.py.

This is not good if you want your package to be released to the world, or possibly even be picked up by a Linux distributor. When I download a release tarball of some interesting Python extension, I expect to be able to unpack it, and not find all the source right there. No, I expect a nice README.txt, an INSTALL.txt, a setup.py, and perhaps a testrunner and a doc directory. I don't want to be bothered with lots of files of the source code itself.

The source code, that which ends up being importable somehow, that which ends up on the PYTHONPATH somehow, is in a separate subdirectory. This is often called src, like with Zope 3. An alternative structure also frequently used and useful if your package will have everything in a single Python namespace package anyway is to make this Python namespace package the immediate subdirectory.

It's actually the layout of Zope 3 SVN. It's also the layout of, say, Twisted, and PEAK, and CherryPy, and many, many other Python packages.

By using such a structure, it's trivial to create a simple release: you just do an svn export, tar it up, and you're done. It also becomes easy to create eggs, and the like.

Recommendations:

  • split your source code away from your top level distribution package
  • put your Python packages either in a subdirectory called src, or put your single namespace package directly in a subdirectory with the name of your Python namespace package (twisted).
  • Put in a README.txt and a LICENSE.txt at the very least.
  • Strongly consider putting in a setup.py.
  • Let's all investigate eggs and make our own packages work with them.
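The recommendations above can be sketched as a small script that writes out the suggested layout into a temporary directory. This is only an illustration under a few assumptions: the package name hurry.query is borrowed from the examples earlier in the text, the version number and file contents are placeholders, and the generated setup.py uses setuptools with the src subdirectory variant of the layout.

```python
import os
import tempfile

# Text of a minimal setup.py for the "src" layout (placeholder metadata).
SETUP_PY = '''\
from setuptools import setup, find_packages

setup(
    name="hurry.query",
    version="0.1",
    package_dir={"": "src"},
    packages=find_packages("src"),
)
'''

# Distribution package at the top, importable Python packages under src/.
LAYOUT = {
    "README.txt": "What this package is for.\n",
    "LICENSE.txt": "License text goes here.\n",
    "setup.py": SETUP_PY,
    "src/hurry/__init__.py": "",
    "src/hurry/query/__init__.py": "",
}


def create_layout(root):
    """Write the layout below `root` and return the sorted top level entries."""
    for relpath, content in sorted(LAYOUT.items()):
        path = os.path.join(root, *relpath.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(content)
    return sorted(os.listdir(root))


top_level = create_layout(tempfile.mkdtemp())
print(top_level)  # ['LICENSE.txt', 'README.txt', 'setup.py', 'src']
```

The top level holds only the documentation, the license and setup.py; there is no __init__.py there, so the distribution package and the Python package stay separate.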