Python 2/3 Compatible Source and PyXB

I started work on PyXB just over five years ago. At the time, Python 3.0 had just come out, but was far too new to hassle with, so I made Python 2.4 the minimum required version.

In September 2011 people started to hint they’d like Python 3 support, but it looked like it’d be an awful lot of work, and nobody asked officially, so I just kept it in the back of my mind. In June 2012 the noise was getting harder to ignore, so I logged the request but didn’t take it further.

Over the next year or so PyXB’s unicode support got stronger, and I started understanding exactly how much easier it’d be to do XML with a proper distinction between text (i.e., unicode) and data (i.e., octet sequences). Python 2 did this poorly, but the difference is deeply embedded in Python 3. In September 2013 I finally created a branch for Python 3 off the 1.2.3 release. This involved running 2to3 over the source, then running a second script to fix the resulting errors. This was good enough to make available for folks who could build from the repository, but I couldn’t support packaging a version because converting the source was too complex to run on an end-user’s machine.
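To make the text/data distinction concrete, here is a minimal sketch. The element content is made up for illustration, and the `unicode_literals` import is only there so the same unprefixed literal is text under Python 2.6+ as well as Python 3.

```python
from __future__ import unicode_literals

# Text: a sequence of Unicode characters.
text = '<name>Zo\u00eb</name>'

# Data: the octets that one particular encoding produces for that text.
utf8_data = text.encode('utf-8')
latin1_data = text.encode('iso-8859-1')

# Same text, different data: U+00EB takes two octets in UTF-8, one in Latin-1.
assert len(text) == 16
assert len(utf8_data) == 17
assert len(latin1_data) == 16

# Decoding only round-trips when the encoding is known.
assert utf8_data.decode('utf-8') == text
assert utf8_data.decode('iso-8859-1') != text
```

An XML parser has to make exactly this step from octets to characters before it can do anything else, which is why a language that keeps the two types apart makes the job easier.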

While investigating an installation problem that ultimately turned out to be a bug in pip, I discovered six. Six is a single module, released under the MIT license, that can be integrated into a Python package to allow the same source code to work under both Python 2 and Python 3. No more running 2to3. No more fixing up the mess 2to3 makes when it changes pyxb.utils.unicode to pyxb.utils.str.
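The idea behind that kind of shared source can be sketched without six itself. The names PY3, text_type and binary_type below mirror ones that six provides, but this is a hand-written stand-in rather than six's code, and to_text is a hypothetical helper for illustration only.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str        # Unicode text
    binary_type = bytes    # octet sequences
else:
    text_type = unicode    # noqa: F821 (this name only exists on Python 2)
    binary_type = str


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is octets.

    Hypothetical helper; not part of six or PyXB.
    """
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return text_type(value)
```

The rest of the code base then refers to text_type and binary_type instead of str or unicode, so nothing needs to be rewritten per interpreter.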

As of today, the next branch of PyXB passes all tests using Python versions 2.6 up through 3.4.0rc1 without source-code changes. Well, ok, some unit tests fail because whitespace in formatted XML changed in 2.7; the unittest.TestCase.assertRaises context manager feature isn’t handled in 2.6, 3.0, or 3.1; and I haven’t tested 3.0.1 because hacking its configure script so it can build a functional hashlib module on Ubuntu 12.04 isn’t worth the effort. Nonetheless, PyXB itself works fine.
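For tests that must run on those older interpreters, the callable form of assertRaises avoids the context manager altogether. A small sketch, where parse_positive is a hypothetical function under test and not anything from PyXB:

```python
import unittest


def parse_positive(text):
    """Hypothetical function under test: int(text), rejecting values <= 0."""
    value = int(text)
    if value <= 0:
        raise ValueError('not positive: %r' % (text,))
    return value


class ParsePositiveTest(unittest.TestCase):

    def test_accepts_positive(self):
        self.assertEqual(parse_positive('7'), 7)

    def test_rejects_zero(self):
        # Callable form: pass the exception type, the callable, and its
        # arguments instead of wrapping the call in a with statement.
        self.assertRaises(ValueError, parse_positive, '0')


# Run the two tests and keep the result object for inspection.
suite = unittest.TestLoader().loadTestsFromTestCase(ParsePositiveTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```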

There’s more work to be done. A packaged PyXB includes generated bindings for about 186 namespaces. When building from the repository those can be generated with the same Python that’ll be running them, but packaged bindings are generated once and have to load everywhere. They might include Unicode literals with the u prefix (u'text'), which Python 3 didn’t accept until version 3.3, so those bindings would fail under 3.0 through 3.2. But the big hurdle has been overcome, and the next PyXB release should support all Python versions from 2.6 onward.
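One way a code generator could sidestep the prefix problem is to emit unprefixed, ASCII-only literals and rely on `from __future__ import unicode_literals` at the top of each generated module. The sketch below illustrates that idea only; portable_literal is a hypothetical helper, and this is not necessarily how PyXB's generator handles it.

```python
from __future__ import unicode_literals

import sys


def portable_literal(text):
    """Return source for a text literal that carries no u prefix.

    Hypothetical helper, not part of PyXB. Under Python 2 the result is
    only a text literal if the generated module itself starts with
    "from __future__ import unicode_literals".
    """
    if sys.version_info[0] >= 3:
        literal = ascii(text)   # ASCII-only, with escapes, no prefix
    else:
        literal = repr(text)    # same escapes, but with a leading u
    # Drop the Python 2 prefix; Python 3 output starts with a quote.
    if literal.startswith('u'):
        literal = literal[1:]
    return literal


source = portable_literal('Zo\u00eb')
# Evaluating the generated source gives back the original text.
assert eval(source) == 'Zo\u00eb'
```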

Editing XML schemas on Emacs with nXML

While looking into DocBook recently, I discovered that GNU Emacs finally has a high-quality XML editing mode that includes validation of the documents. nXML mode is integrated into Emacs 23, and it comes with RELAX NG grammars to support DocBook editing, though only for DocBook 4.2.

For work on PyXB, though, I really need something that handles XML Schema Definition (XSD) documents. Emacs nXML doesn’t come with XSD support, but the RELAX NG homepage points to Jeni Tennison’s schema as a candidate.

This is a great start, but when I tried using it with xmllint from libxml2 to validate some schemas supported by PyXB it said they were invalid. There are a variety of subtle issues the original version didn’t quite get right (and a few cases where my example schemas were wrong). I’ve updated the schema to fix those issues, and made it available on GitHub.
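For repeated checks it helps to wrap the xmllint call in a few lines of Python. This is a sketch under assumptions: the file names in the usage note are placeholders, the grammar has to be in RELAX NG XML syntax, and xmllint must be on the PATH (subprocess raises OSError otherwise).

```python
import subprocess


def xmllint_relaxng_command(rng_path, document_path):
    """Build the xmllint command line that validates one document."""
    # --noout suppresses the echoed document; --relaxng names the grammar.
    return ['xmllint', '--noout', '--relaxng', rng_path, document_path]


def is_valid(rng_path, document_path):
    """Return True if xmllint exits with status 0 for this document."""
    command = xmllint_relaxng_command(rng_path, document_path)
    return subprocess.call(command) == 0
```

Calling is_valid('xsd10.rng', 'po.xsd'), with both names standing in for real files, would run the check and leave xmllint's own diagnostics on the terminal.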

Emacs nXML comes with an XSLT RELAX NG schema, but only for version 1.0. As XSLT 3 is nearly complete at the time I’m writing this, I was hoping to find support to validate against other XSLT versions as well. Turns out Norman Walsh has provided a unified solution for XSLT 1.0, 2.0, and 3.0 on GitHub.

So: To support XSD and XSLT editing with nXML in Emacs 23, I put this in my .emacs file:

;; nXML mode customization
;; Open XSD and XSLT files in nXML mode.
(add-to-list 'auto-mode-alist '("\\.xsd\\'" . nxml-mode))
(add-to-list 'auto-mode-alist '("\\.xslt\\'" . nxml-mode))
(add-hook 'nxml-mode-hook
          (lambda ()
            ;; Indent with spaces only, and only in nXML buffers.
            (make-local-variable 'indent-tabs-mode)
            (setq indent-tabs-mode nil)
            ;; Consult the locating rules for the additional schemas.
            (add-to-list 'rng-schema-locating-files
                         "~/.emacs.d/nxml-schemas/schemas.xml")))

I copied the original schemas from the rng4xsd and xslt-relax-ng repositories, and used Trang to convert from the standard RELAX NG XML syntax to the compact syntax used by nXML. Then the following goes into ~/.emacs.d/nxml-schemas/schemas.xml:

<locatingRules xmlns="http://thaiopensource.com/ns/locating-rules/1.0">
  <!-- Extend to support W3C XML Schema Definition Language, which as
       of 1.1 is known as "XSD" rather than "XML Schema" to avoid
       confusion with other XML schema languages such as RELAX NG. -->
  <uri pattern="*.xsd" typeId="XSD"/>
  <namespace ns="http://www.w3.org/2001/XMLSchema" typeId="XSD"/>
  <documentElement localName="schema" typeId="XSD"/>
  <typeId id="XSD 1.0" uri="xsd10.rnc"/>
  <typeId id="XSD" typeId="XSD 1.0"/>

  <!-- Extend to support all three XSLT variants.  These are all in
       the same namespace, but are distinguished by a version
       attribute in the document element.  If unqualified, a catch-all
       version is used. -->
  <uri pattern="*.xsl" typeId="XSLT"/>
  <uri pattern="*.xslt" typeId="XSLT"/>
  <namespace ns="http://www.w3.org/1999/XSL/Transform" typeId="XSLT"/>
  <typeId id="XSLT 1.0" uri="xslt10.rnc"/>
  <typeId id="XSLT 2.0" uri="xslt20.rnc"/>
  <typeId id="XSLT 3.0" uri="xslt30.rnc"/>
  <typeId id="XSLT" uri="xslt.rnc"/>
</locatingRules>
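The typeId indirection in that file is easy to trace by hand or in a few lines of Python. The sketch below is a simplified model, not nXML's actual algorithm: it handles only the uri-pattern rules, ignores the namespace and documentElement rules, and uses a trimmed copy of the rules above. The function name schema_for is made up for illustration.

```python
import fnmatch
import xml.etree.ElementTree as ET

NS = '{http://thaiopensource.com/ns/locating-rules/1.0}'

RULES = """\
<locatingRules xmlns="http://thaiopensource.com/ns/locating-rules/1.0">
  <uri pattern="*.xsd" typeId="XSD"/>
  <typeId id="XSD 1.0" uri="xsd10.rnc"/>
  <typeId id="XSD" typeId="XSD 1.0"/>
</locatingRules>
"""


def schema_for(filename, rules_xml=RULES):
    """Return the schema URI chosen for filename, or None if no rule fits."""
    root = ET.fromstring(rules_xml)
    # Map each type identifier to its <typeId> element.
    type_ids = dict((e.get('id'), e) for e in root.findall(NS + 'typeId'))
    for rule in root.findall(NS + 'uri'):
        if not fnmatch.fnmatchcase(filename, rule.get('pattern')):
            continue
        type_id = rule.get('typeId')
        seen = set()
        # Follow typeId -> typeId links until one names a schema URI.
        while type_id in type_ids and type_id not in seen:
            seen.add(type_id)
            entry = type_ids[type_id]
            if entry.get('uri') is not None:
                return entry.get('uri')
            type_id = entry.get('typeId')
        return None
    return None
```

With these rules, schema_for('po.xsd') follows "XSD" to "XSD 1.0" and ends at xsd10.rnc, so switching every XSD file to a newer grammar later means changing only the one line that defines "XSD".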

Now when I write my XSD schemas in Emacs, the mode line tells me when they’re invalid, and I can use C-c C-n to jump to the error, with the explanation placed in the message line. Very nice.