[MTOS-dev] XML parsing

Timothy Appnel tim at appnel.com
Thu Mar 6 08:00:12 PST 2008


So as I mentioned I have a lot to say since I've been dealing with the
issue of XML processing and MT Users longer than anyone.

Just a few points of clarification here:

* As already mentioned yesterday, having expat is not enough.
XML::Parser requires the source for expat to bind against while the
Perl module is being compiled. This adds to the complications of
getting that module installed, though these hurdles pale in
comparison to installing LibXML from source. I know package managers
can help here, but in my experience installing this module is A BEAST.

* Expat is used by and required by LibXML. LibXML is *a lot* more than
an XML parser. It includes the whole suite of XML specifications such
as DOM, XPath, XSLT and more.

* XML::XPath requires XML::Parser and has some serious XML namespace
flaws that I uncovered recently. (Namespaces are commonly used in
syndication feeds.) The author of XML::XPath has informed me it is not
being maintained (it hasn't been updated for years now) and it's
probably better left that way. Every XPath call causes the XML
document to be parsed, which is terribly inefficient. I was working
with XML::XPathEngine, which is pretty much a port of XML::XPath,
including some of the bugs that I traced back to XML::XPath. The
XML::XPathEngine module is being maintained and the author was working
on addressing these issues. (I have to check back in with him.) Mark's
Action Streams plugin, by virtue of the Web::Scraper package, uses this
same module.

* Similarly, XML::DOM has been abandoned and is a DOM level or two
behind, as I understand it.

* XML::Atom is not 1.0 compliant to my knowledge -- both the
syndication and protocol parts. The syndication issue may have changed
recently. You have to use either LibXML or XML::XPath. So one option
is a bear to install and the other is highly inefficient and
abandoned, and XML::Atom itself is not 1.0 compliant (when I last
checked) for good measure. This is why I finally had to go ahead and
write XML::Atom::Syndication: for 1.0 compliance and to gain the
flexibility/portability that SAX provides.

* I've talked to a number of developers who are not using SAX, but
LibXML exclusively. They all cling religiously to performance as being
more important than portability/accessibility. This drives me bat
shit. I have no doubt LibXML is more performant, but unless you are
doing a massive amount of XML parsing (most of us are not) I don't
see it making that big a difference. To me, these portability issues
are near show stoppers.

* I have no specific problem with LibXML -- if you need it and get it
working, great! You go! I just ask that whoever is working in this
space write portable and adaptable XML processing. That means SAX,
and LibXML does have a SAX driver.

* The problem with using SAX is that it doesn't put things into a DOM
or set up an XPath engine for you. Herein lies the big difference
between Expat and LibXML. Expat is a parser (that LibXML uses) and
LibXML is a full suite. Also, LibXML is written entirely in C, while
the alternative functionality is written in Perl -- and some of that,
as we discussed, has been abandoned.

* This all wouldn't be an issue (over time) if LibXML were part of the
base Perl distribution. The leaders of the Perl community have chosen
not to take an official stance on XML.
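
To make the SAX point above concrete, here is a minimal sketch of
parser-independent processing. It assumes the XML::SAX distribution is
installed; the TitleCollector package and the atom_titles function are
hypothetical names made up for this illustration. The handler never
names a specific parser: XML::SAX::ParserFactory picks whichever SAX
driver is available on the machine.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

my $ATOM_NS = 'http://www.w3.org/2005/Atom';

{
    # Hypothetical handler: collects the text of every Atom <title>.
    package TitleCollector;
    use base 'XML::SAX::Base';

    sub is_title {
        my ($el) = @_;
        return ($el->{NamespaceURI} || '') eq $ATOM_NS
            && $el->{LocalName} eq 'title';
    }

    sub start_document {
        my ($self) = @_;
        $self->{titles}   = [];
        $self->{in_title} = 0;
    }

    sub start_element {
        my ($self, $el) = @_;
        return unless is_title($el);
        $self->{in_title} = 1;
        push @{ $self->{titles} }, '';
    }

    sub characters {
        my ($self, $chars) = @_;
        # Text may arrive in several chunks, so append rather than assign.
        $self->{titles}[-1] .= $chars->{Data} if $self->{in_title};
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{in_title} = 0 if is_title($el);
    }
}

# Parse a feed with whatever SAX driver the factory selects and
# return the collected titles as a list.
sub atom_titles {
    my ($xml) = @_;
    my $handler = TitleCollector->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string($xml);
    return @{ $handler->{titles} || [] };
}

my $feed = '<feed xmlns="http://www.w3.org/2005/Atom">'
         . '<title>Feed</title><entry><title>One</title></entry></feed>';
print "$_\n" for atom_titles($feed);
```

Note that the handler has to track its own state (in_title) and build
its own result; that is the "left to your own devices" cost of SAX
compared with a DOM or XPath engine.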

Now some points from Mark's message:

>  While we don't parse XML in any of the core functions, we do in:
>
>  * the Atompub server(s)
>  * profile exchange for OpenID commenters

There are also the XML-RPC services and Feeds.App Lite.

>  ship XML::XPath in extlib to have known minimal XML support.

I'm not sure "known" is necessarily true. I seem to recall XML::XPath
will check out, but if XML::Parser and expat are not installed
properly it will blow up the first time it's used.
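
One way to catch that at check time is to run a real query instead of
only testing that the module loads. This is a rough sketch, assuming
the failure only shows up on first real use as described above;
xpath_really_works is a hypothetical helper, not something in MT.

```perl
use strict;
use warnings;

# Load XML::XPath and run one trivial query, so that a broken
# XML::Parser/expat install is detected here rather than later.
# Returns (1, '') on success or (0, $error_message) on failure.
sub xpath_really_works {
    my $ok = eval {
        require XML::XPath;
        my $xp    = XML::XPath->new(xml => '<a><b/></a>');
        my @nodes = $xp->findnodes('/a/b');
        die "unexpected query result\n" unless @nodes == 1;
        1;
    };
    return $ok ? (1, '') : (0, $@ || 'unknown error');
}

my ($ok, $err) = xpath_really_works();
print $ok ? "XML::XPath is usable\n" : "XML::XPath is NOT usable: $err";
```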

>  obvious what the problem is; mt-wizard reports that XML::Atom is
>  missing, which isn't precisely the case.

Actually I thought it reports XML::Atom is present though it won't work.

>  we would want to build a sort of "XML::Any" module that encapsulates
>  the selection of an XML library and provides a common API. That's a
>  non-trivial undertaking, so other solutions are welcome.

This is what SAX does already when it comes to parsing. The problem is
that after the parse you are left to your own devices. Creating an
XML::XPath::Any module would allow for the full range of needs to be
covered, from pure Perl portability to high performance. That still
doesn't address the form in which the parse tree is returned -- Perl
data structure, objects, DOM?
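
The selection half of such an "Any" module is the easy part. As a
rough sketch using only core Perl: the candidate list and its ordering
below are assumptions for illustration, and first_loadable is a
hypothetical helper. The hard part, a common API over whatever gets
selected, is not addressed here.

```perl
use strict;
use warnings;

# Candidate backends, fastest first and most portable last.
# This list is an assumption for illustration only.
my @CANDIDATES = qw(XML::LibXML XML::Parser XML::SAX::PurePerl);

# Return the name of the first module in the argument list that can
# actually be loaded, or undef if none of them can.
sub first_loadable {
    for my $module (@_) {
        (my $file = "$module.pm") =~ s{::}{/}g;
        return $module if eval { require $file; 1 };
    }
    return;
}

my $backend = first_loadable(@CANDIDATES);
print defined $backend
    ? "Using $backend\n"
    : "No XML backend available\n";
```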

>  This isn't a pressing issue and isn't a performance focus, but I see
>  us doing much more in the future with Atompub and XML APIs (I
>  certainly am in my hack-day time), so we should plan for it. Thanks in
>  advance for your input!

I agree, and that is why I think some type of plan has to be developed.
MT wouldn't be as powerful if it were tied to a specific web server or
database system -- I don't see XML parsers being much different.

<tim/>

-- 
Timothy Appnel
Appnel Solutions
http://appnel.com/

