Anne van Kesteren

XML5

One of my side projects is XML5. Earlier this year I suggested the idea as XML 2.0, but in line with recent “jokes” about HTTP5, SVG5, and CSS5, XML5 makes perfect sense. The idea of XML5 is to provide a revision of XML 1.0, XML 1.1, Namespaces in XML 1.0, Namespaces in XML 1.1, and RFC 3023 that is backwards compatible and introduces HTML-like, although much saner, error recovery. My implementation now handles most features apart from attribute value normalization (attribute value defaulting and such work) and character encoding sniffing. I’m hosting an XML playground online where you can play around with the implementation. (The xml5 Google Code project has the source code and a specification that doesn’t yet cover everything the implementation does.)

Because some people thought this was the case last time, I’ll be clear: there’s no guessing involved. The idea is to provide an unambiguous mapping from any byte stream to an XML tree representation. Working on XML5 you slowly start to realize how crazy XML really is:

<!DOCTYPE y [
 <!ENTITY % b '&#37;c;'>
 <!ENTITY % c '&#60;!ENTITY a "x" >'>
 %b;
]>
<y>&a;</y>

Entities are in fact a fricking nightmare:

<!DOCTYPE y [<!ENTITY % a '&#37;b;'><!ENTITY % b '&#37;a;'>%a;]><y/>

My solution to the above problem is having a hard limit on the total number of references an entity can make. It’s sixteen. This deals with recursion and the million laughs attack, and seems like decent recovery behavior for such an error. Although I don’t think DOCTYPEs should be conforming at all, personally.
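
To make the idea of a reference budget concrete, here is a minimal sketch in Python. It is not the XML5 implementation: it models only general entities of the form `&name;` held in a plain dictionary (the examples above use parameter entities inside a DTD), and the names `expand_entity`, `ReferenceLimitExceeded`, and `MAX_REFERENCES` are made up for illustration. Raising an exception when the budget runs out is also an assumption; the actual recovery behavior may well differ.

```python
import re

MAX_REFERENCES = 16  # the limit mentioned in the post

# Simplified pattern for a general entity reference such as &laugh3;
# (real XML names allow more characters than this).
REFERENCE = re.compile(r"&([A-Za-z_][\w.-]*);")


class ReferenceLimitExceeded(Exception):
    """Expanding one entity needed more references than the budget allows."""


def expand_entity(name, entities, limit=MAX_REFERENCES):
    """Return the fully expanded text of entities[name].

    Every reference resolved during the expansion, at any nesting depth,
    is charged against a single shared budget of `limit` references.
    """
    remaining = limit

    def expand(text):
        def replace(match):
            nonlocal remaining
            ref = match.group(1)
            if ref not in entities:
                return match.group(0)  # leave unknown references untouched
            if remaining == 0:
                raise ReferenceLimitExceeded(name)
            remaining -= 1
            return expand(entities[ref])

        return REFERENCE.sub(replace, text)

    return expand(entities[name])


if __name__ == "__main__":
    # Mutual recursion: stopped once sixteen references have been resolved.
    loop = {"a": "&b;", "b": "&a;"}
    try:
        expand_entity("a", loop)
    except ReferenceLimitExceeded:
        print("recursive entity rejected")

    # A doubling chain: small levels fit in the budget, large ones do not.
    laughs = {"laugh0": "ha"}
    for i in range(1, 31):
        laughs[f"laugh{i}"] = f"&laugh{i - 1};" * 2
    print(expand_entity("laugh3", laughs))
```

Because the budget is shared across the whole expansion rather than counted per nesting level, the same check catches both the endless loop and the exponential blow-up.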

Note that XML5 is just an idea, don’t take offense just yet.

Comments

  1. Lack of good well-defined error handling has given HTML many problems. But good error handling doesn't mean giving up, throwing everything away, and doing nothing. Why would anybody think that?

    Although it might just be ignored, it would probably be good to require that implementations notify users that errors occurred. If that was actually implemented, it would give web developers an incentive to produce well-formed XML.

    This is a big change, so while you're at it, why not get rid of doctype declarations? They add much complexity to parsing and are usually only used for versioning anyway (when they're not ignored). RelaxNG and even XSD are much better solutions for what DTDs are intended to do.

    Posted by James Justin Harrell at

  2. What Mr. Harrell said. Just drop the <!DOCTYPE>. Immense numbers of problems vanish. Mind you, the MathML crowd will come over to your house and strangle your kittens. But really, in the whole world they're the only ones who care; if I were XML dictator, I'd drop it in a microsecond.

    Posted by Tim Bray at

  3. As the spec is just a listing of all the states in the FSM, why don't you generate the spec from the code? Or create an XML file from which you generate both the code and the spec.

    Posted by Sjoerd Visscher at

  4. Drop <!DOCTYPE>. It's just archaic, almost never used in real-world applications, and adds an unthinkably large amount of complexity to a format that, on the face of it, is very straightforward and easy to implement.

    Posted by Asbjørn Ulsberg at

  5. Sjoerd, the specification was largely generated from the code. But the remaining parts can’t really be done that way, I think.

    As for DOCTYPE, don’t we need it for backwards compatibility? I was planning, by the way, to please the MathML crowd (and others) by introducing a large number of predefined named entities, straight from HTML and MathML. That doesn’t give too much overhead and seems to serve a useful goal.

    Posted by Anne van Kesteren at

  6. Unfortunately DOCTYPE support is probably necessary for back-compat, but one could certainly make it non-conforming and perhaps an optional feature for processors to support (although that may be a bad idea because options are intrinsically a compatibility headache), since in some contexts, e.g. browsers, the amount of existing content that relies on the doctype for more than just external entities may be negligible.

    Posted by jgraham at

  7. Can someone tell me why?
    Why add error recovery to XML? Is it because you cannot stop implementors from doing it, and so it needs to be defined?
    What if someone implements other (better?) error recovery than what is defined in the spec?

    Posted by David at

  8. Implementors of feed readers are doing error recovery. Mobile vendors are blatantly ignoring XML rules despite everyone claiming that XHTML is such a success on mobile and that mobile phones are driving the market now. Well, maybe that’s mostly the W3C. Opera does do XML correctly by the way, also on mobile, although we’re forced now and then by the market to break the rules set forth by the application/xhtml+xml media type (parse it using an HTML parser instead). Having a set of rules these people can follow seems better than having none.

    If implementors want different error recovery than what I described that’s fine and the specification can be amended. The idea is of course that eventually we end up with a stable standard, but until we get there changes can be made.

    Posted by Anne van Kesteren at

  9. The most common bug report I get on my feed parser is XML errors. It gives me a huge incentive to fix the most common "bug" — XML parse errors. Do you want yet another set of error handling rules, like everything else totally undefined, or would you rather there be some spec that can be followed so the error handling is interoperable?

    HTML UAs are forced to reverse engineer one-another to remain compatible with the web, as otherwise pages become reliant on a certain set of behaviours. If this is specified, there is no need to reverse engineer endlessly.

    Posted by Geoffrey Sneddon at

  10. I'm curious what you think of the XML in this article. Both Opera and Firefox consider it well-formed, but IE doesn't. Your XML playground can't handle it either (at least the end result isn't what you get in Opera and Firefox). So is it well-formed, or isn't it?

    Posted by James Holderness at

  11. I think it is, although I’m not 100% sure. The problem you’re running into with XML5, and I think Internet Explorer too, is the "nested" entity limit. For XML5 that’s sixteen at the moment, although I can easily make that limit a bit higher. (The limit is needed to prevent the million laughs attack and recursion.)

    Posted by Anne van Kesteren at

  12. I don't think that's the problem. There's actually very little nested expansion going on. Here's a simpler example:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE y [
    <!ENTITY % a '<!ENTITY c "&#37;b;">'>
    <!ENTITY % b 'Hello'>
    %a;
    ]>
    <y>&c;</y>
    

    The problem (as I understand it) is that I'm declaring an entity (c in the example above) that contains a parameter-entity reference (b). This is considered (by IE) to be a violation of the well-formedness constraint, PEs in Internal Subset.

    I'm assuming Firefox and Opera don't notice the violation (or don't consider it a violation) because the entity declaration isn't strictly part of the DTD - it's only declared as a result of another PE reference being expanded (a in the example above).

    PS: IE is vulnerable to the billion laughs attack.

    Posted by James Holderness at

  13. Yes, AFAICT, it violates that well-formedness constraint you're citing, and therefore IE is correct and the others have a bug.

    Posted by zcorpan at

  14. James, ah, yes you’re right. That also explains the result in XML5 better. It simply doesn’t care about parameter references inside markup declarations. Thanks!

    Posted by Anne van Kesteren at

  15. OK, can someone tell me (in not too technical terms) what the "million laughs attack" is? Google only brings up a page of various people asking about it on XML-related discussions.

    Posted by Chris Hester at

  16. It’s actually called the “billion laughs attack” (text/plain so it can be viewed safely), because a million laughs alone doesn’t usually cause serious problems.

    It’s simply an XML document with a DTD subset which defines a series of entities, each expanding to two of the previous entities. The result is that the fully expanded length of these entities grows exponentially. Where &laugh0; is 2 characters long, &laugh10; is 2,048, &laugh20; is 2,097,152, and by &laugh30; you’re looking at 2,147,483,648 characters. A naïve parsing of such an XML document into an in-memory tree is likely to fill up all available memory and then some.

    If it doesn’t, you just need to add a few more lines. Add a few hundred more lines, and the expansion would have more characters than the universe has particles.

    So XML parsers need to limit their entity processing in some fashion in order to avoid denial of service.

    Posted by Aristotle Pagaltzis at

  17. "Mind you, the MathML crowd will come over to your house and strangle your kittens. But really, in the whole world they're the only ones who care;"

    Actually large parts of the MathML crowd would probably thank you... The main difficulty with using entities in MathML is the DOCTYPE syntax, which means you either have to pass fragments around as not well-formed, or you put on the doctype and have to keep removing it again. Just losing entities altogether or building in a fixed set would simplify things greatly...

    I think you'd find a lot of non-MathML users missing entities if they went; the most common entity I see being used on XML fora is nbsp, which is hardly ever used in mathematics.

    Questions about nbsp swamp (or used to swamp, before we got FAQs set up) XSLT lists, as in why they couldn't use the syntax, or why after enabling the entity (or using #160 directly) they get a weird accented A before the space.

    Posted by David Carlisle at

  18. XML is hard and surprising enough. This proposal makes it harder, not easier. It only succeeds in sweeping errors under the rug. The problem with non-draconian error handling--specifically with error correction--is that the errors will be corrected in a way neither authors nor consumers expect. Instead of failing as soon as possible, processes will now fail later and further away from the actual problem. This is a step backwards.

    In HTML the worst that can happen is that the page looks funky. But for the uses to which XML is put? Almost anything could happen.

    Posted by Elliotte Harold at
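
The growth figures quoted in comment 16 can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming (as the comment describes) that &laugh0; is 2 characters long and that each later entity expands to two copies of the previous one. The function names `expanded_length` and `first_level_exceeding` are made up for illustration.

```python
def expanded_length(level, base_length=2):
    """Characters in the full expansion of &laugh<level>;.

    Each level doubles the previous one, so the length is
    base_length * 2**level.
    """
    return base_length * 2 ** level


def first_level_exceeding(limit, base_length=2):
    """Smallest level whose full expansion is longer than `limit` characters."""
    level = 0
    while expanded_length(level, base_length) <= limit:
        level += 1
    return level


if __name__ == "__main__":
    for level in (0, 10, 20, 30):
        print(f"&laugh{level}; expands to {expanded_length(level):,} characters")
```

A parser that caps the expanded size at some number of characters is in effect rejecting every level at or beyond `first_level_exceeding(cap)`, which stays a small number even for very large caps.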