Anne van Kesteren

HTML Comments

Comments. As promised. Here is the way HTML comments are handled in a text/html environment including error handling. (At the moment in standards compliant mode and in almost standards compliant mode and not in quirks mode.) Unless HTML 5 defines it differently, of course. Unlikely given current implementations, but you never know. Weblogs are cat pictures, specifications are utopia and the real world is in between.

A comment start is <!. A comment end is >. No variations. In between you have dashes. Ugh. Dashes are --. Inside a pair of matching dashes a comment end must be treated as a literal and therefore Acid 2 works as it does. Fun, not?

Dashes are not strictly necessary on the parser level, they may well be a requirement of the specification though; <! test > is treated as a comment but is not necessarily correct. Per current specifications dashes have to be directly adjacent to a comment start; no whitespace in between. Browsers support whitespace in between. Hell, they even support text as shown above. Again, parsing requirements and document requirements are likely to differ.

Anyway, what happens with <!-- -- --> <!-- -->? When the parser reaches the end there is a problem. Dashes are missing. The last > is treated as part of the comment. The browser needs to reparse this. And now it gets interesting. The first < is to be treated as a literal, which makes it essentially, and things that follow it, a text node. That leaves us with a text node <!-- -- --> and a comment containing a space.
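
For the curious, the rules above can be sketched in a few lines of Python. This is only a reading of the description in this post, not any browser's actual code. The names `scan_comment` and `parse` are made up for illustration, every `<!` is assumed to open a comment (doctypes are ignored), and comments are kept as raw source rather than as their data.

```python
def scan_comment(text, start):
    """Scan the comment that opens with '<!' at index `start`.

    Returns the index just past the closing '>', or None when the end of
    the input is reached first (the "dashes are missing" case).
    """
    in_dashes = False           # True while inside an open pair of dashes
    j = start + 2               # skip the '<!'
    while j < len(text):
        if text.startswith("--", j):
            in_dashes = not in_dashes   # each '--' opens or closes a pair
            j += 2
        elif text[j] == ">" and not in_dashes:
            return j + 1        # a '>' outside dashes ends the comment
        else:
            j += 1              # anything else, including a literal '>'
    return None


def parse(text):
    """Split `text` into ("text", ...) and ("comment", ...) nodes."""
    nodes = []
    buf = []                    # pending character data
    i = 0
    while i < len(text):
        end = scan_comment(text, i) if text.startswith("<!", i) else None
        if end is None:
            # Ordinary character, or error handling: the '<' of a comment
            # that never ends is treated as a literal and becomes text.
            buf.append(text[i])
            i += 1
        else:
            if buf:
                nodes.append(("text", "".join(buf)))
                buf = []
            nodes.append(("comment", text[i:end]))
            i = end
    if buf:
        nodes.append(("text", "".join(buf)))
    return nodes
```

With this sketch, `parse("<!-- -- --> <!-- -->")` gives a text node `<!-- -- --> ` followed by the comment `<!-- -->`, which is the outcome described above. It does not try to model what browsers do when whitespace sits between `<!` and the first dashes.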

That is about it. I hope you find it as confusing as I do. I do get it though. And it sort of makes sense. Firefox handles comments this way in standards compliant mode. Opera is going to handle comments this way. Not right now, but the plan is there. I do not have a clue about Safari. Last time I heard they implemented something funny that makes them pass Acid 2, but does not do the literal < the second time they parse the comment.

Another, clearer explanation of HTML comment handling.

Comments

  1. Anyway, what happens with <!-- -- --> <!-- -->? When the parser reaches the end there is a problem. Dashes are missing.

    I don't get it. Where (and why) are dashes missing?

    And now it gets interesting. The first < is to be treated as well. That leaves us with a text node <!-- -- --> and a comment containing a space.

    If the first < is treated, wouldn't that start a comment?

    I'm so confused now...

    Posted by Mark Wubben at

  2. A hint. A pair of dashes is -- followed by some characters, followed by -- and nothing else. Dashes are still -- as defined.

    Posted by Anne at

  3. Actually, <! is a markup declaration open (MDO) and > is a markup declaration close (MDC). The comment start and comment end delimiters are actually the dashes.

    The browser needs to reparse this. And now it gets interesting. The first < is to be treated as a literal, which makes it essentially, and things that follow it, a text node. That leaves us with a text node <!-- -- --> and a comment containing a space.

    I have no idea what you're trying to say there, but it doesn't sound right. The first "<" is part of the MDO for the comment, I don't know what you mean by treated as a literal. The first comment contains a space. The second comment contains "> <!". The third comment starts with the last ">" but, as you mentioned earlier, the dashes are missing and, thus, there is no MDC for that erroneous comment declaration.

    Anyway, it is generally advisable that, even for text/html documents, authors stick to the XML comment syntax, which starts with "<!--" and ends with "-->" and does not contain "--" anywhere in between.

    Posted by Lachlan Hunt at

  4. One more thing, the WDG's explanation of HTML comments may be useful for some people.

    Posted by Lachlan Hunt at

  5. Oh, so we're gonna need to encode the right tag closing character in our code examples? :-P

    Posted by Remi at

  6. Lachlan, removed the word SGML from the post. I have defined what I meant so it does not really matter. What it means when a character is to be treated as a literal is that the character is no longer the start of a comment. It has become character data. You can try it in Firefox. Such a thing is called error handling.

    Posted by Anne at

  7. A hint. A pair of dashes is -- followed by some characters, followed by -- and nothing else. Dashes are still -- as defined.

    But wouldn't it be [comment start] -- [comment end][comment start] [comment end] then? Or are there now four dashes, and are four dashes needed to close the comment?

    Posted by Mark Wubben at

  8. Yes, I can see the strange logic, but it's not something I would try myself under normal circumstances.

    Posted by Robert Wellock at

  9. Ok, now that I've seen the results in Firefox, I understand what you're trying to say. I think error handling like this is absolutely insane, but I assume there must be some backwards compatibility reason why comments couldn't be implemented completely correctly.

    At the moment only in standards compliant mode, not in almost standards compliant mode...

    As I understand it, almost standards mode uses the "standards compliant" comment parsing and only the inline box model differs between the two modes.

    Posted by Lachlan Hunt at

  10. One more thing, I noticed in my own testing that Firefox handles comments differently depending on whether there is a space between the MDO and the comment open delimiter. For example:

    
    <! -- a -- --> <!-- -- b -->
    <!-- a -- --> <!-- -- b -->
    

    The first, which is invalid, will output <!-- -- b --> while the second, which is valid, correctly hides both.

    Posted by Lachlan Hunt at

  11. Define completely correctly. What should the resulting DOM look like? Interesting about almost standards compliant mode. I will fix the post.

    The other thing is also very interesting. And confusing.

    Posted by Anne at

  12. @Lachlan: That's reasonable error handling on Firefox's part. According to section 3.2.4 of HTML 4.01, "White space is not permitted between the markup declaration open delimiter("<!") and the comment open delimiter ("--")...." Thus, the SGML rules are ignored in the first HTML comment. The second HTML comment has unmatched comment delimiters. This forces FF to reparse the comment, treating the opening angle bracket ("<") as a literal.

    Posted by Tim Altman at

  13. Anne, I expected it to behave as if the comment continued until it found either "--" followed by ">" (assuming there's still an even number of comment delimiters ("--")) or the end of the document (whichever comes first), treating everything up to that point as being within the comment.

    Currently, it behaves correctly if it does eventually find the end of the comment (-->), but upon reaching the end of the document and not finding that, going back and reparsing is insane.

    Posted by Lachlan Hunt at

  14. Why is it not outputting the whole string?

    <! -- a -- --> <!-- -- b -->

    This is not a valid comment according to HTML 4.01:

    <! -- a -- -->

    The second HTML comment has unmatched comment delimiters. This forces FF to reparse the comment, treating the opening angle bracket ("<") as a literal.

    With the second HTML comment you mean

    <!-- -- b -->

    So it's quite clear to me why it's outputting this part of the string, but why not the first one too?

    Posted by Johannes Lichtenberger at

  15. Oh yes, some comments are weird science in Gecko.

    Posted by Mathias at

  16. Why is it not outputting the whole string?

    <! -- a -- --> <!-- -- b -->

    Since <! -- is not a valid comment start, we cannot assume that we're in the middle of a valid comment (and count -- as opening and closing the comment, etc). Instead, the HTML parser just looks for the nearest > and calls that the end of the "comment". Thus, we then look at the 2nd comment, find that it's unterminated, and so that we don't eat up the rest of the document, we just consume it as text (it's malformed anyway, and this behavior imitates what IE does when faced with something like <!-- foo bar baz).

    Oh yes, some comments are weird science in Gecko.

    The case there is a problem where our copy/paste code loses the fact that the comment is supposed to be parsed in strict (standards) mode, so bits and pieces of the comment end up showing in the pasted output. It isn't a problem with the actual parsing, per se.

    Posted by Blake Kaplan at

  17. The dashes game only works when there's nothing between the first pair of dashes and the comment start.

    <! --> foo <!--> bar -->

    Posted by zcorpan at

  18. ...that applies to the doctype declaration as well. According to SGML rules the following is allowed:

    <!DOCTYPE FOO --> this is a comment -- SYSTEM "foo.dtd">

    Posted by zcorpan at
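
Blake's description in comment 16 can be turned into a small Python sketch that reproduces the results Lachlan reported in comment 10. It is an interpretation of what is said in this thread, not Gecko's code. `comment_end` and `visible_text` are hypothetical names, and the key assumption is that dashes only count when `--` directly follows `<!`; otherwise the "comment" simply ends at the nearest `>`.

```python
def comment_end(text, start):
    """Return the index just past the end of the comment opening with
    '<!' at `start`, or None if the comment is never terminated."""
    body = start + 2                        # first index after '<!'
    if not text.startswith("--", body):
        # Assumed rule from comment 16: not a valid comment start,
        # so just look for the nearest '>'.
        close = text.find(">", body)
        return None if close == -1 else close + 1
    in_dashes = False
    j = body
    while j < len(text):
        if text.startswith("--", j):
            in_dashes = not in_dashes       # open or close a pair of dashes
            j += 2
        elif text[j] == ">" and not in_dashes:
            return j + 1
        else:
            j += 1
    return None                             # unterminated comment


def visible_text(text):
    """Return the character data left over once comments are removed."""
    out = []
    i = 0
    while i < len(text):
        end = comment_end(text, i) if text.startswith("<!", i) else None
        if end is None:
            out.append(text[i])             # literal character, '<' included
            i += 1
        else:
            i = end                         # skip the whole comment
    return "".join(out)


first = visible_text("<! -- a -- --> <!-- -- b -->")
second = visible_text("<!-- a -- --> <!-- -- b -->")
```

Under these assumptions the first of Lachlan's lines leaves ` <!-- -- b -->` visible, and the second leaves nothing.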