Subject: Inconsistent error message returned on invalid UTF-8 character
This arose from an issue I had with the W3C validator, and I was
referred here because the error messages are generated by Encode.
Please see two test cases here:
http://arcticforest.com/tmp/test-EDA080.html
http://arcticforest.com/tmp/test-C1AA.html
For the first example, the validator sees the bytes \xED\xA0\x80 and
complains
"utf8 "\xD800" does not map to Unicode".
For the second, the validator sees the bytes \xC1\xAA and complains
"utf8 "\xC1" does not map to Unicode".
The error messages are inconsistent. In the first, the error message
complains about the hypothetical code point \xD800 which the bytes would
otherwise map to, and in the second the error message complains about
the actual byte in the data that wasn't valid. After discussion on the
www-validator@w3.org list we decided that the first error message is at
fault.
The conversation can be followed here:
http://lists.w3.org/Archives/Public/www-validator/2008May/0115.html
The error message "utf8 "\xD800" does not map to Unicode" is output when
the byte sequence \xED\xA0\x80 is encountered. This makes the source of
the error hard to find, because \xD800 does not appear anywhere in the
document, except in the sense that those bytes would represent \xD800 if
it were allowed in UTF-8.
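To show where \xD800 comes from, here is a minimal Python sketch that applies the 3-byte UTF-8 bit layout to \xED\xA0\x80 with no validity checks. It is only an illustration, not Encode's code, and the function name decode_three_byte is made up for this example.

```python
def decode_three_byte(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx).

    Hypothetical helper for illustration: it performs no validity
    checks, so it happily produces surrogate code points.
    """
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(b"\xED\xA0\x80")
# Surrogates (U+D800..U+DFFF) are not allowed in UTF-8.
is_surrogate = 0xD800 <= code_point <= 0xDFFF
print(hex(code_point), is_surrogate)  # prints: 0xd800 True
```

So the value in the error message is the result of a decoding step, which is why it cannot be found by searching the document for it.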
The error message should report the actual invalid bytes that were
encountered, rather than something like \xD800.
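As a rough illustration of the kind of report I mean, the Python sketch below uses Python's own strict UTF-8 decoder (not Encode) to name the offending byte and its offset for both test cases. The helper name first_invalid_byte is an assumption made for this example.

```python
def first_invalid_byte(data: bytes):
    """Return (offset, byte value) of the first invalid byte, or None.

    Hypothetical helper: relies on Python's strict UTF-8 decoder,
    which records where decoding failed on the exception object.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.object[exc.start]
    return None


for sample in (b"\xED\xA0\x80", b"\xC1\xAA"):
    offset, byte = first_invalid_byte(sample)
    print(f"invalid byte \\x{byte:02X} at offset {offset}")
```

Both cases are then reported in the same terms, as bytes that actually occur in the input.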
Tested on:
W3C Markup Validator 0.8.2,
http://validator.w3.org/
I am sorry that I don't know which Perl and Encode version it's running.