
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 36348
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: trr [...] thomasrutter.com
Cc: pali [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: 2.89



Subject: Inconsistent error message returned on invalid UTF-8 character
This arose from an issue I had with the W3C validator, and I was referred to here as the error messages are generated by Encode.

Please see two test cases here:

http://arcticforest.com/tmp/test-EDA080.html
http://arcticforest.com/tmp/test-C1AA.html

For the first example, the validator sees the bytes \xED\xA0\x80 and complains
"utf8 "\xD800" does not map to Unicode".
For the second, the validator sees the bytes \xC1\xAA and complains
"utf8 "\xC1" does not map to Unicode".

The error messages are inconsistent. In the first, the error message complains about the hypothetical code point \xD800 which the bytes would otherwise map to, and in the second the error message complains about the actual byte in the data that wasn't valid. After discussion on the www-validator@w3.org list we decided that the first error message is at fault.

Follow conversation here:
http://lists.w3.org/Archives/Public/www-validator/2008May/0115.html

The error message "utf8 "\xD800" does not map to Unicode" is output when the sequence of bytes \xED\xA0\x80 is encountered, making finding the source of error difficult as \xD800 doesn't appear in the document, except in the sense that those bytes would represent \xD800 if it were otherwise allowed in UTF-8.

The error message should return the actual bytes encountered which aren't valid, rather than something like \xD800.

Tested on:
W3C Markup Validator 0.8.2,
http://validator.w3.org/

I am sorry that I don't know which Perl and Encode version it's running.
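To make the mismatch in the report concrete, here is a minimal Python sketch (not Encode's code, and not what the validator runs). The helper names `hypothetical_code_point` and `first_decode_error` are made up for illustration. It shows that the bytes \xED\xA0\x80 carry the bit pattern of the surrogate D800 even though that value never appears in the input, and that a decoder can instead point at the offending byte and its offset.

```python
def hypothetical_code_point(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 bit layout to seq, ignoring all validity rules."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def first_decode_error(data: bytes):
    """Return (offset, offending bytes) for the first UTF-8 error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None


surrogate_bytes = b"\xED\xA0\x80"   # first test case from the report
overlong_bytes = b"\xC1\xAA"        # second test case from the report

# The value the bytes "would" map to; it is not present in the document itself.
print(hex(hypothetical_code_point(surrogate_bytes)))

# What is actually in the document: the raw bytes and where they start.
print(first_decode_error(surrogate_bytes))
print(first_decode_error(overlong_bytes))
```

Reporting the offset and raw bytes, as the second helper does, gives the reader something they can search for in the document, which is what the report asks for.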
On Mon Jun 02 01:04:11 2008, trr wrote:
> This arose from an issue I had with the W3C validator, and I was
> referred to here as the error messages are generated by Encode.
>
> Please see two test cases here:
>
> http://arcticforest.com/tmp/test-EDA080.html
> http://arcticforest.com/tmp/test-C1AA.html
>
> For the first example, the validator sees the bytes \xED\xA0\x80 and
> complains
> "utf8 "\xD800" does not map to Unicode".
> For the second, the validator sees the bytes \xC1\xAA and complains
> "utf8 "\xC1" does not map to Unicode".
>
> The error messages are inconsistent. In the first, the error message
> complains about the hypothetical code point \xD800 which the bytes would
> otherwise map to, and in the second the error message complains about
> the actual byte in the data that wasn't valid. After discussion on the
> www-validator@w3.org list we decided that the first error message is at
> fault.
>
> Follow conversation here:
> http://lists.w3.org/Archives/Public/www-validator/2008May/0115.html
>
> The error message "utf8 "\xD800" does not map to Unicode" is output when
> the sequence of bytes \xED\xA0\x80 is encountered, making finding the
> source of error difficult as \xD800 doesn't appear in the document,
> except in the sense that those bytes would represent \xD800 if it were
> otherwise allowed in UTF-8.
>
> The error message should return the actual bytes encountered which
> aren't valid, rather than something like \xD800.
>
> Tested on:
> W3C Markup Validator 0.8.2,
> http://validator.w3.org/
>
> I am sorry that I don't know which Perl and Encode version it's running.
Fixed in Encode 2.89. The error message from the UTF-8 decoder now contains the escaped invalid bytes.
Also fixed.
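As a rough illustration of what "escaped invalid bytes" in an error message can look like, here is a small Python sketch. It is an assumption-laden example, not the Encode 2.89 implementation: the helper names `escape_bytes` and `describe_invalid` are hypothetical, and the message wording is simply borrowed from the messages quoted in the original report, so the exact text Encode emits may differ.

```python
def escape_bytes(data: bytes) -> str:
    """Render raw bytes as \\xNN escapes, e.g. b'\\xed\\xa0' -> '\\xED\\xA0'."""
    return "".join(f"\\x{b:02X}" for b in data)


def describe_invalid(data: bytes) -> str:
    """Build a message that names the bytes actually present in the input.

    The wording mirrors the messages quoted in the report; it is an
    assumption, not a confirmed copy of Encode's output.
    """
    return f'utf8 "{escape_bytes(data)}" does not map to Unicode'


print(describe_invalid(b"\xED\xA0\x80"))
print(describe_invalid(b"\xC1"))
```

With this shape, both test cases from the report are described in the same way: by the bytes found in the document rather than by a code point they would otherwise represent.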
On Sat Jul 25 09:19:33 2020, PALI wrote:
> Also fixed.