Bug #36348 for Encode: Inconsistent error message returned on invalid UTF-8 character

Mon Jun 02 01:04:11 2008 trr [...] thomasrutter.com - Ticket created

Subject:

Inconsistent error message returned on invalid UTF-8 character

This arose from an issue I had with the W3C validator, and I was referred to here as the error messages are generated by Encode.

Please see two test cases here:

http://arcticforest.com/tmp/test-EDA080.html
http://arcticforest.com/tmp/test-C1AA.html

For the first example, the validator sees the bytes \xED\xA0\x80 and complains "utf8 "\xD800" does not map to Unicode".
For the second, the validator sees the bytes \xC1\xAA and complains "utf8 "\xC1" does not map to Unicode".

The error messages are inconsistent. In the first, the error message complains about the hypothetical code point \xD800 which the bytes would otherwise map to, and in the second the error message complains about the actual byte in the data that wasn't valid. After discussion on the www-validator@w3.org list we decided that the first error message is at fault.

Follow conversation here:
http://lists.w3.org/Archives/Public/www-validator/2008May/0115.html

The error message "utf8 "\xD800" does not map to Unicode" is output when the sequence of bytes \xED\xA0\x80 is encountered, making finding the source of error difficult as \xD800 doesn't appear in the document, except in the sense that those bytes would represent \xD800 if it were otherwise allowed in UTF-8.

The error message should return the actual bytes encountered which aren't valid, rather than something like \xD800.

Tested on:
W3C Markup Validator 0.8.2,
http://validator.w3.org/

I am sorry that I don't know which Perl and Encode version it's running.
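For readers following the byte arithmetic in the report, here is a minimal sketch in Python (not Perl, and not Encode's actual code) of how \xED\xA0\x80 yields the value 0xD800 when the 3-byte UTF-8 bit layout is applied without validity checks, and why \xC1 can be rejected on its own. The helper names `decode_3byte`, `decode_2byte`, `is_surrogate` and `strict_decode_fails` are hypothetical and exist only for this illustration; it makes no claim about how Encode produces its messages.

```python
def decode_3byte(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) with no validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def decode_2byte(seq: bytes) -> int:
    """Apply the 2-byte UTF-8 layout (110xxxxx 10xxxxxx) with no validity checks."""
    b0, b1 = seq
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """Surrogates (U+D800..U+DFFF) are not allowed in well-formed UTF-8."""
    return 0xD800 <= code_point <= 0xDFFF


def strict_decode_fails(seq: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the bytes."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# First test case: the bytes follow the 3-byte layout, but the result is a surrogate.
first = decode_3byte(b"\xED\xA0\x80")
print(hex(first), is_surrogate(first))

# Second test case: the 2-byte layout gives a value below 0x80, which is an
# overlong encoding; the lead byte 0xC1 can never start a valid sequence.
second = decode_2byte(b"\xC1\xAA")
print(hex(second), second < 0x80)

# Both inputs are rejected by a strict decoder.
print(strict_decode_fails(b"\xED\xA0\x80"), strict_decode_fails(b"\xC1\xAA"))
```

In the first case the problem only becomes visible after all three bytes are combined, whereas in the second the lead byte alone is already invalid. That is one plausible reason the two messages differ in what they name, though it is offered here as an assumption rather than a description of Encode's internals.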

Tue Jul 01 16:55:35 2008 DANKOGAI [...] cpan.org - Status changed from 'new' to 'open'

Fri Apr 21 06:50:48 2017 pali [...] cpan.org - Cc PALI added

Fri Apr 21 06:51:38 2017 pali [...] cpan.org - Fixed in 2.89 added

Fri Apr 21 06:53:55 2017 pali [...] cpan.org - Correspondence added

Fixed in Encode 2.89. Error message from UTF-8 decoder now contains escaped invalid bytes.

Sat Jul 25 09:19:33 2020 pali [...] cpan.org - Correspondence added

Also fixed.

Sat Jul 25 09:20:05 2020 DANKOGAI [...] cpan.org - Correspondence added

On Sat Jul 25 09:19:33 2020, PALI wrote:

> Also fixed.

Sat Jul 25 09:20:06 2020 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'