Bug #58167 for libwww-perl: HTTP::Message decoded_content() does not decode ISO-8859-1

Sat Jun 05 16:10:09 2010 PENMA [...] cpan.org - Ticket created

Subject:

HTTP::Message decoded_content() does not decode ISO-8859-1

The documentation for HTTP::Message states that decoded_content() decodes the content and returns a "perl Unicode string", unless the charset is explicitly set to "none". However, it also leaves "iso-8859-1" encoded content unconverted, both when the encoding is autodetected and when it is specified manually. This means that iso-8859-1 documents are never decoded and the function does not return a character string for them. This is wrong. Code that assumes it receives a character string breaks in hard-to-predict ways; for example, uc($content) does not return the correct result. The iso-8859-1 encoding should be removed from the hardcoded list of encodings for which no decoding is done, so that this routine returns a character string for these documents too.
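The uc() symptom can be reproduced without LWP at all. The following is a minimal sketch, assuming a perl without the unicode_strings feature or a locale in effect; the variable names are illustrative, and the literal "\xe9" merely stands in for content that decoded_content() is reported to hand back undecoded.

```perl
use strict;
use warnings;

# "\xe9" is e-acute in ISO-8859-1, stored here as a single octet
# without the internal UTF8 flag (stand-in for undecoded content).
my $native = "\xe9";

# The same one-character string after utf8::upgrade(); only the
# internal representation changes, not the characters.
my $upgraded = $native;
utf8::upgrade($upgraded);

my $same_string = $native eq $upgraded ? 1 : 0;
my $uc_native   = uc $native;      # ASCII-only case rules apply here
my $uc_upgraded = uc $upgraded;    # Unicode case rules apply here

printf "equal:        %d\n", $same_string;
printf "uc(native)   = U+%04X\n", ord $uc_native;
printf "uc(upgraded) = U+%04X\n", ord $uc_upgraded;
```

The two strings compare equal, yet uc() leaves the first one unchanged and maps the second one to U+00C9.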

Sat Jun 05 16:23:18 2010 DERF [...] cpan.org - Cc DERF added

Thu Sep 23 10:06:01 2010 felix.ostmann [...] thewar.de - Correspondence added

From:

felix.ostmann [...] thewar.de

It is very important to fix that hardcoded list! No document with ISO-8859-1 is working here!

Thu Sep 23 10:06:02 2010 The RT System itself - Status changed from 'new' to 'open'

Thu Sep 23 15:36:13 2010 GAAS [...] cpan.org - Correspondence added

What's not working for your ISO-8859-1 strings? It's not the internal utf8 flag that defines "Unicode string". My position is that I should not promote the "Unicode Bug" (see perlunicode(1)). It's a bug if the semantics of an ISO-8859-1 string depend on utf8::upgrade(). The real reason is that not upgrading ISO-8859-1 made the code much simpler when trying to support perl-5.6. Older versions of LWP did upgrade these strings. I also like more efficient code.

Thu Sep 23 16:20:34 2010 PENMA [...] cpan.org - Correspondence added

On Thu Sep 23 15:36:13 2010, GAAS wrote:

> What's not working for your ISO-8859-1 strings?

uc() for example, as I said.

> It's not the internal utf8 flag that defines "Unicode string". My position is that I should not promote the "Unicode Bug" (see perlunicode(1)). It's a bug if the semantics of an ISO-8859-1 string depend on utf8::upgrade().

I do not know what you are trying to tell me. The documentation states that the function in question "Returns the content with [...] the raw content encoded to perl's Unicode strings.". It obviously fails to do that for iso-8859-1. If the encoding of the document is specified as iso-8859-1, it never returns a unicode/character string, which makes things like uc() not work. It also makes it impossible to match \xa0 as a whitespace character. All of this is caused by the returned string not being a character string.
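The \xa0 case can be sketched the same way. This assumes the default regex semantics (no /u, /a or /l modifier, no unicode_strings feature, no locale); the variable names are illustrative only.

```perl
use strict;
use warnings;

# NO-BREAK SPACE as a single Latin-1 octet, UTF8 flag off.
my $native = "\xa0";

# The same character after utf8::upgrade().
my $upgraded = $native;
utf8::upgrade($upgraded);

# With a non-upgraded target and default semantics, \s only knows
# the ASCII whitespace characters; with an upgraded target it
# follows the Unicode White_Space property.
my $native_matches   = $native   =~ /^\s$/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /^\s$/ ? 1 : 0;

print "native matches \\s:   $native_matches\n";
print "upgraded matches \\s: $upgraded_matches\n";
```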

> The real reason is that not upgrading ISO-8859-1 made the code much simpler when trying to support perl-5.6. Older versions of LWP did upgrade these strings.

Luckily, 5.6 is even more obsolete than 5.8 will hopefully be in the near future.

> I also like more efficient code.

I like code that works. What use is fast code that does not work?

Thu Sep 23 16:41:13 2010 GAAS [...] cpan.org - Correspondence added

On Thu Sep 23 16:20:34 2010, PENMA wrote:

> On Thu Sep 23 15:36:13 2010, GAAS wrote:

> > What's not working for your ISO-8859-1 strings?

> uc() for example, as I said.

Then fix uc(). One way is to add a 'use 5.12.0;' declaration to your code to request sane uc() behaviour.
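As a rough illustration of that suggestion: 'use 5.12.0;' loads the 5.12 feature bundle, which includes unicode_strings. The sketch below enables only that feature, and only inside a block, so the default and the requested behaviour can be compared side by side. It assumes perl 5.12 or later; the variable names are illustrative.

```perl
use strict;
use warnings;

# e-acute as a single Latin-1 octet, UTF8 flag off.
my $native = "\xe9";

# Default semantics: uc() treats the octet as caseless.
my $uc_default = uc $native;

# With unicode_strings in scope, uc() uses Unicode rules no matter
# how the string is stored internally.
my $uc_feature;
{
    use feature 'unicode_strings';
    $uc_feature = uc $native;
}

printf "default:         U+%04X\n", ord $uc_default;
printf "unicode_strings: U+%04X\n", ord $uc_feature;
```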

Thu Sep 23 18:18:00 2010 PENMA [...] cpan.org - Correspondence added

On Thu Sep 23 16:41:13 2010, GAAS wrote:

> Then fix uc(). One way is to add a 'use 5.12.0;' declaration to your code to request sane uc() behaviour.

You're probably right. But this does not change the fact that the documentation clearly promises a _Unicode_ string but sometimes doesn't return one. The unicode_strings feature does not make /^\s$/ match "\xa0" (because why should it? It's a 0xa0 octet) while it matches a UTF-8-flagged "\xc2\xa0". Which is probably correct. This was the original problem to start with, I just replaced it with uc() because it showed the same misbehaviour in 5.10. If this isn't going to be unfucked in perl before 5.24, it would probably be nice to have the behaviour of the function match its documentation. Either by documenting this special case in the POD, or by adding a parameter that does convert iso-8859-1 to a Unicode string, too. (I think this unicode_strings feature is the most insane thing I've seen in a new perl version. It also reminds me of 5.6 Unicode support - incompletely done and completely broken. Wait, it's documented that it is broken so it's a feature...)
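Until the function or its POD changes, a caller could upgrade the returned string itself. The helper below is a hypothetical sketch, not part of LWP and not the parameter proposed above; in real code its argument would be the value returned by $response->decoded_content, for which a literal stands in here.

```perl
use strict;
use warnings;

# Hypothetical caller-side helper (not an LWP API): return a copy of
# the content in perl's upgraded internal representation.
sub as_upgraded_string {
    my ($content) = @_;
    return $content unless defined $content;
    utf8::upgrade($content);    # changes our copy only, not the caller's
    return $content;
}

# Stand-in for ISO-8859-1 content: "caf" + e-acute + NO-BREAK SPACE.
my $content = as_upgraded_string("caf\xe9\xa0");

my $is_upgraded  = utf8::is_utf8($content) ? 1 : 0;
my $uppercased   = uc $content;
my $ends_in_space = $content =~ /\s$/ ? 1 : 0;

print "upgraded:      $is_upgraded\n";
print "ends in space: $ends_in_space\n";
```

The upgrade does not change which characters the string contains; it only selects the internal representation for which uc() and \s apply Unicode rules by default.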

Fri Sep 24 16:32:03 2010 GAAS [...] cpan.org - Correspondence added

I've applied <http://github.com/gisle/libwww-perl/commit/ced6cbce9df912a98436a800870a86f2503a2d7e>. Don't want to argue this much more. Unicode in Perl is (still) a mess, ... and a Unicode string is not the same as a utf8::upgraded string.

Fri Sep 24 16:32:03 2010 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'