Subject: decode_utf8 is not equivalent to decode("utf8", ...)
Hi,
this is on Debian lenny; all packages and the kernel are stock distribution
(dom0 of a Xen environment, but this should not play any role here).
uname -a gives: Linux fenrir 2.6.26-2-xen-686 #1 SMP Fri Sep 17 00:54:08
UTC 2010 i686 GNU/Linux
perl -v gives: This is perl, v5.10.0 built for i486-linux-gnu-thread-multi
From within the CPAN shell, we have installed the newest (as of the time
of this writing) versions of the following modules:
CGI, CPAN, CPAN::Test::Dummy::Perl5::Build,
CPAN::Test::Dummy::Perl5::Make, CPAN::Test::Dummy::Perl5::Make::Zip,
Encode, FCGI, Module::Signature, Perl, Test::Simple, YAML
The version of Encode is 2.40.
As of the time of this writing, the documentation for the Encode module
states that
$string = decode_utf8($octets [, CHECK]);
is equivalent to
$string = decode("utf8", $octets [, CHECK]).
This is definitely not true. We have a complex web app written in Perl
which has been running flawlessly for years, and which used the first
variant to decode URL parameters that are fed into the application by
an HTTP POST request and are encoded in UTF-8.
After upgrading the underlying Debian distribution (and thus Perl and
the respective modules) from sarge to lenny, the app started mangling
German umlauts and other international characters. It took us more than
two days of debugging before we hit on the idea of replacing the first
variant with the second (illogical with respect to the documentation);
from that moment on, the app ran without any flaws again.
Since this was very infuriating, we would like to prevent others from
suffering the same problem, whose cause cannot be deduced by thinking
logically alone, and thus are filing a bug now.
More precise description of the setup:
Apache 2.2.9, no default charset configured;
Web page containing a form, the whole page encoded as UTF-8 and tagged
as UTF-8 via HTTP headers and HTML meta tags, the form carrying the
attribute accept-charset="UTF-8";
A Perl script, itself encoded in UTF-8, with all files read and written
in UTF-8, including stdin and stdout, using the CGI module version 3.49
to receive and decode the parameters sent by the form submit / POST
request;
Browser: Firefox 3.6 (Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB;
rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3) or MS IE8.
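For illustration, the parameter decoding in the script follows this pattern (the field name 'message' is hypothetical; the real app reads several form fields this way):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use Encode qw(decode decode_utf8);

my $q = CGI->new;

# 'message' is a hypothetical field name used here for illustration.
my $raw = scalar $q->param('message');

my $broken  = decode_utf8($raw);        # the variant that mangled umlauts
my $working = decode('utf8', $raw);     # the variant that works for us
```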
Problem:
decode_utf8 does not decode the parameter names and values from the POST
request correctly, while decode("utf8", ...) does. Thus decode_utf8 and
decode("utf8", ...) are not equivalent as stated by the docs.
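A plausible explanation, based on our reading of the Encode source of that era (this is an assumption on our part, not something the docs state): decode_utf8 returns its argument unchanged when the scalar's internal UTF8 flag is already set, while decode("utf8", ...) always decodes the octets. If CGI hands back parameters with the flag set, the two calls diverge. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 is_utf8);

# UTF-8 octets for the umlaut "ä" (U+00E4).
my $octets = "\xc3\xa4";

# Simulate a scalar whose internal UTF8 flag is already on, e.g.
# because an upstream module upgraded it (our assumption about what
# CGI did to the POST parameters).
my $flagged = $octets;
utf8::upgrade($flagged);   # flag on; now the two characters U+00C3, U+00A4

printf "flag set: %d\n", is_utf8($flagged) ? 1 : 0;
printf "decode('utf8', ...): U+%04X\n", ord decode('utf8', $flagged);
printf "decode_utf8(...):    U+%04X\n", ord decode_utf8($flagged);
# With the Encode version we observed, we would expect the decode()
# call to yield U+00E4 ("ä"), while decode_utf8 returns the flagged
# string unchanged (first char U+00C3) -- the mojibake we saw.
# Later Encode releases may behave differently here.
```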
Furthermore, for our app it makes no difference whether we use
decode("utf8", ...) or decode("UTF-8", ...), which also seems to
contradict the docs; but perhaps decode("utf8", ...) and
decode("UTF-8", ...) would give different results in another scenario /
with other octets.
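For the record, the utf8/UTF-8 distinction does exist: in Encode, "UTF-8" names the strict, standard-conforming codec, while "utf8" names Perl's lax internal variant, which also accepts surrogates and other sequences that strict UTF-8 rejects. Our parameters were well-formed UTF-8, which is presumably why we saw no difference. A sketch of where the two diverge (on malformed input):

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'utf8';   # silence the surrogate warning for this demo
use Encode qw(decode);

# 0xED 0xA0 0x80 encodes the surrogate U+D800 -- invalid in strict UTF-8.
my $octets = "\xed\xa0\x80";

my $lax    = decode('utf8',  $octets);  # lax: yields the surrogate itself
my $strict = decode('UTF-8', $octets);  # strict: substitutes U+FFFD by default

printf "lax:    U+%04X\n", ord $lax;    # U+D800
printf "strict: U+%04X\n", ord $strict; # U+FFFD
```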
We are tagging the problem as important because it took a great deal of
man-power and time to find the cause of our app's failure; whether the
bug is in the docs or in the module, we think it has to be considered
important, because this particular cause of failing scripts or mangled
characters is very hard to find.
Thanks,
Peter