Subject: decode_utf8 is not equivalent to decode("utf8", ...)
Hi,
this is on Debian lenny; all packages and the kernel are stock distribution
(dom0 of a Xen environment, but this should not play any role here).
uname -a gives: Linux fenrir 2.6.26-2-xen-686 #1 SMP Fri Sep 17 00:54:08
UTC 2010 i686 GNU/Linux
perl -v gives: This is perl, v5.10.0 built for i486-linux-gnu-thread-multi
From within the CPAN shell, we have installed the newest (as of the time
of this writing) versions of the following modules:
CGI, CPAN, CPAN::Test::Dummy::Perl5::Build,
CPAN::Test::Dummy::Perl5::Make, CPAN::Test::Dummy::Perl5::Make::Zip,
Encode, FCGI, Module::Signature, Perl, Test::Simple, YAML
The version of Encode is 2.40.
As of the time of this writing, the documentation for the Encode module
states that
$string = decode_utf8($octets [, CHECK]);
is equivalent to
$string = decode("utf8", $octets [, CHECK]).
This is definitely not true. We have a complex web app written in Perl
which has been running flawlessly for years, and which used the first
variant to decode URL parameters that are fed into the application by
an HTTP POST request and are encoded in UTF-8.
After upgrading the underlying Debian distribution (and thus Perl and
the respective modules) from sarge to lenny, the app started mangling
German umlauts and other international characters. It took us more than
two days of debugging before we hit on the idea of replacing the first
variant with the second (illogical with respect to the documentation);
from that moment on, the app ran without any flaws again.
Since this was very infuriating, we would like to prevent others from
suffering the same problem, whose cause cannot be deduced by thinking
logically alone, and thus are filing a bug now.
More precise description of the setup:
Apache 2.2.9, no default charset configured;
Web page containing a form, the whole page encoded as UTF-8 and tagged
as UTF-8 via HTTP headers and HTML meta tags, the form carrying the
attribute accept-charset="UTF-8";
A Perl script, itself encoded in UTF-8, with all files read and written
in UTF-8, including stdin and stdout, using the CGI module version 3.49
to receive and decode the parameters sent by the form submit / POST
request;
Browser: Firefox 3.6 (Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB;
rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3) or MS IE8.
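For illustration, the parameter decoding in the script follows this pattern (the field name 'message' is hypothetical; the real app reads several form fields this way):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use Encode qw(decode decode_utf8);

my $q = CGI->new;

# 'message' is a hypothetical field name used here for illustration.
my $raw = scalar $q->param('message');

my $broken  = decode_utf8($raw);        # the variant that mangled umlauts
my $working = decode('utf8', $raw);     # the variant that works for us
```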
Problem:
decode_utf8 does not decode the parameter names and values from the POST
request correctly, while decode("utf8", ...) does. Thus decode_utf8 and
decode("utf8", ...) are not equivalent as stated by the docs.
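A plausible explanation, based on our reading of the Encode source of that era (this is an assumption on our part, not something the docs state): decode_utf8 returns its argument unchanged when the scalar's internal UTF8 flag is already set, while decode("utf8", ...) always decodes the octets. If CGI hands back parameters with the flag set, the two calls diverge. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 is_utf8);

# UTF-8 octets for the umlaut "ä" (U+00E4).
my $octets = "\xc3\xa4";

# Simulate a scalar whose internal UTF8 flag is already on, e.g.
# because an upstream module upgraded it (our assumption about what
# CGI did to the POST parameters).
my $flagged = $octets;
utf8::upgrade($flagged);   # flag on; now the two characters U+00C3, U+00A4

printf "flag set: %d\n", is_utf8($flagged) ? 1 : 0;
printf "decode('utf8', ...): U+%04X\n", ord decode('utf8', $flagged);
printf "decode_utf8(...):    U+%04X\n", ord decode_utf8($flagged);
# With the Encode version we observed, we would expect the decode()
# call to yield U+00E4 ("ä"), while decode_utf8 returns the flagged
# string unchanged (first char U+00C3) -- the mojibake we saw.
# Later Encode releases may behave differently here.
```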
Furthermore, for our app it makes no difference whether we use
decode("utf8", ...) or decode("UTF-8", ...), which also seems to
contradict the docs; but perhaps decode("utf8", ...) and
decode("UTF-8", ...) would give different results in another scenario /
with other octets.
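For the record, the utf8/UTF-8 distinction does exist: in Encode, "UTF-8" names the strict, standard-conforming codec, while "utf8" names Perl's lax internal variant, which also accepts surrogates and other sequences that strict UTF-8 rejects. Our parameters were well-formed UTF-8, which is presumably why we saw no difference. A sketch of where the two diverge (on malformed input):

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'utf8';   # silence the surrogate warning for this demo
use Encode qw(decode);

# 0xED 0xA0 0x80 encodes the surrogate U+D800 -- invalid in strict UTF-8.
my $octets = "\xed\xa0\x80";

my $lax    = decode('utf8',  $octets);  # lax: yields the surrogate itself
my $strict = decode('UTF-8', $octets);  # strict: substitutes U+FFFD by default

printf "lax:    U+%04X\n", ord $lax;    # U+D800
printf "strict: U+%04X\n", ord $strict; # U+FFFD
```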
We are tagging the problem as important because it took a great deal of
man-power and time to find the cause of our app's failure; whether the
bug is in the docs or in the module, we think it has to be considered
important, because this particular cause of failing scripts or mangled
characters is very hard to find.
Thanks,
Peter