Subject: decode_utf8 is not equivalent to decode("utf8", ...)
Hi,
this is on Debian lenny, with all packages and the kernel taken from the
stock distribution (dom0 of a Xen environment, which should not play any
role in this problem).
uname -a gives: Linux fenrir 2.6.26-2-xen-686 #1 SMP Fri Sep 17 00:54:08
UTC 2010 i686 GNU/Linux
perl -v gives: This is perl, v5.10.0 built for i486-linux-gnu-thread-multi
From within the CPAN shell, we have installed the newest (as of the time
of this writing) versions of the following modules:
CGI, CPAN, CPAN::Test::Dummy::Perl5::Build,
CPAN::Test::Dummy::Perl5::Make, CPAN::Test::Dummy::Perl5::Make::Zip,
Encode, FCGI, Module::Signature, Perl, Test::Simple, YAML
The version of Encode is 2.40.
As of the time of this writing, the documentation for the Encode module
states that
$string = decode_utf8($octets [, CHECK]);
is equivalent to
$string = decode("utf8", $octets [, CHECK]).
This is definitely not true. We have a complex web app written in Perl
which has been running flawlessly for years, and which used the first
variant to decode some URL params which are fed into the application by
an HTTP POST request and are encoded in UTF-8.
After upgrading the underlying Debian distribution (and thus Perl and
the respective modules), the app started failing, mangling German umlauts
and other international characters. It took us more than two days of
debugging before we tried (illogically, given the documentation)
replacing the first variant with the second one; from that moment on, the
app ran without any flaws again.
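For anyone who wants to check their own installation, the two calls can
be compared directly. The following is only a minimal sketch, not taken
from our app: compare_decoders is a hypothetical helper, and the test
input (the letter "ä" as two UTF-8 octets) is made up for illustration.
It also tries a copy of the same octets with perl's internal UTF8 flag
switched on; whether the two calls treat such input alike on a given
Encode version is exactly what the script is meant to show, so no
particular result is assumed for that case.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 is_utf8);

# Hypothetical helper: run both calls on separate copies of the same
# input and report whether the results are identical.
sub compare_decoders {
    my ($input) = @_;
    my ($copy1, $copy2) = ($input, $input);
    my $via_short = decode_utf8($copy1);
    my $via_long  = decode("utf8", $copy2);
    return {
        same  => ($via_short eq $via_long ? 1 : 0),
        short => $via_short,
        long  => $via_long,
    };
}

# "ä" as two raw UTF-8 octets, the form in which POST data arrives.
my $octets = "\xC3\xA4";

# The same two octets, but stored with perl's internal UTF8 flag on.
my $flagged = $octets;
utf8::upgrade($flagged);

for my $case ([ "raw octets", $octets ], [ "flagged copy", $flagged ]) {
    my ($label, $value) = @$case;
    my $r = compare_decoders($value);
    printf "%-12s flag=%d same=%d lengths=%d/%d\n",
        $label, (is_utf8($value) ? 1 : 0), $r->{same},
        length($r->{short}), length($r->{long});
}
```

If "same" is 0 for either line, the two calls are not equivalent on that
installation for that kind of input.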
Since this was very infuriating, we would like to spare others the same
problem, the cause of which cannot be deduced by logical thinking alone,
and are therefore filing a bug now.
More precise description of setup:
Apache 2.2.9, no default charset configured;
Web page containing a form, whole page encoded as UTF-8 and tagged as
UTF-8 by HTTP headers and HTML meta tags, form with attribute
accept-charset="UTF-8";
A Perl script, itself encoded in UTF-8, with all files read and written
in UTF-8, including stdin and stdout, and using the CGI module version
3.49 for receiving / decoding the parameters sent by the form submit /
POST request;
Browser Firefox 3.6 (Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB;
rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3) or MS IE8;
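To make the decoding step in this setup concrete without depending on
the CGI module, here is a rough sketch of it. It assumes a standard
application/x-www-form-urlencoded POST body; parse_form_body is a
hypothetical helper written for this report, not the way CGI.pm works
internally, and the sample body is invented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the CGI module: split a form-urlencoded
# body into name/value pairs of raw octets, then decode both from
# UTF-8 into perl character strings.
sub parse_form_body {
    my ($body) = @_;
    my %params;
    for my $pair (split /[&;]/, $body) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                                # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # percent-escapes to octets
            $_ = decode("utf8", $_);                # octets to characters
        }
        $params{$name} = $value;
    }
    return \%params;
}

# "Größe=für" as a browser would send it from a UTF-8 form.
my $params = parse_form_body('Gr%C3%B6%C3%9Fe=f%C3%BCr');

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $params->{$_}\n" for sort keys %params;
```

The line calling decode("utf8", ...) is the one where our app originally
called decode_utf8.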
Problem:
decode_utf8 does not decode the parameter names and values from the POST
request correctly, decode("utf8", ...) does. Thus, decode_utf8 and
decode("utf8", ...) are not equivalent as stated by the docs.
Furthermore, for our app, it does not make any difference whether we use
decode("utf8", ...) or decode("UTF-8", ...), which also seems to
contradict the docs, but maybe utf8 and UTF-8 would give different
results in another scenario / with other strings.
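As far as we understand the Encode docs, "utf8" is perl's lax variant
and "UTF-8" is the strict one, so the two would only be expected to
differ on input that is not valid strict UTF-8. The following sketch
shows how one might check this; the sample strings are our own
assumption (well-formed "Grüße", and three octets that would encode the
surrogate U+D800), and we deliberately do not claim a particular result
for the second case.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two names resolve to different encoding objects.
printf "%-6s resolves to %s\n", $_, find_encoding($_)->name
    for qw(utf8 UTF-8);

# Well-formed input: "Grüße" as UTF-8 octets.
my $well_formed = "Gr\xC3\xBC\xC3\x9Fe";
my $lax    = decode("utf8",  $well_formed);
my $strict = decode("UTF-8", $well_formed);
print "well-formed input decodes identically: ",
    ($lax eq $strict ? "yes" : "no"), "\n";

# Octets that follow the UTF-8 bit pattern but encode a surrogate,
# which strict UTF-8 does not allow.
my $suspect = "\xED\xA0\x80";
{
    no warnings 'utf8';    # surrogates can trigger warnings on some perls
    my $lax_len    = length(decode("utf8",  $suspect));
    my $strict_len = length(decode("UTF-8", $suspect));
    print "suspect input: utf8 gives $lax_len char(s), ",
        "UTF-8 gives $strict_len char(s)\n";
}
```

For ordinary text such as our form parameters, both names should behave
the same, which matches what we observe.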
We are tagging the problem as important because it took a great deal of
manpower and time to find the cause of our app's failure; whether the
bug is in the docs or in the module, we think it has to be considered
important, because this particular cause of scripts mangling encodings
is very hard to find.