Subject: CGI.pm - misleading documentation
Date: Fri, 5 Feb 2010 15:20:16 +0100 (CET)
To: bug-CGI.pm [...] rt.cpan.org
From: Helmut Richter <Helmut.Richter [...] lrz.de>
Hello,
this is not a report on a bug in CGI.pm itself (in fact it works perfectly,
although the documentation warns against a very useful feature!) but on a
problem in its documentation. If this is not the right address for such
comments, please forward.
Mit besten Grüßen / Best regards
Helmut Richter
====================================================
Dr. Helmut Richter Leibniz-Rechenzentrum
Tel: +49-89-35831-8785 Boltzmannstraße 1
Fax: +49-89-35831-9700 85748 Garching / Germany
====================================================
Problem
-------
The documentation as found in http://search.cpan.org/dist/CGI.pm/lib/CGI.pm
says about the -utf8 pragma:
| -utf8
|
| This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care,
| as it will interfere with the processing of binary uploads. It is better to
| manually select which fields are expected to return utf-8 strings and
| convert them using code like this:
|
|
| use Encode;
| my $arg = decode utf8=>param('foo');
I have the following qualms with it:
1. It is not at all obvious what exactly is meant by "treat all parameters
as UTF-8 strings", or what the consequences for the user of CGI.pm are.
The term "UTF-8 string" could mean "binary string containing UTF-8 encoded
data"; this is not what is meant (and, fortunately, it is not what
happens).
2. It is not true that it "interferes with the processing of binary
uploads". Quite the contrary: it is a special feature of the -utf8 pragma
that parameters are decoded from UTF-8 *without* interfering with binary
uploads (I guess by first extracting the binary data and decoding only the
remaining text). At least, I was not able to provoke any errors in binary
uploads while using the -utf8 pragma: it decoded the form input data
correctly without touching the binary upload data.
3. The unnecessary work-around in the last line of the quoted text is *not*
a functional substitute for the effect of the -utf8 pragma. If one uses it,
one still has to keep track of the encoding of parameters used as defaults,
e.g.
textfield(-name=>'field_name', -value=>'starting value', -size=>50,
-maxlength=>80);
will only work if the starting value is pure ASCII; otherwise it must be
replaced by "encode ('utf8', 'starting value')". Also, input parameter
values can be compared with constants only after proper decoding.
All this complicated and error-prone wizardry is unnecessary when using the
-utf8 pragma. There is no reason to warn against it.
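To illustrate point 3, here is a minimal sketch of the bookkeeping that the manual work-around requires. It uses only the core Encode module and no CGI request at all; the sample value and all variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw field value as it arrives from the form:
# the UTF-8 bytes for "Grüße" (7 bytes, 5 characters).
my $raw = "Gr\xC3\xBC\xC3\x9Fe";

# The same word as a Perl text string, written with escapes so that
# the example does not depend on the encoding of the source file.
my $constant = "Gr\x{FC}\x{DF}e";

# Comparing the undecoded bytes with the text constant fails ...
my $equal_before = ($raw eq $constant) ? 1 : 0;

# ... and succeeds only after decoding, as in the documented work-around.
my $text = decode('UTF-8', $raw);
my $equal_after = ($text eq $constant) ? 1 : 0;

# Going the other way, a non-ASCII default has to be encoded by hand
# before it is passed to a form element such as textfield().
my $default_bytes = encode('UTF-8', $constant);

printf "before=%d after=%d bytes=%d chars=%d\n",
    $equal_before, $equal_after, length($raw), length($text);
```

Every parameter read and every non-ASCII default needs one of these two calls when the pragma is not used, which is where mistakes creep in.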
Again: there is no need to modify the implementation of CGI.pm -- it does
exactly what is needed. Only the documentation must be updated to tell the
user what CGI.pm really does.
Suggested new wording
---------------------
-utf8
This makes CGI.pm treat all parameters as text strings rather than binary
strings (see *perlunitut* for the distinction), assuming UTF-8 for the
encoding of input/output from/to the form. This is typically used in
conjunction with a <form> tag containing the option 'accept-charset="UTF-8"'
to ensure UTF-8 input from the form and with 'binmode (STDOUT, ":utf8")' to
ensure UTF-8 output to the form, while all handling of the data within the
perl script manipulates only text strings.
CGI.pm does the decoding from the UTF-8 encoded input data, restricting this
decoding to input text as distinct from binary upload data which are left
untouched. Therefore, a ':utf8' layer must *not* be used on STDIN.
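
The setup described in this wording might look roughly as follows. This is only a sketch, assuming CGI.pm is installed; the helper page_body and the field name 'name' are hypothetical and serve only to show where the pragma, the accept-charset attribute and the binmode call go.

```perl
use strict;
use warnings;
use CGI qw(-utf8);              # parameters are returned as decoded text

# Build the page for one request. The helper page_body and the field
# 'name' are made up for this sketch.
sub page_body {
    my ($q) = @_;
    my $name = $q->param('name');
    $name = '' unless defined $name;

    return join "\n",
        '<form method="post" accept-charset="UTF-8">',
        # A non-ASCII default can be given as an ordinary text string.
        $q->textfield(-name => 'name', -value => "Gr\x{FC}\x{DF}e", -size => 50),
        '<input type="submit">',
        '</form>',
        # length() counts characters here, not bytes.
        '<p>Received ' . length($name) . ' characters.</p>',
        '';
}

binmode(STDOUT, ':utf8');       # encode output; no layer on STDIN

my $q = CGI->new;
print $q->header(-type => 'text/html', -charset => 'UTF-8');
print page_body($q);
```

Within page_body the script handles only text strings; the encoding is dealt with once on input by CGI.pm and once on output by the STDOUT layer.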