Skip Menu |

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 40502
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: MARKSTOS [...] cpan.org
Requestors: ky6uk.kun [...] gmail.com
Cc: Martijn van de Streek (no email address)
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: (no value)



Subject: escapeHTML unicode bug
for example: use CGI qw(escapeHTML); my $cgi = CGI->new(); $cgi->header(-encode => 'utf-8'); print "ы\n"; print escapeHTML("ы"); result: ы �‹ ------------- CGI-3.42 $ perl -v This is perl, v5.10.0 built for i486-linux-gnu-thread-multi $ uname -a Linux desu 2.6.27-1-generic #1 SMP Sat Aug 23 23:20:09 UTC 2008 i686 GNU/Linux
From: ky6uk [...] mail.rb.ru
Срд. Окт. 29 15:24:48 2008, Ky6uk писал: Show quoted text
> ------------- > CGI-3.42
sry, CGI v3.34 I update and test 3.42 today.
On Wed Oct 29 15:24:48 2008, Ky6uk wrote: Show quoted text
> for example: > > use CGI qw(escapeHTML); > my $cgi = CGI->new(); > $cgi->header(-encode => 'utf-8'); > print "ы\n"; > print escapeHTML("ы"); > > result: > ы > �‹
I'm sorry, I'm not very familiar with UTF-8 and don't see what the bug is here. Could you explain it? Mark
From: martijn [...] vandestreek.net
On Wo. Okt. 29 15:24:48 2008, Ky6uk wrote: Show quoted text
> for example: > > use CGI qw(escapeHTML); > my $cgi = CGI->new(); > $cgi->header(-encode => 'utf-8'); > print "ы\n"; > print escapeHTML("ы"); > > result: > ы > �‹
The result piped through "od -t x1", to show which bytes are present: 0000000 d1 8b 0a d1 26 23 38 32 34 39 3b 0a ^^ ^^ UTF-8 representation of ы ^^ Newline ^^ First byte of UTF-8 representation ^^ ^^ ^^ ^^ ^^ ^^ ^^ ‹ ("‹" from cp1251) ^^ newline It seems like the bug is cause by faulty assumptions about the . If you add "use utf8;" to the demo/test script, you'll see that Perl starts to warn about wide characters in output. This means you'll have to choose which one of the following options you think is best: * Encode::encode('utf8', htmlEscape($string)); # hE gets strings * htmlEscape(Encode::encode('utf8', $string)); # hE gets bytes Personally, I think the cleanest solution is to let htmlEscape accept only (character) strings, as encoding should usually be the last thing you do before writing to output. Having htmlEscape accept only character strings also makes it easier to recognise characters that might need to be escaped: if htmlEscape were to accept bytes, it would have to re-(Encode::)decode the string using the preferred encoding (as specified by the "charset=" header part), figure out which characters needed html-escaping, and then re-(Encode::)encode everything into the source encoding. You can read more about unicode and bytes vs character strings in the "perlunicode" manpage.
Show quoted text
> 0000000 d1 8b 0a d1 26 23 38 32 34 39 3b 0a > ^^ ^^ UTF-8 representation of ы > ^^ Newline > ^^ First byte of UTF-8 representation > ^^ ^^ ^^ ^^ ^^ ^^ ^^ ‹ ("‹" from cp1251) > ^^ newline
This doesn't line up correctly in the RT web interface. Follow the "download" link for correct spacing.
Show quoted text
> Personally, I think the cleanest solution is to let htmlEscape accept > only (character) strings, as encoding should usually be the last thing > you do before writing to output.
This attachment illustrates the point. It only triggers the bug in case 3.
use utf8; use CGI qw(escapeHTML); use Encode qw(encode); my $cgi = CGI->new(); $cgi->header(-encode => 'utf-8'); # Just encode the character as UTF-8 print "1", encode('utf8', "ы\n"); # Encode the output of escapeHTML print "2", encode('utf8', escapeHTML("ы")), $/; # Encode the character to UTF-8 bytes, then give it to escapeHTML (breaks) print "3", escapeHTML(encode('utf8', "ы")), $/;
Show quoted text
> Personally, I think the cleanest solution is to let htmlEscape accept > only (character) strings, as encoding should usually be the last thing > you do before writing to output.
That makes sense to me. Is this just something we document or is there some way we should patch the code to enforce this? Mark
From: martijn [...] vandestreek.net
On Za. Aug. 15 16:12:23 2009, MARKSTOS wrote: Show quoted text
>
> > Personally, I think the cleanest solution is to let htmlEscape accept > > only (character) strings, as encoding should usually be the last thing > > you do before writing to output.
> > That makes sense to me. Is this just something we document or is there > some way we should patch the code to enforce this?
I think just documenting it would be enough. As far as I know there's no easy way to tell if a string is a byte string or a character string. You might want to mention these points: * Either "use utf8;" or "use encoding 'your_favorite_encoding';" This will make sure all string literals are character strings * Encode::encode() before writing to output, or set up a filter using binmode $handle, ":encoding(utf8)" to have it done for you automatically (otherwise print might warn about "wide character in output"; this also makes code portable to Perls that use other internal encodings than UTF-8) * Encode::decode() all incoming strings (eg. from CGI::param) so they become character strings. * Refer to perlunicode Martijn
On Sun Aug 16 02:31:05 2009, MartijnVdS wrote: Show quoted text
> On Za. Aug. 15 16:12:23 2009, MARKSTOS wrote:
> >
> > > Personally, I think the cleanest solution is to let htmlEscape accept > > > only (character) strings, as encoding should usually be the last thing > > > you do before writing to output.
> > > > That makes sense to me. Is this just something we document or is there > > some way we should patch the code to enforce this?
> > I think just documenting it would be enough. As far as I know there's no > easy way to tell if a string is a byte string or a character string. > > You might want to mention these points: > > * Either "use utf8;" or "use encoding 'your_favorite_encoding';" > This will make sure all string literals are character strings > * Encode::encode() before writing to output, or set up a filter using > binmode $handle, ":encoding(utf8)" to have it done for you automatically > (otherwise print might warn about "wide character in output"; this also > makes code portable to Perls that use other internal encodings than UTF-8) > * Encode::decode() all incoming strings (eg. from CGI::param) so they > become character strings. > * Refer to perlunicode
Thank you. Would you mind submitting a formal doc patch for this? You'll receive credit in the "Changes" file for your contribution. Just attaching a "diff" to this ticket would be fine. Mark
From: martijn [...] vandestreek.net
Show quoted text
> Thank you. > > Would you mind submitting a formal doc patch for this? You'll receive > credit in the "Changes" file for your contribution. Just attaching a > "diff" to this ticket would be fine.
I've attached the diff adding my documentation. Proper unicode support might require larger changes: the code has lots of special cases for Latin1/CP1252. If everything was handled as character strings (based on the set encoding/charset), things could be so much easier. But I understand that might be too big an API change for a module as old and as widely used as CGI.pm
--- lib/CGI.pm 2009-08-14 15:33:52.000000000 +0200 +++ lib/CGI.pm.hacked 2009-08-17 08:32:45.414488949 +0200 @@ -5776,6 +5776,13 @@ be replaced by their numeric entities, since CGI.pm has no lookup table for all the possible encodings. +escapeHTML expects the supplied string to be a character string. This means you +should Encode::decode data received from "outside" and Encode::encode your +strings before sending them back outside. To upgrade string literals in your +source to character strings, you can use "use encoding" or "use utf8". See +perlunitut and perlunicode for more information on how Perl handles the +difference between bytes and characters. + The automatic escaping does not apply to other shortcuts, such as h1(). You should call escapeHTML() yourself on untrusted data in order to protect your pages against nasty tricks that people may enter
RT-Send-CC: rhesa [...] cpan.org
Thanks for the doc patch, Martijn, it's in my github repo now. Rhesa, could you peer-review this Unicode issue to confirm that you agree with the proposed resolution? Mark On Mon Aug 17 02:33:47 2009, MartijnVdS wrote: Show quoted text
> > Thank you. > > > > Would you mind submitting a formal doc patch for this? You'll receive > > credit in the "Changes" file for your contribution. Just attaching a > > "diff" to this ticket would be fine.
> > I've attached the diff adding my documentation. > > Proper unicode support might require larger changes: the code has lots > of special cases for Latin1/CP1252. If everything was handled as > character strings (based on the set encoding/charset), things could be > so much easier. But I understand that might be too big an API change for > a module as old and as widely used as CGI.pm
I'm by no means an expert on these issues, but I saw that perlunifaq says not to use "use encoding", so I'd take that little bit out. I think the patch is fine otherwise.
On Ma. Aug. 17 09:08:40 2009, RHESA wrote: Show quoted text
> I'm by no means an expert on these issues, but I saw that perlunifaq > says not to use "use encoding", so I'd take that little bit out. I think > the patch is fine otherwise.
Oops, I hadn't seen that. New patch attached.
--- lib/CGI.pm 2009-08-14 15:33:52.000000000 +0200 +++ lib/CGI.pm.hacked 2009-08-17 15:24:10.882310464 +0200 @@ -5776,6 +5776,13 @@ be replaced by their numeric entities, since CGI.pm has no lookup table for all the possible encodings. +escapeHTML expects the supplied string to be a character string. This means you +should Encode::decode data received from "outside" and Encode::encode your +strings before sending them back outside. If your source code UTF-8 encoded and +you want to upgrade string literals in your source to character strings, you +can use "use utf8". See perlunitut, perlunifaq and perlunicode for more +information on how Perl handles the difference between bytes and characters. + The automatic escaping does not apply to other shortcuts, such as h1(). You should call escapeHTML() yourself on untrusted data in order to protect your pages against nasty tricks that people may enter
This patch has been applied in my github repo now. Thanks.
Subject: Thanks, released
The patch for this ticket has now been released in CGI.pm 3.47, and this ticket is considered resolved. Thanks again for you help to improve CGI.pm! Mark