Bug #40502 for CGI: Needs Doc Patch: escapeHTML unicode handling

Wed Oct 29 15:24:48 2008 ky6uk [...] mail.rb.ru - Ticket created

Subject:

escapeHTML unicode bug

for example:

use CGI qw(escapeHTML);
my $cgi = CGI->new();
$cgi->header(-encode => 'utf-8');
print "ы\n";
print escapeHTML("ы");

result:
ы
�‹

-------------
CGI-3.42

$ perl -v
This is perl, v5.10.0 built for i486-linux-gnu-thread-multi

$ uname -a
Linux desu 2.6.27-1-generic #1 SMP Sat Aug 23 23:20:09 UTC 2008 i686 GNU/Linux

Thu Oct 30 06:20:48 2008 ky6uk [...] mail.rb.ru - Correspondence added

From:

ky6uk [...] mail.rb.ru

On Wed Oct 29 15:24:48 2008, Ky6uk wrote:

> -------------
> CGI-3.42

Sorry, that was CGI v3.34. I will update and test with 3.42 today.

Thu Oct 30 06:20:49 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Jul 29 23:18:19 2009 MARKSTOS [...] cpan.org - Taken

Wed Jul 29 23:19:10 2009 MARKSTOS [...] cpan.org - Correspondence added

On Wed Oct 29 15:24:48 2008, Ky6uk wrote:

> for example:
>
> use CGI qw(escapeHTML);
> my $cgi = CGI->new();
> $cgi->header(-encode => 'utf-8');
> print "ы\n";
> print escapeHTML("ы");
>
> result:
> ы
> �‹

I'm sorry, I'm not very familiar with UTF-8 and don't see what the bug is here. Could you explain it?

Mark

Wed Jul 29 23:19:49 2009 MARKSTOS [...] cpan.org - Subject changed from 'escapeHTML unicode bug' to 'Needs explanation: escapeHTML unicode bug'

Sat Aug 15 13:43:55 2009 Martijn van de Streek - Correspondence added

From:

martijn [...] vandestreek.net

On Wed Oct 29 15:24:48 2008, Ky6uk wrote:

> for example:
>
> use CGI qw(escapeHTML);
> my $cgi = CGI->new();
> $cgi->header(-encode => 'utf-8');
> print "ы\n";
> print escapeHTML("ы");
>
> result:
> ы
> �‹

The result piped through "od -t x1", to show which bytes are present:

0000000 d1 8b 0a d1 26 23 38 32 34 39 3b 0a
        ^^ ^^                                UTF-8 representation of ы
              ^^                             Newline
                 ^^                          First byte of UTF-8 representation
                    ^^ ^^ ^^ ^^ ^^ ^^ ^^     &#8249; ("‹" from cp1251)
                                         ^^  newline

It seems like the bug is caused by faulty assumptions about the . If you add "use utf8;" to the demo/test script, you'll see that Perl starts to warn about wide characters in output.

This means you'll have to choose which one of the following options you think is best:

* Encode::encode('utf8', escapeHTML($string)); # escapeHTML gets character strings
* escapeHTML(Encode::encode('utf8', $string)); # escapeHTML gets bytes

Personally, I think the cleanest solution is to let escapeHTML accept only (character) strings, as encoding should usually be the last thing you do before writing to output.

Having escapeHTML accept only character strings also makes it easier to recognise characters that might need to be escaped: if escapeHTML were to accept bytes, it would have to re-(Encode::)decode the string using the preferred encoding (as specified by the "charset=" header part), figure out which characters needed html-escaping, and then re-(Encode::)encode everything into the source encoding.

You can read more about unicode and bytes vs character strings in the "perlunicode" manpage.
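A minimal sketch of the effect described above, assuming the Latin-1/CP1252 handling boils down to rewriting code point 0x8B as the entity &#8249;. The escape_like_latin1 and hex_bytes helpers are hypothetical stand-ins written for this illustration; they are not CGI.pm's actual code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the Latin-1/CP1252 special case:
# code point 0x8B is rewritten as the numeric entity for U+2039.
sub escape_like_latin1 {
    my ($text) = @_;
    $text =~ s/\x8b/&#8249;/g;
    return $text;
}

# Helper for this example only: show a byte string as hex, like "od -t x1".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

my $chars = "\x{44b}";                  # one character: CYRILLIC SMALL LETTER YERU
my $bytes = encode('UTF-8', $chars);    # two bytes: 0xD1 0x8B

# Character string in, encode last: the UTF-8 sequence survives intact.
my $good = encode('UTF-8', escape_like_latin1($chars));

# Bytes in: the second byte (0x8B) is mistaken for a CP1252 character.
my $bad = escape_like_latin1($bytes);

print hex_bytes($good), "\n";   # d1 8b
print hex_bytes($bad),  "\n";   # d1 26 23 38 32 34 39 3b
```

The second line of output matches the byte sequence in the od dump: the lone d1 followed by the ASCII text of the entity.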

Sat Aug 15 13:45:14 2009 Martijn van de Streek - Correspondence added

> 0000000 d1 8b 0a d1 26 23 38 32 34 39 3b 0a
>         ^^ ^^                                UTF-8 representation of ы
>               ^^                             Newline
>                  ^^                          First byte of UTF-8 representation
>                     ^^ ^^ ^^ ^^ ^^ ^^ ^^     &#8249; ("‹" from cp1251)
>                                          ^^  newline

This doesn't line up correctly in the RT web interface. Follow the "download" link for correct spacing.

Sat Aug 15 14:02:46 2009 Martijn van de Streek - Correspondence added

> Personally, I think the cleanest solution is to let escapeHTML accept
> only (character) strings, as encoding should usually be the last thing
> you do before writing to output.

This attachment illustrates the point. It only triggers the bug in case 3.

use utf8;
use CGI qw(escapeHTML);
use Encode qw(encode);

my $cgi = CGI->new();
$cgi->header(-encode => 'utf-8');

# Just encode the character as UTF-8
print "1", encode('utf8', "ы\n");

# Encode the output of escapeHTML
print "2", encode('utf8', escapeHTML("ы")), $/;

# Encode the character to UTF-8 bytes, then give it to escapeHTML (breaks)
print "3", escapeHTML(encode('utf8', "ы")), $/;

Sat Aug 15 16:12:23 2009 MARKSTOS [...] cpan.org - Correspondence added

> Personally, I think the cleanest solution is to let escapeHTML accept
> only (character) strings, as encoding should usually be the last thing
> you do before writing to output.

That makes sense to me. Is this just something we document or is there some way we should patch the code to enforce this?

Mark

Sun Aug 16 02:31:05 2009 Martijn van de Streek - Correspondence added

From:

martijn [...] vandestreek.net

On Sat Aug 15 16:12:23 2009, MARKSTOS wrote:

> > Personally, I think the cleanest solution is to let escapeHTML accept
> > only (character) strings, as encoding should usually be the last thing
> > you do before writing to output.
>
> That makes sense to me. Is this just something we document or is there
> some way we should patch the code to enforce this?

I think just documenting it would be enough. As far as I know there's no easy way to tell if a string is a byte string or a character string.

You might want to mention these points:

* Either "use utf8;" or "use encoding 'your_favorite_encoding';". This will make sure all string literals are character strings.
* Encode::encode() before writing to output, or set up a filter using binmode $handle, ":encoding(utf8)" to have it done for you automatically (otherwise print might warn about "wide character in output"; this also makes code portable to Perls that use other internal encodings than UTF-8).
* Encode::decode() all incoming strings (e.g. from CGI::param) so they become character strings.
* Refer to perlunicode.

Martijn
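The points above can be sketched as a decode, escape, encode pipeline. This is only an illustration under stated assumptions: escape_html is a hypothetical stand-in for an escaper that expects character strings, $incoming plays the role of raw bytes from a form parameter, and an in-memory filehandle is used instead of STDOUT so the result can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for an HTML escaper that works on character strings.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Bytes as they might arrive from outside: "<", U+044B in UTF-8, ">".
my $incoming = "<\xD1\x8B>";

# 1. Decode incoming bytes into a character string (3 characters).
my $chars = decode('UTF-8', $incoming);

# 2. Do all text processing on characters.
my $escaped = escape_html($chars);

# 3. Encode as the last step; the :encoding layer does it on every print.
my $output = '';
open my $out, '>:encoding(UTF-8)', \$output
    or die "cannot open in-memory handle: $!";
print {$out} $escaped;
close $out;
```

In a real script the third step would more likely be binmode STDOUT, ':encoding(UTF-8)' followed by plain print calls.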

Sun Aug 16 02:32:37 2009 Martijn van de Streek - Cc MartijnVdS added

Sun Aug 16 18:44:07 2009 MARKSTOS [...] cpan.org - Subject changed from 'Needs explanation: escapeHTML unicode bug' to 'Needs Doc Patch: escapeHTML unicode handling'

Sun Aug 16 18:45:43 2009 MARKSTOS [...] cpan.org - Correspondence added

On Sun Aug 16 02:31:05 2009, MartijnVdS wrote:

> On Sat Aug 15 16:12:23 2009, MARKSTOS wrote:
>
> > > Personally, I think the cleanest solution is to let escapeHTML accept
> > > only (character) strings, as encoding should usually be the last thing
> > > you do before writing to output.
> >
> > That makes sense to me. Is this just something we document or is there
> > some way we should patch the code to enforce this?
>
> I think just documenting it would be enough. As far as I know there's no
> easy way to tell if a string is a byte string or a character string.
>
> You might want to mention these points:
>
> * Either "use utf8;" or "use encoding 'your_favorite_encoding';"
>   This will make sure all string literals are character strings
> * Encode::encode() before writing to output, or set up a filter using
>   binmode $handle, ":encoding(utf8)" to have it done for you automatically
>   (otherwise print might warn about "wide character in output"; this also
>   makes code portable to Perls that use other internal encodings than UTF-8)
> * Encode::decode() all incoming strings (e.g. from CGI::param) so they
>   become character strings.
> * Refer to perlunicode

Thank you.

Would you mind submitting a formal doc patch for this? You'll receive credit in the "Changes" file for your contribution. Just attaching a "diff" to this ticket would be fine.

Mark

Mon Aug 17 02:33:47 2009 Martijn van de Streek - Correspondence added

From:

martijn [...] vandestreek.net

> Thank you.
>
> Would you mind submitting a formal doc patch for this? You'll receive
> credit in the "Changes" file for your contribution. Just attaching a
> "diff" to this ticket would be fine.

I've attached the diff adding my documentation.

Proper unicode support might require larger changes: the code has lots of special cases for Latin1/CP1252. If everything was handled as character strings (based on the set encoding/charset), things could be so much easier. But I understand that might be too big an API change for a module as old and as widely used as CGI.pm.

--- lib/CGI.pm 2009-08-14 15:33:52.000000000 +0200
+++ lib/CGI.pm.hacked 2009-08-17 08:32:45.414488949 +0200
@@ -5776,6 +5776,13 @@
 be replaced by their numeric entities, since CGI.pm has no lookup table
 for all the possible encodings.
 
+escapeHTML expects the supplied string to be a character string. This means you
+should Encode::decode data received from "outside" and Encode::encode your
+strings before sending them back outside. To upgrade string literals in your
+source to character strings, you can use "use encoding" or "use utf8". See
+perlunitut and perlunicode for more information on how Perl handles the
+difference between bytes and characters.
+
 The automatic escaping does not apply to other shortcuts, such as h1().
 You should call escapeHTML() yourself on untrusted data in order to
 protect your pages against nasty tricks that people may enter

Mon Aug 17 07:56:03 2009 MARKSTOS [...] cpan.org - Correspondence added

RT-Send-CC:

rhesa [...] cpan.org

Thanks for the doc patch, Martijn, it's in my github repo now.

Rhesa, could you peer-review this Unicode issue to confirm that you agree with the proposed resolution?

Mark

On Mon Aug 17 02:33:47 2009, MartijnVdS wrote:

> > Thank you.
> >
> > Would you mind submitting a formal doc patch for this? You'll receive
> > credit in the "Changes" file for your contribution. Just attaching a
> > "diff" to this ticket would be fine.
>
> I've attached the diff adding my documentation.
>
> Proper unicode support might require larger changes: the code has lots
> of special cases for Latin1/CP1252. If everything was handled as
> character strings (based on the set encoding/charset), things could be
> so much easier. But I understand that might be too big an API change for
> a module as old and as widely used as CGI.pm

Mon Aug 17 09:08:40 2009 rhesa [...] cpan.org - Correspondence added

I'm by no means an expert on these issues, but I saw that perlunifaq says not to use "use encoding", so I'd take that little bit out. I think the patch is fine otherwise.

Mon Aug 17 09:24:34 2009 Martijn van de Streek - Correspondence added

On Mon Aug 17 09:08:40 2009, RHESA wrote:

> I'm by no means an expert on these issues, but I saw that perlunifaq
> says not to use "use encoding", so I'd take that little bit out. I think
> the patch is fine otherwise.

Oops, I hadn't seen that. New patch attached.

--- lib/CGI.pm 2009-08-14 15:33:52.000000000 +0200
+++ lib/CGI.pm.hacked 2009-08-17 15:24:10.882310464 +0200
@@ -5776,6 +5776,13 @@
 be replaced by their numeric entities, since CGI.pm has no lookup table
 for all the possible encodings.
 
+escapeHTML expects the supplied string to be a character string. This means you
+should Encode::decode data received from "outside" and Encode::encode your
+strings before sending them back outside. If your source code UTF-8 encoded and
+you want to upgrade string literals in your source to character strings, you
+can use "use utf8". See perlunitut, perlunifaq and perlunicode for more
+information on how Perl handles the difference between bytes and characters.
+
 The automatic escaping does not apply to other shortcuts, such as h1().
 You should call escapeHTML() yourself on untrusted data in order to
 protect your pages against nasty tricks that people may enter
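To illustrate what the patch text means by upgrading string literals with "use utf8", here is a small sketch. It assumes the script file itself is saved as UTF-8; the variable names are made up for this example.

```perl
use strict;
use warnings;

my ($as_chars, $as_bytes);
{
    use utf8;    # the literal is decoded from the source: one character
    $as_chars = length("ы");
}
{
    no utf8;     # the literal stays as the raw UTF-8 bytes of the source: two bytes
    $as_bytes = length("ы");
}

print "with use utf8: $as_chars, without: $as_bytes\n";
```

Under that assumption the first length is 1 and the second is 2, which is the difference between a character string and a byte string that escapeHTML cares about.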

Mon Aug 17 20:32:45 2009 MARKSTOS [...] cpan.org - Correspondence added

This patch has been applied in my github repo now. Thanks.

Mon Aug 17 20:32:46 2009 MARKSTOS [...] cpan.org - Status changed from 'open' to 'patched'

Wed Sep 09 22:07:14 2009 MARKSTOS [...] cpan.org - Correspondence added

Subject:

Thanks, released

The patch for this ticket has now been released in CGI.pm 3.47, and this ticket is considered resolved. Thanks again for your help to improve CGI.pm!

Mark

Wed Sep 09 22:07:14 2009 The RT System itself - Status changed from 'patched' to 'open'

Wed Sep 09 22:07:17 2009 MARKSTOS [...] cpan.org - Status changed from 'open' to 'resolved'

Fri May 23 14:29:27 2014 The RT System itself - Queue changed from CGI.pm to CGI
