On Wed Oct 29 15:24:48 2008, Ky6uk wrote:
> for example:
>
> use CGI qw(escapeHTML);
> my $cgi = CGI->new();
> $cgi->header(-encode => 'utf-8');
> print "ы\n";
> print escapeHTML("ы");
>
> result:
> ы
> �‹
The result piped through "od -t x1", to show which bytes are present:
0000000 d1 8b 0a d1 26 23 38 32 34 39 3b 0a
        ^^ ^^  UTF-8 representation of ы
              ^^  newline
                 ^^  first byte of the UTF-8 representation of ы
                    ^^ ^^ ^^ ^^ ^^ ^^ ^^  &#8249; ("‹" from cp1251)
                                         ^^  newline
It seems like the bug is caused by faulty assumptions about the encoding
of the string. If you add "use utf8;" to the demo/test script, you'll
see that Perl starts to warn about wide characters in output.
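To make the difference concrete, here is a minimal sketch (not taken
from the original script) that spells the literal out with escapes, so
it does not depend on the encoding of the source file. The variable
names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Byte string: the two UTF-8 bytes of "ы", which is how Perl sees the
# literal when "use utf8;" is absent.
my $bytes = "\xd1\x8b";

# Character string: the single code point U+044B, which is how Perl
# sees the literal when "use utf8;" is present.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;                      # 2
printf "chars: length %d, U+%04X\n", length $chars, ord $chars;  # 1, U+044B

# Printing $chars to a handle without an encoding layer is what
# triggers the "Wide character in print" warning. Encoding it first
# yields plain bytes again.
print encode('UTF-8', $chars), "\n";
```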
This means you'll have to choose which one of the following options you
think is best:
* Encode::encode('utf8', escapeHTML($string)); # escapeHTML gets character strings
* escapeHTML(Encode::encode('utf8', $string)); # escapeHTML gets bytes
Personally, I think the cleanest solution is to let escapeHTML accept
only (character) strings, as encoding should usually be the last thing
you do before writing to output.
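A rough sketch of that order of operations, assuming an escaping
routine that only ever sees character strings. escape_html_chars is a
hypothetical stand-in written for this example; it is not CGI.pm's
implementation and only handles the four basic characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for an escaping routine that accepts only
# character strings.
sub escape_html_chars {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

my $string  = "\x{44b} < \x{44f}";         # characters, not bytes
my $escaped = escape_html_chars($string);  # still characters
my $output  = encode('UTF-8', $escaped);   # bytes, produced last

print $output, "\n";
```

Because the routine works on characters, it never has to guess whether
a byte such as 0x8b is half of a multi-byte sequence or a character in
its own right.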
Having escapeHTML accept only character strings also makes it easier to
recognise characters that might need to be escaped: if escapeHTML were
to accept bytes, it would first have to Encode::decode the string using
the preferred encoding (as specified by the "charset=" header part),
figure out which characters need HTML-escaping, and then Encode::encode
everything back into that same encoding.
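For comparison, a byte-accepting variant would look roughly like the
sketch below. Both function names are hypothetical, and the charset is
passed in explicitly here as an assumption; a real implementation would
have to obtain it from the "charset=" setting.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical character-level escaping routine (basic characters only).
sub escape_html_chars {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical byte-level wrapper: decode, escape, re-encode.
sub escape_html_bytes {
    my ($bytes, $charset) = @_;
    my $chars = decode($charset, $bytes);  # bytes -> characters
    $chars = escape_html_chars($chars);    # escape on characters
    return encode($charset, $chars);       # characters -> bytes
}

print escape_html_bytes("\xd1\x8b & \xd1\x8f", 'UTF-8'), "\n";
```

The two extra conversions on every call are the overhead that a
characters-only interface avoids.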
You can read more about Unicode and byte strings vs character strings
in the "perlunicode" manpage.