Subject: Apache::Request: support for %uXXXX escape sequence not usable
It seems that Apache::Request tries to understand URL-escaped sequences
of the form %uXXXX (which you get, e.g., from CGI::Ajax when dealing with
unicode characters with codepoints > 255). The problem is that the parsed
values carry no indication of whether their bytes should be interpreted
as iso-8859-1 or utf-8.
So for instance, if the QUERY_STRING looks like
something=%fc%u00fc
then Apache::Request parses it into a mixture of iso-8859-1 and utf-8
bytes (%fc stays the single byte \374, while %u00fc becomes the utf-8
sequence \303\274):
"\374\303\274"
when it should be "\374\374" or "\x{fc}\x{fc}".
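To make the mixture easier to reproduce outside of mod_perl, here is a
small stand-alone C sketch of the behaviour as I understand it. It is not
the actual ap_unescape_url_u source; all sketch_* names are made up for
this illustration, and I am assuming that %XX is copied as a raw byte
while %uXXXX is written out as utf-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse exactly n hex digits, return value or -1. */
static long sketch_hexn(const char *s, int n)
{
    long v = 0;
    for (int i = 0; i < n; i++) {
        char c = s[i];
        int d;
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;               /* also stops at the terminating NUL */
        v = v * 16 + d;
    }
    return v;
}

/* Write code point cp (at most 0xFFFF) as utf-8; returns bytes written.
 * Surrogates are not treated specially in this sketch. */
static size_t sketch_put_utf8(unsigned char *out, unsigned long cp)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Assumed behaviour: %XX -> one raw byte, %uXXXX -> utf-8 bytes.
 * out must have room for strlen(in) + 1 bytes. Returns decoded length. */
size_t sketch_unescape_mixed(const char *in, unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in) {
        long v;
        if (in[0] == '%' && (in[1] == 'u' || in[1] == 'U')
            && (v = sketch_hexn(in + 2, 4)) >= 0) {
            n += sketch_put_utf8(out + n, (unsigned long)v);
            in += 6;
        } else if (in[0] == '%' && (v = sketch_hexn(in + 1, 2)) >= 0) {
            out[n++] = (unsigned char)v;   /* raw byte, no encoding applied */
            in += 3;
        } else {
            out[n++] = (unsigned char)*in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Calling sketch_unescape_mixed("%fc%u00fc", buf) yields the three bytes
0xFC 0xC3 0xBC, i.e. the same "\374\303\274" as above.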
I don't know of a good way to solve this. The unescaping is
done in ap_unescape_url_u in apache_request.c, a part of the code
which only deals with C strings and not Perl SVs (in the latter case
we could set the SvUTF8 flag if needed). A possible solution might
be a global flag which indicates which encoding to use in this
function. If this global flag is set to "iso-8859-1", then a warning
would be issued whenever %uXXXX with XXXX > 0xFF occurs (since this
cannot be represented in iso-8859-1). If the global flag is set to
"utf-8", then the escape sequences %80 .. %ff would be converted into
utf-8 sequences. When converting the values to Perl SVs, the
data should then be flagged properly as utf-8. To be backward compatible,
the flag should probably be set to "iso-8859-1" by default.
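A rough sketch of how such a flag could work, again as stand-alone C and
not as a patch against apache_request.c. Everything named enc_* is
hypothetical, and substituting '?' for unrepresentable characters in
iso-8859-1 mode is just my assumption for the example; a real
implementation would have to decide what to do besides warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical global flag; nothing like this exists in libapreq today. */
enum enc_mode { ENC_LATIN1, ENC_UTF8 };
enum enc_mode enc_global_mode = ENC_LATIN1;  /* backward-compatible default */

/* Parse exactly n hex digits, return value or -1. */
static long enc_hexn(const char *s, int n)
{
    long v = 0;
    for (int i = 0; i < n; i++) {
        char c = s[i];
        int d;
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        v = v * 16 + d;
    }
    return v;
}

/* Emit one decoded code point according to the global flag. */
static size_t enc_put(unsigned char *out, unsigned long cp, int *warnings)
{
    if (enc_global_mode == ENC_LATIN1) {
        if (cp > 0xFF) {          /* not representable in iso-8859-1 */
            (*warnings)++;
            out[0] = '?';         /* assumed placeholder */
            return 1;
        }
        out[0] = (unsigned char)cp;
        return 1;
    }
    /* ENC_UTF8: %80 .. %ff and %uXXXX both become utf-8 sequences */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* out must have room for strlen(in) + 1 bytes. Returns decoded length;
 * *warnings is incremented once per unrepresentable %uXXXX. */
size_t enc_unescape(const char *in, unsigned char *out, int *warnings)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && warnings != NULL);
    while (*in) {
        long v;
        if (in[0] == '%' && (in[1] == 'u' || in[1] == 'U')
            && (v = enc_hexn(in + 2, 4)) >= 0) {
            n += enc_put(out + n, (unsigned long)v, warnings);
            in += 6;
        } else if (in[0] == '%' && (v = enc_hexn(in + 1, 2)) >= 0) {
            n += enc_put(out + n, (unsigned long)v, warnings);
            in += 3;
        } else {
            out[n++] = (unsigned char)*in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

With the default flag, "%fc%u00fc" decodes to "\374\374"; with
enc_global_mode set to ENC_UTF8 it decodes to "\303\274\303\274", and
the Perl glue code could then call SvUTF8_on on the resulting SV.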
Regards,
Slaven