Bug #24724 for libapreq: Apache::Request: support for %uXXXX escape sequence not usable

Subject:

Apache::Request: support for %uXXXX escape sequence not usable

It seems that Apache::Request tries to understand URL-escaped sequences of the form %uXXXX (which you get, for example, from CGI::Ajax when dealing with Unicode characters with code points > 255). The problem is that nothing in the parsed values indicates whether the bytes should be interpreted as iso-8859-1 or utf-8. For instance, if the QUERY_STRING looks like something=%fc%u00fc, then Apache::Request parses it into a mixture of iso-8859-1 and utf-8 bytes: "\374\303\274", when it should be either "\374\374" or "\x{fc}\x{fc}".

I don't know of a good way to solve this. The unescaping is done in ap_unescape_url_u in apache_request.c, a part of the code which deals only with C strings and not Perl SVs (in the latter case we could set the SvUTF8 flag if needed).

A possible solution might be a global flag which indicates which encoding to use in this function. If the flag is set to "iso-8859-1", a warning would be issued whenever %uXXXX with XXXX > 0xFF occurs (since this cannot be represented in iso-8859-1). If the flag is set to "utf-8", the escape sequences %80 .. %ff would be converted into utf-8 sequences. When converting the values to Perl SVs, the data should then be flagged properly as utf-8. To be backward compatible, the flag should probably be set to "iso-8859-1" by default.

Regards, Slaven
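To make the flag idea concrete, here is a rough sketch of a decoder that emits one consistent encoding for both %XX and %uXXXX. It is not the existing ap_unescape_url_u code: the names target_enc, ENC_LATIN1, ENC_UTF8 and unescape_sketch are hypothetical, the flag is passed as a parameter rather than stored globally, the "warning" is only a counter, and a '?' is substituted for unrepresentable characters as an arbitrary choice. '+' handling and surrogate values in %uXXXX are left out.

```c
#include <stddef.h>

/* Hypothetical names throughout; none of this is part of libapreq. */
typedef enum { ENC_LATIN1, ENC_UTF8 } target_enc;

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse exactly n hex digits at s; return the value, or -1 if malformed.
 * Stops at the first non-hex character, so it never reads past the NUL. */
static long parse_hex(const char *s, int n)
{
    long v = 0;
    for (int i = 0; i < n; i++) {
        int d = hexval(s[i]);
        if (d < 0) return -1;
        v = v * 16 + d;
    }
    return v;
}

/* Write code point cp (0 .. 0xFFFF) to out in the chosen encoding.
 * Returns the number of bytes written (1 to 3). */
static size_t emit(unsigned char *out, long cp, target_enc enc, int *warnings)
{
    if (enc == ENC_LATIN1) {
        if (cp > 0xFF) {            /* not representable in iso-8859-1 */
            (*warnings)++;
            out[0] = '?';           /* assumed substitution character */
        } else {
            out[0] = (unsigned char)cp;
        }
        return 1;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Decode in into out and return the number of bytes written (excluding
 * the terminating NUL). The output is never longer than the input, so a
 * buffer of strlen(in) + 1 bytes is sufficient. Malformed escapes and
 * unescaped bytes are copied through unchanged. */
size_t unescape_sketch(const char *in, char *out, target_enc enc, int *warnings)
{
    unsigned char *o = (unsigned char *)out;
    *warnings = 0;
    while (*in) {
        long cp = -1;
        if (in[0] == '%' && (in[1] == 'u' || in[1] == 'U')
            && (cp = parse_hex(in + 2, 4)) >= 0) {
            in += 6;                /* consumed %uXXXX */
        } else if (in[0] == '%' && (cp = parse_hex(in + 1, 2)) >= 0) {
            in += 3;                /* consumed %XX */
        }
        if (cp >= 0)
            o += emit(o, cp, enc, warnings);
        else
            *o++ = (unsigned char)*in++;
    }
    *o = '\0';
    return (size_t)(o - (unsigned char *)out);
}
```

With this sketch, something=%fc%u00fc decodes to the bytes "\374\374" after the '=' under ENC_LATIN1 and to "\303\274\303\274" under ENC_UTF8, so the result is never a mixture. In the utf-8 case the Perl glue would still have to turn on the UTF-8 flag when building the SV.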