The URI->as_iri method seems to return either a character string or a byte sequence, depending on the punycode URI it is given. This makes the output difficult to combine sensibly with other strings. It seems to me that the difference depends on whether the decoded punycode value contains only code points that can be represented in Latin-1. The attached test script decodes two punycode URIs, whose IRI forms are:
http://www.hestebedgård.dk/
http://✪df.ws/
Using Devel::Peek, it can be seen that "hestebedgård" is represented as a byte sequence with U+00e5 being represented as the single byte 0xE5 with the SV lacking the UTF8 flag. On the other hand, "✪df" is represented as a UTF8-flagged character string with the first character correctly encoded as \x{272a}.
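The same difference can be reproduced without URI or Devel::Peek. The following is only a minimal sketch using core Perl; it assumes the decoder builds its result by joining chr() of each code point (as the line touched by the patch does), and has_utf8_flag is a hypothetical helper name, not part of URI:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a scalar carries the internal UTF8 flag.
sub has_utf8_flag {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 1 : 0;
}

# Build two strings the same way: chr() on each code point, then join.
# "g", U+00E5, "r", "d": every code point is below 0x100.
my $latin1 = join '', map { chr } (0x67, 0xe5, 0x72, 0x64);
# U+272A, "d", "f": the first code point is above 0xFF.
my $wide = join '', map { chr } (0x272a, 0x64, 0x66);

printf "latin1-range: flag=%d length=%d\n", has_utf8_flag($latin1), length $latin1;
printf "wide:         flag=%d length=%d\n", has_utf8_flag($wide),   length $wide;
```

Both strings have the expected length in characters; only the internal flag differs.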
I believe the attached patch solves this problem, but I'm not sure if it might break any other cases, or if there's a better way of forcing the decoded unicode string to have the UTF8 flag.
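One possible alternative to the encode/decode round trip is the core utf8::upgrade function, which switches a scalar to the internal UTF-8 representation without changing its characters. This is only a sketch under that assumption, not a tested replacement for the patch; codepoints_to_string is a hypothetical stand-in for the tail of decode_punycode:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the end of decode_punycode: turn a list of
# code points into a string and force the UTF8 flag on the result.
sub codepoints_to_string {
    my @codepoints = @_;
    my $str = join '', map { chr } @codepoints;
    utf8::upgrade($str);    # changes internal storage only; characters are unchanged
    return $str;
}

my $plain    = join '', map { chr } (0x67, 0xe5, 0x72, 0x64);
my $upgraded = codepoints_to_string(0x67, 0xe5, 0x72, 0x64);

printf "plain:    flag=%d\n", utf8::is_utf8($plain)    ? 1 : 0;
printf "upgraded: flag=%d\n", utf8::is_utf8($upgraded) ? 1 : 0;
print "strings compare equal\n" if $plain eq $upgraded;
```

Because utf8::upgrade does not alter the string's characters, the upgraded value still compares equal to the original; whether it is safe for every caller of as_iri is the same open question as for the patch.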
Subject: | iri_encoding.diff |
--- URI/_punycode.pm.orig 2013-06-09 10:14:14.000000000 +0400
+++ URI/_punycode.pm.new 2013-06-11 11:49:19.000000000 +0400
@@ -86,7 +86,11 @@
warn join " ", map sprintf('%04x', $_), @output if $DEBUG;
$i++;
}
- return join '', map chr, @output;
+ my $uri = join '', map chr, @output;
+ use Encode;
+ my $octets = encode('UTF-8', $uri, Encode::FB_CROAK);
+ $uri = decode('UTF-8', $octets, Encode::FB_CROAK);
+ return $uri;
}
sub encode_punycode {
Subject: | iri_encoding.pl |
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek;
use URI;
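# Decodes to "hestebedgård": every code point fits in Latin-1.
my $latin1 = URI->new('http://www.xn--hestebedgrd-58a.dk/')->as_iri;
# Decodes to "✪df": the first code point (U+272A) is outside Latin-1.
my $utf8 = URI->new('http://xn--df-oiy.ws/')->as_iri;
# Dump the internal representation of each result to compare the UTF8 flag.
Dump($latin1);
Dump($utf8);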