Bug #86064 for URI: URI->as_iri can produce both bytes and characters, depending on input

Tue Jun 11 16:16:19 2013 gwilliams [...] cpan.org - Ticket created

The URI->as_iri method seems to produce both character strings and byte sequences depending on the input of punycode URIs. This makes dealing with the output difficult when trying to sensibly combine it with other strings. It seems to me that the difference depends on whether the decoded punycode value only contains codepoints that can be represented in latin-1.

The attached test script shows the decoding of two punycode URIs:

http://www.hestebedgård.dk/
http://✪df.ws/

Using Devel::Peek, it can be seen that "hestebedgård" is represented as a byte sequence with U+00e5 being represented as the single byte 0xE5 with the SV lacking the UTF8 flag. On the other hand, "✪df" is represented as a UTF8-flagged character string with the first character correctly encoded as \x{272a}.

I believe the attached patch solves this problem, but I'm not sure if it might break any other cases, or if there's a better way of forcing the decoded unicode string to have the UTF8 flag.

Subject: iri_encoding.diff

--- URI/_punycode.pm.orig 2013-06-09 10:14:14.000000000 +0400
+++ URI/_punycode.pm.new 2013-06-11 11:49:19.000000000 +0400
@@ -86,7 +86,11 @@
         warn join " ", map sprintf('%04x', $_), @output if $DEBUG;
         $i++;
     }
-    return join '', map chr, @output;
+    my $uri = join '', map chr, @output;
+    use Encode;
+    my $octets = encode('UTF-8', $uri, Encode::FB_CROAK);
+    $uri = decode('UTF-8', $octets, Encode::FB_CROAK);
+    return $uri;
 }
 
 sub encode_punycode {

Subject: iri_encoding.pl

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek;
use URI;

my $latin1 = URI->new('http://www.xn--hestebedgrd-58a.dk/')->as_iri;
my $utf8 = URI->new('http://xn--df-oiy.ws/')->as_iri;
Dump($latin1);
Dump($utf8);
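The difference described in the report can be sketched without URI or Devel::Peek. The unpatched decoder builds its result with join '', map chr, @output, and chr only yields a UTF8-flagged string for code points above 0xFF. The code point lists below are hand-picked stand-ins for decoder output (an assumption for illustration), not values taken from URI::_punycode.

```perl
use strict;
use warnings;

# Hypothetical decoder output: code points for "gård" and "✪df".
my @latin1_points = (0x67, 0xe5, 0x72, 0x64);
my @wide_points   = (0x272a, 0x64, 0x66);

# Same construction as the unpatched return statement in the diff.
my $latin1 = join '', map chr, @latin1_points;
my $wide   = join '', map chr, @wide_points;

# utf8::is_utf8 reports the internal UTF8 flag that Devel::Peek displays.
my $latin1_flag = utf8::is_utf8($latin1) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

print "latin1 flag: $latin1_flag, wide flag: $wide_flag\n";
```

Both results are character strings of the expected length; only the internal representation differs.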

Tue Jun 11 16:20:09 2013 gwilliams [...] cpan.org - Subject changed from (no value) to 'URI->as_iri can produce both bytes and characters, depending on input'

Wed Jun 12 15:25:30 2013 GAAS [...] cpan.org - Correspondence added

On Tue Jun 11 16:16:19 2013, GWILLIAMS wrote:

> The URI->as_iri method seems to produce both character strings and
> byte sequences depending on the input of punycode URIs. This makes
> dealing with the output difficult when trying to sensibly combine
> it with other strings.

There should not really be a semantic difference between utf8::upgraded or utf8::downgraded strings. If you have problems combining the result with other strings there is something else that's not quite right. The simplest way to upgrade is to just call:

utf8::upgrade($iri);

I don't really think $url->as_iri should change. At least I would like to see a stronger argument before we do.
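A small sketch of the point about semantics, using only core Perl: the same string in its downgraded and upgraded forms compares equal and has the same length, and only the internal flag differs. The variable names are illustrative and do not come from the URI module.

```perl
use strict;
use warnings;

# Two copies of the same character string, "hestebedgård".
my $downgraded = "hestebedg\x{e5}rd";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);    # changes the internal representation in place

# At the Perl level the two strings behave identically.
my $same_chars  = ($downgraded eq $upgraded) ? 1 : 0;
my $same_length = (length($downgraded) == length($upgraded)) ? 1 : 0;

# Only the internal UTF8 flag differs.
my $flag_down = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_up   = utf8::is_utf8($upgraded)   ? 1 : 0;

print "equal: $same_chars, same length: $same_length\n";
print "UTF8 flag: downgraded=$flag_down upgraded=$flag_up\n";
```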

Wed Jun 12 15:25:30 2013 The RT System itself - Status changed from 'new' to 'open'

Sat Jun 15 09:30:57 2013 gwilliams [...] cpan.org - Correspondence added

On Wed Jun 12 15:25:30 2013, GAAS wrote:

> There should not really be a semantic difference between
> utf8::upgraded or utf8::downgraded strings. If you have problems
> combining the result with other strings there is something else
> that's not quite right. The simplest way to upgrade is to just call:
>
> utf8::upgrade($iri);
>
> I don't really think $url->as_iri should change. At least I would
> like to see a stronger argument before we do.

That's a fair point. The problem may be more complex than I thought. I believe the problem I'm facing now (related to a bug-report I received for RDF::Trine) is that the string ends up being passed to a system library via XS that expects UTF8 encoded data, and has trouble with the latin-1.

Moreover, the punycode spec as well as the documentation for as_iri talk explicitly about unicode strings, so I'm not sure why the appropriate place to make the utf8::upgrade call wouldn't be in the as_iri implementation.

Thoughts?

thanks,
.greg
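When the consumer is an XS library that expects UTF-8 octets, one option is to encode explicitly at the boundary instead of relying on the string's internal representation. This is a sketch under that assumption; it is not confirmed anywhere in this ticket as the fix used for RDF::Trine, and $iri below is a literal stand-in for the value returned by as_iri.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for $uri->as_iri output; stored downgraded (no UTF8 flag).
my $iri = "http://www.hestebedg\x{e5}rd.dk/";

# Explicit UTF-8 octets, suitable for handing to a byte-oriented API.
my $octets = encode('UTF-8', $iri);

# The same characters in upgraded form encode to the same octets.
my $upgraded = $iri;
utf8::upgrade($upgraded);
my $octets_from_upgraded = encode('UTF-8', $upgraded);

my $identical = ($octets eq $octets_from_upgraded) ? 1 : 0;
print "octets identical: $identical\n";
```

Encoding this way gives the same bytes whether or not the input string carries the UTF8 flag, so the caller does not need to know which form as_iri returned.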

Wed Dec 24 18:52:15 2014 dr [...] jones.dk - Correspondence added

Subject:	[rt.cpan.org #86064] utf8::upgraded input produce utf8::downgraded output
Date:	Thu, 25 Dec 2014 00:51:54 +0100
To:	bug-URI [...] rt.cpan.org
From:	Jonas Smedegaard <dr [...] jones.dk>

Hi Gisle,

Comparing this bug report with https://github.com/kasei/perl-iri/issues/2 (understanding far better now than when I followed along a year ago), it occurs to me that in this conversation it is not clear that the URI module downgrades strings that are already utf8::upgraded.

Perhaps that is the "stronger argument" that you sought back then?

This demonstrates the downgrade (based on the above IRI conversation):

use URI;
use Devel::Peek;
my $value = "http://www.hestebedg\x{e5}rd.dk/#frag";
utf8::upgrade($value);
print STDERR "Raw value: ";
Dump($value);
my $uri = URI->new($value);
print STDERR "URI as_iri: ";
Dump($uri->as_iri);

Regards,

 - Jonas

P.S. "Hestebedgård" is a farm turned into a museum, located on the island of Orø where I live. I hit bugs in RDF::Trine when challenging myself to learn RDF by semantically modelling public facilities on my island - leading e.g. to http://data.biks.dk/hours/ ...in case you are curious and do not grok Scandinavian language (as your name and interest in non-ASCII characters indicates).

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136 Website: http://dr.jones.dk/

 [x] quote me freely [ ] ask before reusing [ ] keep private
