Bug #44226 for URI-Find: Support Unicode in URLs and domains

Mon Mar 16 20:10:49 2009 mschwern [...] cpan.org - Ticket created

CC:	miyagawa [...] bulknews.net
Subject:	Support Unicode in URLs and domains

URI::Find should support Unicode in the domains and in the path part. Since URI.pm already supports this it should be a matter of updating the cheater regex used by URI::Find.

Mon Mar 16 21:47:14 2009 mschwern [...] cpan.org - Correspondence added

If anyone's interested there's the IDNA branch on github which has most of this done. Just needs to keep URI from escaping things and solve the foo:<1.2.3.4> problem.

Mon Mar 16 21:47:15 2009 mschwern [...] cpan.org - Status changed from 'new' to 'open'

Tue Feb 19 04:18:57 2013 kas [...] fi.muni.cz - Correspondence added

From:	kas [...] fi.muni.cz

Hello,

On Tue Mar 17 02:47:14 2009, MSCHWERN wrote:

> If anyone's interested there's the IDNA branch on github which has most
> of this done. Just needs to keep URI from escaping things and solve the
> foo:<1.2.3.4> problem.

As far as I can see, the IDNA branch works for me. Would you mind merging it and making a release? There is a merge conflict in t/Find.t, which can easily be resolved by keeping the tests added by both branches.

Thanks!

-Yenya

Mon Sep 09 12:25:35 2013 kas [...] fi.muni.cz - Correspondence added

On Tue Feb 19 10:18:57 2013, YENYA wrote:

> Hello,
>
> On Tue Mar 17 02:47:14 2009, MSCHWERN wrote:
> > If anyone's interested there's the IDNA branch on github which has most
> > of this done. Just needs to keep URI from escaping things and solve the
> > foo:<1.2.3.4> problem.
>
> as far as I can see, the IDNA branch works for me. Would you mind to
> merge it and make a release? There is a merge conflict in t/Find.t which
> can be easily solved by keeping the tests added by both branches.

Well, the IDNA branch has one problem: the escape_func is called with the correct second argument but an incorrect first argument. The first argument is the result of the uri_unescape() call near the end of the _is_uri() function, and uri_unescape() returns bytes instead of characters. A clean solution would be to have some kind of uri_unescape_utf8() inside URI/Escape.pm, but it is also possible to wrap the uri_unescape() call inside Encode::decode(). Patch attached; please review and, if it looks good, apply.

Subject:	URI-Find-utf8.patch

diff --git a/URI/Find.pm b/URI/Find.pm
index 6b7e4a5..dd87cf7 100644
--- a/perllib/URI/Find.pm
+++ b/perllib/URI/Find.pm
@@ -19,6 +19,7 @@ use constant NO => !YES;
 use Carp qw(croak);
 use URI::URL;
 use URI::Escape;
+use Encode;
 require URI;
@@ -524,7 +525,7 @@ sub _is_uri {
     else { # Its a URI.
         # URI is a bit too overzealous about escaping.
         # XXX but this means we unescape already escaped URLs and lose round tripping
-        return uri_unescape($uri);
+        return Encode::decode('utf8', uri_unescape($uri), Encode::FB_QUIET);
     }
 }
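
For readers who want to see the bytes-versus-characters difference the patch addresses, here is a minimal standalone sketch. It is not part of the patch or of URI::Find; the example URL and the variable names are made up for illustration, and it only assumes that URI::Escape and Encode are installed.

```perl
use strict;
use warnings;
use Encode ();
use URI::Escape qw(uri_unescape);

# Hypothetical input: "é" (U+00E9) percent-encoded as its two UTF-8 octets.
my $escaped = 'http://example.com/caf%C3%A9';

# uri_unescape() produces one character per decoded octet,
# so "é" comes back as the two bytes 0xC3 0xA9.
my $bytes    = uri_unescape($escaped);
my $byte_len = length $bytes;

# Same wrapping as in the patch: decode the octets as UTF-8 so that
# "é" becomes a single character. The result of uri_unescape() is passed
# directly because Encode::decode() may modify its input when a CHECK
# value such as FB_QUIET is given.
my $chars    = Encode::decode('utf8', uri_unescape($escaped), Encode::FB_QUIET);
my $char_len = length $chars;

printf "undecoded length: %d, decoded length: %d\n", $byte_len, $char_len;
```

The decoded string is one unit shorter than the undecoded one, because the two octets of "é" collapse into one character. A callback that receives the undecoded form would see two Latin-1 characters where the text contains one.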

Wed Jul 02 16:47:57 2014 mschwern [...] cpan.org - Correspondence added

Version 20140702, just released, supports Unicode domains.

This was a patch independent of the work in the IDNA branch. It works, and this issue is so old I don't remember what the problems were, so I'm closing this ticket. Feel free to open new issues and apply work from the IDNA branch if there are problems.

We're using Github now for issues.
https://github.com/schwern/URI-Find/issues

Wed Jul 02 16:47:58 2014 mschwern [...] cpan.org - Status changed from 'open' to 'resolved'
