Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the URI-Find CPAN distribution.

Report information
The Basics
Id: 44226
Status: resolved
Priority: 0/
Queue: URI-Find

People
Owner: Nobody in particular
Requestors: mschwern [...] cpan.org
Cc: miyagawa [...] bulknews.net
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



CC: miyagawa [...] bulknews.net
Subject: Support Unicode in URLs and domains
URI::Find should support Unicode in domains and in the path part. Since URI.pm already supports this, it should be a matter of updating the cheater regex used by URI::Find.
If anyone's interested there's the IDNA branch on github which has most of this done. Just needs to keep URI from escaping things and solve the foo:<1.2.3.4> problem.
From: kas [...] fi.muni.cz
Hello,

On Tue Mar 17 02:47:14 2009, MSCHWERN wrote:
> If anyone's interested there's the IDNA branch on github which has most
> of this done. Just needs to keep URI from escaping things and solve the
> foo:<1.2.3.4> problem.
As far as I can see, the IDNA branch works for me. Would you mind merging it and making a release? There is a merge conflict in t/Find.t, which can easily be resolved by keeping the tests added by both branches.

Thanks!

-Yenya
On Tue Feb 19 10:18:57 2013, YENYA wrote:
> Hello,
>
> On Tue Mar 17 02:47:14 2009, MSCHWERN wrote:
> > If anyone's interested there's the IDNA branch on github which has most
> > of this done. Just needs to keep URI from escaping things and solve the
> > foo:<1.2.3.4> problem.
>
> As far as I can see, the IDNA branch works for me. Would you mind
> merging it and making a release? There is a merge conflict in t/Find.t,
> which can easily be resolved by keeping the tests added by both branches.
Well, the IDNA branch has one problem: the escape_func is called with the correct second argument but an incorrect first argument. The first argument is the result of the uri_unescape() call near the end of the _is_uri() function, and uri_unescape() returns bytes instead of characters.

A clean solution would be to have some kind of uri_unescape_utf8() inside URI/Escape.pm, but it is also possible to wrap the uri_unescape() call in Encode::decode(). Patch attached; please review and, if it looks right, apply.
Subject: URI-Find-utf8.patch
diff --git a/URI/Find.pm b/URI/Find.pm
index 6b7e4a5..dd87cf7 100644
--- a/perllib/URI/Find.pm
+++ b/perllib/URI/Find.pm
@@ -19,6 +19,7 @@ use constant NO => !YES;
 use Carp qw(croak);
 use URI::URL;
 use URI::Escape;
+use Encode;
 
 require URI;
 
@@ -524,7 +525,7 @@ sub _is_uri {
     else {
         # Its a URI.
         # URI is a bit too overzealous about escaping.
         # XXX but this means we unescape already escaped URLs and lose round tripping
-        return uri_unescape($uri);
+        return Encode::decode('utf8', uri_unescape($uri), Encode::FB_QUIET);
     }
 }
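To illustrate the bytes-versus-characters problem the patch addresses, here is a minimal sketch using only core modules. The percent_decode() helper is a hypothetical stand-in for uri_unescape(), written on the assumption that it turns each %XX escape into a single byte; the example URL is made up and is not taken from the URI::Find test suite.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for URI::Escape::uri_unescape(), so that this sketch
# needs core modules only: every %XX escape becomes exactly one byte.
sub percent_decode {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $str;
}

# Made-up example: "e with acute" (U+00E9) escaped as its two UTF-8 bytes.
my $escaped = 'http://example.com/caf%C3%A9';

# What the unpatched code would return: a byte string in which the
# accented letter still occupies two separate bytes.
my $bytes = percent_decode($escaped);

# What the patched code would return: a character string. decode() with a
# CHECK argument may modify its input, so it is given a fresh temporary
# value here, just as in the patch.
my $chars = Encode::decode('utf8', percent_decode($escaped), Encode::FB_QUIET);

printf "byte string length:      %d\n", length $bytes;
printf "character string length: %d\n", length $chars;
```

One design point worth checking before relying on this approach: with Encode::FB_QUIET, decode() stops silently at the first malformed sequence and returns only what it has decoded so far, so a URL whose escapes are not valid UTF-8 (for example a Latin-1 %E9) would come back truncated rather than raising an error.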
Release 20140702, just published, supports Unicode domains.

This was a patch independent of the work in the IDNA branch. It works, and this issue is so old that I don't remember what the problems were, so I'm closing this ticket. Feel free to open new issues and apply work from the IDNA branch if there are problems.

We're using GitHub now for issues:
https://github.com/schwern/URI-Find/issues