Bug #107856 for PathTools: abs2rel problem with unicode paths

Mon Oct 19 11:13:57 2015 HAKONH [...] cpan.org - Ticket created

Subject:

abs2rel problem with unicode paths

Using Perl version 5.20.1 on a Linux laptop. When running the following script:

    use feature qw(say);
    use strict;
    use utf8;
    use warnings;
    use Env qw(HOME);
    use File::Spec::Functions qw(abs2rel);

    my $tdir = 'ø';
    my $path = "$HOME/$tdir/b/æ";
    my $base = "$HOME/$tdir";
    chdir $base;
    binmode STDOUT, ":utf8";
    say abs2rel( $path, $base );
    say abs2rel( $path );

I get output:

    b/æ
    ../ø/b/æ

Expected output:

    b/æ
    ../ø/b/æ

Assumed problem: Line 409 in Unix.pm ( https://metacpan.org/source/SMUELLER/PathTools-3.47/lib/File/Spec/Unix.pm )

    $base = $self->_cwd() unless defined $base and length $base;

calls Cwd::getcwd() which returns bytes, this causes $base not to be recognized as a prefix for $path..

Fix: _cwd() should return unicode in this case.
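A minimal sketch of the suspected mismatch, with no filesystem access: it compares a decoded character string against its UTF-8 encoded bytes, which is what a byte-oriented getcwd() is assumed to return here. The path /home/user/ø is purely illustrative.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Illustrative path only; stands in for "$HOME/$tdir" from the script.
my $chars = "/home/user/ø";            # character string: ø is one character
my $bytes = encode('UTF-8', $chars);   # byte string: ø becomes two bytes

print length($chars), "\n";            # 12
print length($bytes), "\n";            # 13

# Perl compares the two character by character, so they are not equal,
# and the byte string cannot match as a prefix of a character-string path.
print $chars eq $bytes ? "same\n" : "different\n";   # different
```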

Mon Oct 19 11:16:20 2015 HAKONH [...] cpan.org - Correspondence added


>
> Expected output:
>
> b/æ
> ../ø/b/æ
>

Sorry that was a typo, should be:

Expected output:

    b/æ
    b/æ

Mon Oct 19 12:13:44 2015 ether [...] cpan.org - Correspondence added

On 2015-10-19 08:13:57, HAKONH wrote:

> $base = $self->_cwd() unless defined $base and length $base;
>
> calls Cwd::getcwd() which returns bytes, this causes $base not to be
> recognized as a prefix for $path..
>
> Fix: _cwd() should return unicode in this case.

I'm not sure that the code should do any utf8 decoding of filenames, at least not without being requested to -- there is no standardization for filesystems to use a specific encoding (some use UTF-16, some use latin1, some use utf-8..) and there is no way for us to tell which one is in use.

Mon Oct 19 12:13:45 2015 The RT System itself - Status changed from 'new' to 'open'

Mon Oct 19 15:04:19 2015 kwilliams [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #107856] abs2rel problem with unicode paths
Date:	Mon, 19 Oct 2015 14:04:06 -0500
To:	"bug-PathTools [...] rt.cpan.org" <bug-PathTools [...] rt.cpan.org>
From:	Ken Williams <kwilliams [...] cpan.org>

Filesystems use encodings at all? I thought they just used byte sequences.

On Mon, Oct 19, 2015 at 11:13 AM, Karen Etheridge via RT <bug-PathTools@rt.cpan.org> wrote:

> Queue: PathTools
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=107856 >
>
> On 2015-10-19 08:13:57, HAKONH wrote:
>
> > $base = $self->_cwd() unless defined $base and length $base;
> >
> > calls Cwd::getcwd() which returns bytes, this causes $base not to be
> > recognized as a prefix for $path..
> >
> > Fix: _cwd() should return unicode in this case.
>
> I'm not sure that the code should do any utf8 decoding of filenames, at
> least not without being requested to -- there is no standardization for
> filesystems to use a specific encoding (some use UTF-16, some use latin1,
> some use utf-8..) and there is no way for us to tell which one is in use.
>

Mon Oct 19 15:05:57 2015 HAKONH [...] cpan.org - Correspondence added

Maybe the function should then croak if the user uses the one-argument call and $path has the UTF-8 flag set? Since in this case unexpected results may occur as shown..

Accordingly, a workaround seems to be to encode $path before passing it on:

    my $encode_flags = Encode::FB_CROAK | Encode::LEAVE_SRC;
    $path = Encode::encode( 'UTF-8', $path, $encode_flags );
    say Encode::decode( 'UTF-8', abs2rel( $path ), $encode_flags );

Output:

    b/æ
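For reuse, the workaround above could be wrapped in a small helper. This is only a sketch: abs2rel_utf8 is a hypothetical name, not part of PathTools, and it assumes a Unix-style system where filenames are stored as UTF-8, which (as noted earlier in this ticket) cannot be detected in general.

```perl
use strict;
use warnings;
use Encode ();
use File::Spec::Functions qw(abs2rel);

# Hypothetical helper (not part of PathTools). Takes character strings,
# encodes them to UTF-8 bytes so they match what Cwd returns, and
# decodes the result back to a character string.
# ASSUMPTION: the filesystem encoding is UTF-8.
sub abs2rel_utf8 {
    my ($path, $base) = @_;
    my $flags = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Encode only the arguments that were actually supplied, so the
    # one-argument form still falls back to the current directory.
    my @args = map  { Encode::encode('UTF-8', $_, $flags) }
               grep { defined } ($path, $base);

    return Encode::decode('UTF-8', abs2rel(@args), $flags);
}
```

With the script from the original report, say abs2rel_utf8( $path ) would take the place of the explicit encode/decode calls. Whether a helper like this belongs in File::Spec itself is the open question here.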