Bug #28407 for Archive-Tar: Unicode and Archive::Tar

Fri Jul 20 16:13:45 2007 MSCHILLI [...] cpan.org - Ticket created

Subject: Unicode and Archive::Tar - documentation patch

Archive::Tar uses byte semantics for any data read from files or written to them. This is not a problem if you only deal with existing files and never look at their content or work solely with byte strings. But if you use Unicode strings with character semantics, some additional steps need to be taken.

For example, if you add a Unicode string like

    # Problem
    $tar->add_data('file.txt', "Euro: \x{20AC}");

then there will be a problem later when the tarfile gets written out to disk via $tar->write():

    Wide character in print at .../Archive/Tar.pm line 1014.

The data was added as a Unicode string and when writing it out to disk, the :utf8 line discipline wasn't set by Archive::Tar, so Perl tried to convert the string to ISO-8859 and failed. The written file now contains garbage.

There's probably no easy fix for this (unless you want to add Encode.pm as a prerequisite and check if you receive a Unicode string), so I thought a documentation patch might help people circumvent this and related problems. Attached.
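The difference between character and byte semantics described above can be seen without Archive::Tar at all. The following is a minimal sketch, not part of the ticket: it only compares the length of the example string before and after encoding. It uses the strict 'UTF-8' encoding name from Encode; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: "Euro: " (6 characters) plus U+20AC (1 character).
my $chars = "Euro: \x{20AC}";

# Byte string: the same text encoded as UTF-8. U+20AC takes 3 bytes.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;   # prints "characters: 7"
printf "bytes:      %d\n", length $bytes;   # prints "bytes:      9"
```

A module that works in bytes needs the second form; handing it the first form is what leads to the "Wide character" warning quoted above.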

Subject: patch.txt

--- Archive-Tar-1.32/lib/Archive/Tar.pm	Thu May 24 04:21:42 2007
+++ Archive-Tar-1.32.patched/lib/Archive/Tar.pm	Fri Jul 20 12:58:36 2007
@@ -1451,6 +1451,56 @@
 
 __END__
 
+=head1 Handling Unicode Strings
+
+C<Archive::Tar> uses byte semantics for any files it reads from or writes
+to disk. This is not a problem if you only deal with files and never
+look at their content or work solely with byte strings. But if you use
+Unicode strings with character semantics, some additional steps need
+to be taken.
+
+For example, if you add a Unicode string like
+
+    # Problem
+    $tar->add_data('file.txt', "Euro: \x{20AC}");
+
+then there will be a problem later when the tarfile gets written out
+to disk via C<$tar->write()>:
+
+    Wide character in print at .../Archive/Tar.pm line 1014.
+
+The data was added as a Unicode string and when writing it out to disk,
+the C<:utf8> line discipline wasn't set by C<Archive::Tar>, so Perl
+tried to convert the string to ISO-8859 and failed. The written file
+now contains garbage.
+
+For this reason, Unicode strings need to be converted to UTF-8-encoded
+bytestrings before they are handed off to C<add_data()>:
+
+    use Encode;
+    my $data = "Accented character: \x{20AC}";
+    $data = encode('utf8', $data);
+
+    $tar->add_data('file.txt', $data);
+
+A opposite problem occurs if you extract a UTF8-encoded file from a
+tarball. Using C<get_content()> on the C<Archive::Tar::File> object
+will return its content as a bytestring, not as a Unicode string.
+
+If you want it to be a Unicode string (because you want character
+semantics with operations like regular expression matching), you need
+to decode the UTF8-encoded content and have Perl convert it into
+a Unicode string:
+
+    use Encode;
+    my $data = $tar->get_content();
+    # Make it a Unicode string
+    $data = decode('utf8', $data);
+
+There is no easy way to provide this functionality in C<Archive::Tar>,
+because a tarball can contain many files, and each of which could be
+encoded in a different way.
+
 =head1 GLOBAL VARIABLES
 
 =head2 $Archive::Tar::FOLLOW_SYMLINK
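The two halves of the patch's advice (encode before add_data(), decode after get_content()) can be combined into one round trip. The sketch below is an illustration under a few assumptions, not code from the ticket: the temporary directory and the file name example.tar are hypothetical, the strict 'UTF-8' encoding name is used instead of the patch's 'utf8', and get_content() is called on the archive object with the member name as its argument.

```perl
use strict;
use warnings;
use Archive::Tar;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $text = "Euro: \x{20AC}";    # character string

# Encode to a UTF-8 byte string before handing the data to Archive::Tar.
my $tar = Archive::Tar->new;
$tar->add_data('file.txt', encode('UTF-8', $text));

# Write the archive to a scratch directory (hypothetical location).
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/example.tar";
$tar->write($path) or die "could not write $path";

# Read it back. The member content is returned as bytes ...
my $in = Archive::Tar->new($path) or die "could not read $path";
my $bytes = $in->get_content('file.txt');

# ... and has to be decoded to get a character string again.
my $back = decode('UTF-8', $bytes);

print "round trip ok\n" if $back eq $text;
```

Because each member of a tarball may use a different encoding, the caller has to choose the encoding name for every decode() call; 'UTF-8' here only matches what this sketch wrote.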

Wed Jul 25 09:41:56 2007 kane [...] cpan.org - Correspondence added

Thanks, applied under the FAQ section.

Wed Jul 25 09:41:59 2007 The RT System itself - Status changed from 'new' to 'open'

Wed Jul 25 09:42:03 2007 kane [...] cpan.org - Status changed from 'open' to 'resolved'
