Subject: | Unicode and Archive::Tar - documentation patch |
Archive::Tar uses byte semantics for any data read from files or
written to them. This is not a problem if you only deal with existing
files and never look at their content, or if you work solely with byte
strings. But if you use Unicode strings with character semantics, some
additional steps need to be taken.
For example, if you add a Unicode string like

    # Problem
    $tar->add_data('file.txt', "Euro: \x{20AC}");

then there will be a problem later when the tarfile gets written out
to disk via $tar->write():

    Wide character in print at .../Archive/Tar.pm line 1014.

The data was added as a Unicode string, and because Archive::Tar
didn't set the :utf8 layer on the output filehandle, Perl tried to
downgrade the string to ISO-8859-1 (Latin-1) when writing it to disk
and failed. The written file now contains garbage.
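
The warning can be reproduced without Archive::Tar. The following is only a minimal sketch: it assumes an in-memory filehandle behaves like the file handle Archive::Tar writes to, and the variable names are made up for illustration. It prints a character string to a handle that has no encoding layer and compares the character count with the number of bytes that end up in the output.

```perl
use strict;
use warnings;

my $string = "Euro: \x{20AC}";    # character string containing U+20AC

# Collect warnings so they can be inspected instead of going to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory filehandle opened without any encoding layer
# (an assumption standing in for the handle used when writing a tarball).
open my $fh, '>', \my $buffer or die "Cannot open in-memory handle: $!";
print {$fh} $string;
close $fh;

print "warning: $_" for @warnings;
printf "characters in string: %d, bytes written: %d\n",
    length($string), length($buffer);
```

The string has 7 characters but 9 bytes are written, because the Euro sign takes three bytes in UTF-8. Any code that computes sizes from the character string will disagree with what is actually on disk.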
There's probably no easy fix for this (unless you want to add Encode.pm
as a prerequisite and check if you receive a Unicode string), so I
thought a documentation patch might help people circumvent this and
related problems.
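
For reference, the kind of check mentioned above could look roughly like the sketch below. The helper name to_bytes() is hypothetical and not part of Archive::Tar, and the choice of UTF-8 as the target encoding is an assumption. The example encodes before add_data() and decodes after get_content(), all in memory.

```perl
use strict;
use warnings;
use Archive::Tar;
use Encode qw(encode decode);

# Hypothetical helper (not part of Archive::Tar): encode strings that carry
# Perl's internal UTF8 flag as UTF-8 bytes, pass everything else through.
sub to_bytes {
    my ($string) = @_;
    return utf8::is_utf8($string) ? encode('UTF-8', $string) : $string;
}

my $tar = Archive::Tar->new;

# Hand a byte string to add_data(), never a character string.
$tar->add_data('file.txt', to_bytes("Euro: \x{20AC}"));

# get_content() returns bytes; decode them to get character semantics back.
my $bytes = $tar->get_content('file.txt');
my $text  = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
```

Note that utf8::is_utf8() only inspects Perl's internal flag. A string holding Latin-1 characters without that flag passes through unchanged, so a check like this is a heuristic at best, which is part of why a documentation fix seems the safer route.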
Attached.
Subject: | patch.txt |
--- Archive-Tar-1.32/lib/Archive/Tar.pm Thu May 24 04:21:42 2007
+++ Archive-Tar-1.32.patched/lib/Archive/Tar.pm Fri Jul 20 12:58:36 2007
@@ -1451,6 +1451,56 @@
__END__
+=head1 Handling Unicode Strings
+
+C<Archive::Tar> uses byte semantics for any files it reads from or writes
+to disk. This is not a problem if you only deal with files and never
+look at their content, or if you work solely with byte strings. But if
+you use Unicode strings with character semantics, some additional steps
+need to be taken.
+
+For example, if you add a Unicode string like
+
+ # Problem
+ $tar->add_data('file.txt', "Euro: \x{20AC}");
+
+then there will be a problem later when the tarfile gets written out
+to disk via C<< $tar->write() >>:
+
+ Wide character in print at .../Archive/Tar.pm line 1014.
+
+The data was added as a Unicode string, and because C<Archive::Tar>
+didn't set the C<:utf8> layer on the output filehandle, Perl tried to
+downgrade the string to ISO-8859-1 when writing it to disk and failed.
+The written file now contains garbage.
+
+For this reason, Unicode strings need to be converted to UTF-8-encoded
+bytestrings before they are handed off to C<add_data()>:
+
+ use Encode;
+ my $data = "Euro: \x{20AC}";
+ $data = encode('UTF-8', $data);
+
+ $tar->add_data('file.txt', $data);
+
+The opposite problem occurs if you extract a UTF-8-encoded file from a
+tarball. Using C<get_content()> on the C<Archive::Tar::File> object
+will return its content as a bytestring, not as a Unicode string.
+
+If you want it to be a Unicode string (because you want character
+semantics with operations like regular expression matching), you need
+to decode the UTF-8-encoded content and have Perl convert it into
+a Unicode string:
+
+ use Encode;
+ my $data = $tar->get_content('file.txt');
+ # Make it a Unicode string
+ $data = decode('UTF-8', $data);
+
+There is no easy way to provide this functionality in C<Archive::Tar>,
+because a tarball can contain many files, each of which could be
+encoded in a different way.
+
=head1 GLOBAL VARIABLES
=head2 $Archive::Tar::FOLLOW_SYMLINK