Subject: | Tie::File ruins UTF-8 encoded files |
Date: | Tue, 27 May 2014 20:19:08 +0000 |
To: | perl5-porters [...] perl.org |
From: | Mark Jason Dominus <mjd [...] plover.com> |
It is not safe to use Tie::File to operate on a file with a multibyte
encoding, such as UTF-8. This has been a defect in Tie::File since it
was first written. I was aware at the time that this would be a
problem, but at the time it seemed too hard to fix. I know a little
bit more now, and the problem has become considerably more urgent
since 2005. It should be fixed. I would like to fix this, but I need
some help.
The essential problem is as follows. Tie::File holds an array that
records the tell() position at which each record in the file begins,
as reported by seek(). Tie::File makes the not-completely-warranted
assumption that these offsets are actually byte counts, but I think
this is a minor issue.
The major issue this this: Suppose that record 3 begins at tell()
offset 100, and record 4 begins at tell() offset 150. In some places
Tie::File then concludes from this that record 3 is 50 bytes long,
which I think is generally correct.
But in other places Tie::File will calculate the length of record 3 by
reading it in and calling length() on the result. Since what is
wanted is the length in bytes, this is wrong except for legacy
encodings where each character is one byte.
Now suppose Tie::File is asked to replace record 3 with some string
$s. Tie::File uses length() on $s to find out how many bytes $s is,
which is wrong, and it uses length() on record 3 similarly wrongly.
Then it compares these two wrong lengths to decide whether record 3
can be overwritten with $s in-place, or whether the tail of the file
needs to be copied upwards (if $s is shorter than the old record 3) or
downwards (if it is longer). All these calculations are being done
with character lengths, but they should be done with byte lengths.
Since they are done wrongly, Tie::File mangles the data file when it
modifies the record.
On review, I see the following relevant ten-year-old comment:
# length($oldrec) here is not consistent with text mode TODO XXX BUG
The enclosed program demonstrates the problem. It writes a correct
UTF-8 encoded file, copies it to stdout, then uses Tie::File to modify
a record in the middle, then copies the resulting carbled file to
stdout.
What I think Tie::File needs to do is to find out how many bytes $s
will occupy once written to the file, and use that in place of
length($s) in its length calculations. When I wrote Tie::File in 2005
there was no way to do this, Encode not having been invented. But I
think now I can use Encode to transform $s to a suitably-encoded byte
string, and then use the existing Tie::File machinery to write the
byte string to the file. But I'm not sure this is correct; I need
someone with some domain knowledge to help me make the right changes.
Also it seems to me that to choose the correct encoding, Tie::File
needs some way to interrogate the filehandle it is given, if it is
given one, and it needs to allow an "encoding => ...." option to be
supplied in the tie() call when it is given a filename. There may be
other encoding-related options it should support that I have no
thought of.
----------------------------------------------------------------
#!/usr/bin/perl
use Tie::File;
use Fcntl;
{ open my($fh), ">:raw", "tf-test-data" or die $!;
print $fh "Fl\303\274ghafen
Chinese:\344\270\255\345\234\213\345\223\262\345\255\270\346\233\270\351\233\273\345\255\220\345\214\226\350\250\210\345\212\203
Potat\303\270es\n";
}
binmode(STDOUT, ":utf8");
{ open my($fh), "+<:utf8", "tf-test-data"
or die $!;
print while <$fh>;
print "-------\n";
}
{
open my($fh), "+<:utf8", "tf-test-data";
my @A;
tie @A, Tie::File => $fh or die;
$A[1] = "octopus";
}
{ open my($fh), "+<:utf8", "tf-test-data"
or die $!;
print while <$fh>;
print "-------\n";
}
__DATA__