Bug #96014 for Tie-File: Tie::File ruins UTF-8 encoded files

Tue May 27 17:10:00 2014 mjd [...] plover.com - Ticket created

Subject:	Tie::File ruins UTF-8 encoded files
Date:	Tue, 27 May 2014 20:19:08 +0000
To:	perl5-porters [...] perl.org
From:	Mark Jason Dominus <mjd [...] plover.com>

It is not safe to use Tie::File to operate on a file with a multibyte encoding, such as UTF-8. This has been a defect in Tie::File since it was first written. I was aware at the time that this would be a problem, but it seemed too hard to fix. I know a little bit more now, and the problem has become considerably more urgent since 2005. It should be fixed. I would like to fix this, but I need some help.

The essential problem is as follows. Tie::File holds an array that records the tell() position at which each record in the file begins, as reported by seek(). Tie::File makes the not-completely-warranted assumption that these offsets are actually byte counts, but I think this is a minor issue.

The major issue is this: Suppose that record 3 begins at tell() offset 100, and record 4 begins at tell() offset 150. In some places Tie::File then concludes from this that record 3 is 50 bytes long, which I think is generally correct. But in other places Tie::File will calculate the length of record 3 by reading it in and calling length() on the result. Since what is wanted is the length in bytes, this is wrong except for legacy encodings where each character is one byte.

Now suppose Tie::File is asked to replace record 3 with some string $s. Tie::File uses length() on $s to find out how many bytes $s is, which is wrong, and it uses length() on record 3 similarly wrongly. Then it compares these two wrong lengths to decide whether record 3 can be overwritten with $s in-place, or whether the tail of the file needs to be copied upwards (if $s is shorter than the old record 3) or downwards (if it is longer).

All these calculations are being done with character lengths, but they should be done with byte lengths. Since they are done wrongly, Tie::File mangles the data file when it modifies the record.
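To make the character-versus-byte discrepancy concrete, here is a small standalone sketch. It does not use Tie::File at all; `byte_length` is a hypothetical helper introduced only for illustration, and UTF-8 is assumed as the file encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Tie::File): the number of bytes $str
# would occupy once encoded as UTF-8.
sub byte_length {
    my ($str) = @_;
    return length(encode('UTF-8', $str));
}

# "Fl\x{FC}ghafen" is 9 characters, but U+00FC takes two bytes in UTF-8.
my $old = "Fl\x{FC}ghafen";
my $new = "octopus";

printf "old: %d chars, %d bytes\n", length($old), byte_length($old);
printf "new: %d chars, %d bytes\n", length($new), byte_length($new);
# prints:
#   old: 9 chars, 10 bytes
#   new: 7 chars, 7 bytes
```

A size comparison based on length() sees a difference of 2 here, while the records differ by 3 bytes on disk, so any decision about how far to move the tail of the file would be off by one byte.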
On review, I see the following relevant ten-year-old comment:

    # length($oldrec) here is not consistent with text mode TODO XXX BUG

The enclosed program demonstrates the problem. It writes a correct UTF-8 encoded file, copies it to stdout, then uses Tie::File to modify a record in the middle, then copies the resulting garbled file to stdout.

What I think Tie::File needs to do is to find out how many bytes $s will occupy once written to the file, and use that in place of length($s) in its length calculations. When I wrote Tie::File in 2005 there was no way to do this, Encode not having been invented. But I think now I can use Encode to transform $s to a suitably-encoded byte string, and then use the existing Tie::File machinery to write the byte string to the file. But I'm not sure this is correct; I need someone with some domain knowledge to help me make the right changes.

Also it seems to me that to choose the correct encoding, Tie::File needs some way to interrogate the filehandle it is given, if it is given one, and it needs to allow an "encoding => ...." option to be supplied in the tie() call when it is given a filename. There may be other encoding-related options it should support that I have not thought of.

----------------------------------------------------------------

#!/usr/bin/perl

use Tie::File;
use Fcntl;

# Write three UTF-8 encoded records as raw bytes.
{
  open my($fh), ">:raw", "tf-test-data" or die $!;
  print $fh "Fl\303\274ghafen\n",
            "Chinese:\344\270\255\345\234\213\345\223\262\345\255\270\346\233\270\351\233\273\345\255\220\345\214\226\350\250\210\345\212\203\n",
            "Potat\303\270es\n";
}

binmode(STDOUT, ":utf8");

# Copy the original file to stdout.
{
  open my($fh), "+<:utf8", "tf-test-data" or die $!;
  print while <$fh>;
  print "-------\n";
}

# Replace the middle record through Tie::File.
{
  open my($fh), "+<:utf8", "tf-test-data" or die $!;
  my @A;
  tie @A, Tie::File => $fh or die;
  $A[1] = "octopus";
}

# Copy the modified file to stdout.
{
  open my($fh), "+<:utf8", "tf-test-data" or die $!;
  print while <$fh>;
  print "-------\n";
}

__DATA__
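The proposal in this report, encoding $s to bytes before measuring it and asking the filehandle which encoding applies, might be sketched roughly as follows. This is only an illustration under stated assumptions, not a patch to Tie::File: `guess_encoding` and `record_bytes` are hypothetical names, the layer inspection relies on the core `PerlIO::get_layers` function, and whether this approach is actually correct is the open question the report raises.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: guess an encoding name from a handle's PerlIO layers.
# Returns undef when no encoding layer is found (treat data as raw bytes).
# PerlIO::get_layers is built into perl, so no module needs to be loaded.
sub guess_encoding {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh);
    # An explicit :encoding(...) layer names the encoding directly.
    for my $layer (@layers) {
        return $1 if $layer =~ /^encoding\((.+)\)$/;
    }
    # A bare :utf8 layer with no :encoding(...) layer means UTF-8.
    return 'UTF-8' if grep { $_ eq 'utf8' } @layers;
    return undef;
}

# Hypothetical helper: the byte string that $rec would occupy on disk.
sub record_bytes {
    my ($rec, $encoding) = @_;
    return defined $encoding ? encode($encoding, $rec) : $rec;
}

# Compare old and new records by their encoded byte lengths.
my $old = record_bytes("Fl\x{FC}ghafen\n", 'UTF-8');
my $new = record_bytes("octopus\n",        'UTF-8');
printf "byte difference: %d\n", length($new) - length($old);
# prints: byte difference: -3
```

The sketch covers only the measuring side. Writing the encoded bytes back through a handle that already has an encoding layer would presumably encode them a second time, so a real fix would also have to decide which layer does the encoding.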

Tue May 27 17:11:42 2014 MJD [...] cpan.org - Taken

Tue May 27 17:11:43 2014 MJD [...] cpan.org - Severity Critical added

Tue May 27 17:11:43 2014 MJD [...] cpan.org - Broken in 0.98 added

Sat Jan 18 22:51:44 2020 TODDR [...] cpan.org - Correspondence added

Ticket migrated to github as https://github.com/toddr/Tie-File/issues/8

Sat Jan 18 22:51:44 2020 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 18 22:51:44 2020 TODDR [...] cpan.org - Status changed from 'open' to 'resolved'