Subject: STOP_AT_PARTIAL forced on for renew()ed encodings (for some anyway)
This causes PerlIO::encoding to loop when a partial character is found at eof.
PerlIO::encoding calls into Encode roughly like the loop in the following code:
use strict;
use Encode qw(encode decode);
use constant BUFSIZ => 8192;
my $flags = Encode::PERLQQ(); # this can't be zero
my $encoding = "UTF-8";
my $filesrc = "\x{100c}" x 10000;
my $filedata = encode($encoding, $filesrc, Encode::FB_CROAK);
#chop $filedata; # LINE A
my $expect = decode($encoding, (my $temp = $filedata), $flags);
my $enc = Encode::find_encoding("UTF-8")
    or die;
# this seems to have been added for PerlIO::encoding
my $dup = $enc->renew
    or die;
my $out = "";
my $buf = "";
while (length $filedata || length $buf) {
    # refill the buffer from the file
    my $fillsize = BUFSIZ - length($buf);
    $buf .= substr($filedata, 0, $fillsize, "");
    my $eof = $filedata eq "";
    # current behaviour
    my $mflags = $flags | Encode::STOP_AT_PARTIAL; # LINE B
    # try to avoid looping over a partial at eof
    # my $mflags = $eof ? $flags : $flags | Encode::STOP_AT_PARTIAL; # LINE C
    # decode our buffer, consuming some/all of it
    my $result = $dup->decode($buf, $mflags);
    print length $buf, "\n";
    $out .= $result;
}
print $out eq $expect ? "ok\n" : "not ok\n";
This works fine if there's no partial at eof, but if there is (uncomment the chop at LINE A) it will loop until terminated.
That much is expected, since we pass in STOP_AT_PARTIAL, but even if we make that flag conditional on eof (uncomment LINE C and comment out LINE B) it continues to loop.
For UTF-8 at least this occurs because Method_decode() in Encode.xs passes a true value to process_utf8() for the stop_at_partial parameter if the encoding has been "renewed".
If I replace in Method_decode():
    s = process_utf8(aTHX_ dst, s, e, check_sv, 0, strict_utf8(aTHX_ obj), renewed);
with:
    s = process_utf8(aTHX_ dst, s, e, check_sv, 0, strict_utf8(aTHX_ obj), 0);
the modified code above works correctly.
This particular change would break PerlIO::encoding on older perls[1], but it would be useful if Encode could provide a way for PerlIO::encoding to prevent that behaviour.
I understand the renew() is needed to ensure the state of the encoding is kept, e.g. the byte order for UTF-16 encodings, so I don't think that can be removed.
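As a stopgap on the caller's side, the loop can at least be made to terminate by noticing that a decode call at eof consumed nothing. The following is only a rough sketch of that idea, not what PerlIO::encoding does: decode_chunked() is a hypothetical helper invented for illustration, and it assumes that a renewed UTF-8 encoding called with STOP_AT_PARTIAL leaves an incomplete trailing character in the buffer, as described above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode or PerlIO::encoding).
# Decodes $data in chunks of at most $bufsiz bytes and stops once eof
# has been reached and a decode call makes no progress.
sub decode_chunked {
    my ($data, $bufsiz) = @_;
    my $enc = Encode::find_encoding("UTF-8")
        or die;
    my $dup = $enc->renew
        or die;
    # non-zero flags so that decode() consumes from the buffer in place
    my $flags = Encode::PERLQQ() | Encode::STOP_AT_PARTIAL();
    my $out = "";
    my $buf = "";
    while (length $data || length $buf) {
        # refill the buffer from the remaining input
        $buf .= substr($data, 0, $bufsiz - length($buf), "");
        my $eof    = $data eq "";
        my $before = length $buf;
        $out .= $dup->decode($buf, $flags);
        # nothing consumed at eof: only an incomplete character is left
        last if $eof && length($buf) == $before;
    }
    # $buf holds any trailing bytes that could not be decoded
    return ($out, $buf);
}
```

This only avoids the loop. The leftover bytes are handed back to the caller, who still has to decide whether to report them as malformed or drop them, so it does not replace a proper way for PerlIO::encoding to turn the forced behaviour off.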
This might be the cause of https://rt.cpan.org/Ticket/Display.html?id=124094
Any ideas?
Tony
[1] at least for files that don't end with a partially encoded character