This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 129086
Status: open
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: TONYC [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: STOP_AT_PARTIAL forced on for renew()ed encodings (for some anyway)
This causes PerlIO::encoding to loop when a partial character is found at eof.

PerlIO::encoding calls into Encode roughly like the loop in the following code:

    use strict;
    use Encode qw(encode decode);
    use constant BUFSIZ => 8192;

    my $flags = Encode::PERLQQ(); # this can't be zero
    my $encoding = "UTF-8";
    my $filesrc = "\x{100c}" x 10000;
    my $filedata = encode($encoding, $filesrc, Encode::FB_CROAK);
    #chop $filedata; # LINE A
    my $expect = decode($encoding, (my $temp = $filedata), $flags);
    my $enc = Encode::find_encoding("UTF-8") or die;
    # this seems to have been added for PerlIO::encoding
    my $dup = $enc->renew or die;

    my $out = "";
    my $buf = "";
    while (length $filedata || length $buf) {
        # refill the buffer from the file
        my $fillsize = BUFSIZ - length($buf);
        $buf .= substr($filedata, 0, $fillsize, "");
        my $eof = $filedata eq "";
        # current behaviour
        my $mflags = $flags | Encode::STOP_AT_PARTIAL; # LINE B
        # try to avoid looping over a partial at eof
        # my $mflags = $eof ? $flags : $flags | Encode::STOP_AT_PARTIAL; # LINE C
        # decode our buffer, consuming some/all of it
        my $result = $dup->decode($buf, $mflags);
        print length $buf, "\n";
        $out .= $result;
    }
    print $out eq $expect ? "ok\n" : "not ok\n";

This works fine if there's no partial at eof, but if there is (uncomment the chop at LINE A) it will loop until terminated.

Ok, that's fine since we pass in STOP_AT_PARTIAL, but even if we make that conditional on eof (uncomment LINE C and comment LINE B) it continues to loop.

For UTF-8 at least this occurs because Method_decode() in Encode.xs passes a true value to process_utf8() for the stop_at_partial parameter if the encoding has been "renewed".

If I replace in Method_decode():

    s = process_utf8(aTHX_ dst, s, e, check_sv, 0, strict_utf8(aTHX_ obj), renewed);

with:

    s = process_utf8(aTHX_ dst, s, e, check_sv, 0, strict_utf8(aTHX_ obj), 0);

the modified code above works correctly.
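To see the effect in isolation, the following is a minimal sketch (not taken from the ticket) that feeds a single truncated character to a plain and to a renew()ed UTF-8 encoding object. The variable names are illustrative only. The first two cases show the documented meaning of STOP_AT_PARTIAL on a plain object; the third is the case described above, and the script only reports what happens, since the result depends on the installed Encode.

```perl
use strict;
use warnings;
use Encode ();

my $enc = Encode::find_encoding("UTF-8") or die "UTF-8 encoding not found";

# U+100C is E1 80 8C in UTF-8; dropping the last byte leaves a partial character.
my $partial = "\xe1\x80";

# Plain (non-renewed) object, STOP_AT_PARTIAL set:
# the partial bytes stay in the buffer for the next read.
my $buf_stop = "A" . $partial;
my $res_stop = $enc->decode($buf_stop, Encode::PERLQQ() | Encode::STOP_AT_PARTIAL());

# Plain object, flag clear: the partial bytes are treated as malformed
# and consumed (PERLQQ replaces them with escapes in the result).
my $buf_plain = "A" . $partial;
my $res_plain = $enc->decode($buf_plain, Encode::PERLQQ());

# Renewed object, flag clear: the case this ticket is about.
# No outcome is assumed here; it depends on the installed Encode.
my $dup = $enc->renew or die "renew failed";
my $buf_renewed = "A" . $partial;
my $res_renewed = $dup->decode($buf_renewed, Encode::PERLQQ());

printf "plain, STOP_AT_PARTIAL: %d byte(s) left\n", length($buf_stop);
printf "plain, no flag:         %d byte(s) left\n", length($buf_plain);
printf "renewed, no flag:       %d byte(s) left\n", length($buf_renewed);
```

If the last line reports bytes left over, the renewed object is ignoring the absence of the flag, which is what makes the read loop spin at eof.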
This particular change would break PerlIO::encoding on older perls[1], but it would be useful if Encode could provide a way for PerlIO::encoding to prevent that behaviour.

I understand the renew() is needed to ensure the state of the encoding is kept, eg. for byte ordering for UTF-16 encodings, so I don't think that can be removed.

This might be the cause of https://rt.cpan.org/Ticket/Display.html?id=124094

Any ideas?

Tony

[1] at least for files that don't end with a partially encoded character
On Mon Apr 08 21:00:51 2019, TONYC wrote:
> This particular change would break PerlIO::encoding on older perls[1],
> but it would be useful if Encode could provide a way for
> PerlIO::encoding to prevent that behaviour.
>
> I understand the renew() is needed to ensure the state of the encoding
> is kept, eg. for byte ordering for UTF-16 encodings, so I don't think
> that can be removed.
>
> This might be the cause of
> https://rt.cpan.org/Ticket/Display.html?id=124094
>
> Any ideas?
Possible solutions:

a) add a stop_at_partial parameter to renew() that defaults to 1, so older perls will see the old behaviour, and new perls can supply zero to make the STOP_AT_PARTIAL flag significant

b) add a REALLY_NO_STOP_AT_PARTIAL flag that overrides the renew controlled flag

c) make it dependent on perl version.

c) would be hard to test

b) is just ugly

I think a) is the best solution, it can be tested in any perl version.

If that makes sense to you I can work on a patch.

Tony
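As a rough illustration of option a), here is a sketch using a stand-in class rather than Encode itself. Everything in it is hypothetical: the MockEncoding package, its stops_at_partial() helper, and the flag value are invented for this example, and renew() in Encode does not currently take such an argument. It only models the proposed rule that a renewed object forces the behaviour by default, while a caller that passes zero lets the per-call flag decide.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an encoding object; NOT part of Encode.
package MockEncoding;

# Arbitrary bit chosen for this mock only.
use constant STOP_AT_PARTIAL => 0x0800;

sub new {
    my ($class) = @_;
    return bless +{ renewed => 0, stop_at_partial => 0 }, $class;
}

# Option a): renew() accepts an optional argument that defaults to 1,
# so existing callers keep the forced behaviour.
sub renew {
    my ($self, $stop_at_partial) = @_;
    $stop_at_partial = 1 unless defined $stop_at_partial;
    my $clone = bless +{ %$self }, ref($self);
    $clone->{renewed}++;
    $clone->{stop_at_partial} = $stop_at_partial ? 1 : 0;
    return $clone;
}

# Would a decode call with these check flags stop at a trailing partial?
sub stops_at_partial {
    my ($self, $check) = @_;
    return ($self->{stop_at_partial} || ($check & STOP_AT_PARTIAL)) ? 1 : 0;
}

package main;

my $base      = MockEncoding->new;
my $old_style = $base->renew;     # older callers: forced on, as today
my $new_style = $base->renew(0);  # newer callers: the per-call flag decides

printf "old-style renew, no flag:   %d\n", $old_style->stops_at_partial(0);
printf "new-style renew, no flag:   %d\n", $new_style->stops_at_partial(0);
printf "new-style renew, with flag: %d\n",
    $new_style->stops_at_partial(MockEncoding::STOP_AT_PARTIAL());
```

Under this rule a caller such as PerlIO::encoding could pass zero and then drop STOP_AT_PARTIAL on the final read, while callers that never pass the argument would see no change.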
On Wed Apr 17 17:13:31 2019, TONYC wrote:
>
> Possible solutions:
>
> a) add a stop_at_partial parameter to renew() that defaults to 1, so
> older perls will see the old behaviour, and new perls can supply zero
> to make the STOP_AT_PARTIAL flag significant
>
> b) add a REALLY_NO_STOP_AT_PARTIAL flag that overrides the renew
> controlled flag
>
> c) make it dependent on perl version.
>
> c) would be hard to test
>
> b) is just ugly
>
> I think a) is the best solution, it can be tested in any perl version.
>
> If that makes sense to you I can work on a patch.
>
> Tony
I don't know who you meant by 'you', but since no one responded, I'll assume it includes me. It does make sense to me.

Note that recent perls have some added infrastructure for partial character handling. These include is_utf8_fixed_width_buf_flags() and is_utf8_valid_partial_char_flags().
Subject: Re: [rt.cpan.org #129086] STOP_AT_PARTIAL forced on for renew()ed encodings (for some anyway)
Date: Wed, 4 Dec 2019 10:45:31 +1100
To: Karl Williamson via RT <bug-Encode [...] rt.cpan.org>
From: tonyc [...] cpan.org
On Tue, Dec 03, 2019 at 06:09:14PM -0500, Karl Williamson via RT wrote:
>
> I don't know who you meant by 'you', but since no one responded, I'll
> assume it includes me. It does make sense to me. Note that recent
> perls have some added infrastructure for partial character handling.
> These include is_utf8_fixed_width_buf_flags() and
> is_utf8_valid_partial_char_flags()
Well, the maintainers of Encode, I don't know if that includes you.

I don't know if this behavior is limited to UTF-8, it's been a long while since I looked at this in detail.

Tony