Bug #126280 for PAR-Packer: 90-rt122949.t fails when "Use Unicode UTF-8 for worldwide language support" is enabled

Wed Aug 15 19:56:57 2018 XENU [...] cpan.org - Ticket created

Subject:	90-rt122949.t fails when "Use Unicode UTF-8 for worldwide language support" is enabled

Windows 10 has recently introduced a new feature: a "Use Unicode UTF-8 for worldwide language support" checkbox in Region Settings. Enabling it globally changes the codepage to 65001 and makes everything use UTF-8. When this checkbox is enabled, 90-rt122949.t fails in the following way:

    ok 110 - successfully ran "C:\Users\xenu\AppData\Local\Temp\CKZfGeKF8p\packed.exe ab"
    not ok 111
    #   Failed test at t/90-rt122949.t line 77.
    #          got: '$VAR1 = [
    #           "a\357\277\275b"
    #         ];
    # '
    #     expected: '$VAR1 = [
    #           "a\205b"
    #         ];
    # '
    # Looks like you failed 1 test of 111.

"\357\277\275" is a REPLACEMENT CHARACTER. It seems that when the UTF-8 checkbox is enabled, bytes that aren't valid UTF-8 are being replaced with that character. "\x{85}" obviously isn't a valid UTF-8 character.

I think that the failing testcase should either be removed or replaced with something that is valid UTF-8.
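The substitution described above can be reproduced in plain Perl without touching the Windows setting. The following is a minimal sketch, assuming only the core Encode module; it illustrates the byte-level effect and makes no claim about how Windows performs the conversion internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The raw byte string from the failing test case: "a", the lone byte 0x85, "b".
my $raw = "a\x85b";

# With the default CHECK value, decode() substitutes U+FFFD for malformed input.
my $chars = decode("UTF-8", $raw);

# Re-encode to bytes and render high bytes as octal escapes, as the test output does.
my $bytes   = encode("UTF-8", $chars);
my $escaped = join "",
    map { $_ > 0x7F ? sprintf("\\%03o", $_) : chr($_) } unpack("C*", $bytes);

print "$escaped\n";
```

The lone byte 0x85 is a UTF-8 continuation byte with no lead byte before it, which is why a strict decoder cannot accept it.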

Thu Aug 16 05:11:13 2018 RSCHUPP [...] cpan.org - Correspondence added

On 2018-08-15 19:56:57, XENU wrote:

> "\357\277\275" is a REPLACEMENT CHARACTER. It seems that when the UTF-8
> checkbox is enabled, bytes that aren't valid UTF-8 are being
> replaced with that character. "\x{85}" obviously isn't a valid UTF-8
> character.

Nope, "\x{85}" is a valid Unicode code point (there's no such thing as a "UTF-8 character"), cf. http://www.unicode.org/charts/PDF/U0080.pdf

For background information, we're in a murky Windows area here: when you call the C-level function (somewhere in the guts of PAR::Packer)

    spawnvp(P_WAIT, "some.exe", argv)

you have to actually manipulate the strings in argv[] so that some.exe actually sees the original argv in its

    main(argc, argv)

The most obvious gotcha is when some argv[i] contains blanks, e.g. "foo bar quux", which will arrive at some.exe as *three* separate elements of argv[], "foo", "bar", "quux". See Win32::ShellQuote for details, that's where I stole most of the test cases from.

Anyway, a 100% solution is probably not possible and "\x{85}", while legal Unicode, isn't a very relevant test case - it's a control char ("NEXT LINE"). So there may be a reason why Microsoft treats it differently under "Use Unicode UTF-8 for worldwide language support". Let's replace this test case with some more relevant cases of strings with non-ASCII chars:

    [ qq[german umlaute \x{E4}\x{F6}\x{FC}] ],
    [ qq[chinese zhongwen \x{4E2D}\x{6587}] ],

Can you rerun the failing test with these modifications under "Use Unicode..."?

Cheers, Roderich
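To make the argv gotcha concrete, here is a simplified sketch of the kind of quoting that has to happen before the strings reach the child process. The helper name quote_arg is hypothetical, and the rules follow the commonly documented behaviour of the Microsoft C runtime's argument parser; this is not PAR::Packer's actual code, and Win32::ShellQuote handles more cases (cmd.exe metacharacters, for instance).

```perl
use strict;
use warnings;

# Hypothetical helper: quote one argument so that a program using the
# Microsoft C runtime parser sees it again as a single argv element.
sub quote_arg {
    my ($arg) = @_;

    # Non-empty arguments without whitespace or double quotes pass unchanged.
    return $arg if length($arg) && $arg !~ /[ \t\n\x0B"]/;

    my $quoted = $arg;
    # Double any run of backslashes that precedes a double quote, then escape the quote.
    $quoted =~ s/(\\*)"/$1$1\\"/g;
    # Double a trailing run of backslashes so it does not escape the closing quote.
    $quoted =~ s/(\\+)\z/$1$1/;

    return qq{"$quoted"};
}

my @argv         = ("foo bar quux", "plain", 'say "hi"');
my $command_line = join " ", map { quote_arg($_) } @argv;
print "$command_line\n";
```

Without the quoting step, the first element would be split at its blanks and arrive as three arguments, which is the failure mode described above.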

Thu Aug 16 05:11:14 2018 The RT System itself - Status changed from 'new' to 'open'

Thu Aug 16 06:22:19 2018 me [...] xenu.pl - Correspondence added

Subject:	Re: [rt.cpan.org #126280] 90-rt122949.t fails when "Use Unicode UTF-8 for worldwide language support" is enabled
Date:	Thu, 16 Aug 2018 12:22:00 +0200
To:	bug-PAR-Packer [...] rt.cpan.org
From:	Tomasz Konojacki <me [...] xenu.pl>

On Thu, 16 Aug 2018 05:11:14 -0400 "Roderich Schupp via RT" <bug-PAR-Packer@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=126280 >
>
> On 2018-08-15 19:56:57, XENU wrote:
> > "\357\277\275" is a REPLACEMENT CHARACTER. It seems that when the UTF-8
> > checkbox is enabled, bytes that aren't valid UTF-8 are being
> > replaced with that character. "\x{85}" obviously isn't a valid UTF-8
> > character.
>
> Nope, "\x{85}" is a valid Unicode code point (there's no such thing as a
> "UTF-8 character"), cf. http://www.unicode.org/charts/PDF/U0080.pdf

Of course U+0085 exists, but it's irrelevant because in this case we're talking about raw bytes. And by "UTF-8 character" I meant "UTF-8 encoded codepoint". "\xc2\x85" (or Encode::encode("UTF-8", "\x85")) would work fine, I have tested that.

> For background information, we're in a murky Windows area here:
> when you call the C-level function (somewhere in the guts of PAR::Packer)
>
>     spawnvp(P_WAIT, "some.exe", argv)
>
> you have to actually manipulate the strings in argv[] so that some.exe
> actually sees the original argv in its
>
>     main(argc, argv)
>
> The most obvious gotcha is when some argv[i] contains blanks, e.g.
> "foo bar quux", which will arrive at some.exe as *three* separate elements of argv[],
> "foo", "bar", "quux". See Win32::ShellQuote for details, that's where I stole
> most of the test cases from.
>
> Anyway, a 100% solution is probably not possible and "\x{85}", while legal Unicode,
> isn't a very relevant test case - it's a control char ("NEXT LINE"). So there may
> be a reason why Microsoft treats it differently under "Use Unicode UTF-8 for worldwide language support".
> Let's replace this test case with some more relevant cases of strings
> with non-ASCII chars:
>
>     [ qq[german umlaute \x{E4}\x{F6}\x{FC}] ],
>     [ qq[chinese zhongwen \x{4E2D}\x{6587}] ],
>
> Can you rerun the failing test with these modifications under "Use Unicode..."?

Both of them fail:

    ok 110 - successfully ran "C:\Users\xenu\AppData\Local\Temp\qn5gz65wHX\packed.exe german umlaute "
    not ok 111
    #   Failed test at t\90-rt122949.t line 79.
    #          got: '$VAR1 = [
    #           "german umlaute \357\277\275\357\277\275\357\277\275"
    #         ];
    # '
    #     expected: '$VAR1 = [
    #           "german umlaute \344\366\374"
    #         ];
    # '
    Wide character in print at C:/Strawberry/perl/lib/Test2/Formatter/TAP.pm line 144.
    ok 112 - successfully ran "C:\Users\xenu\AppData\Local\Temp\qn5gz65wHX\packed.exe chinese zhongwen ??"
    not ok 113
    #   Failed test at t\90-rt122949.t line 79.
    #          got: '$VAR1 = [
    #           "chinese zhongwen \344\270\255\346\226\207"
    #         ];
    # '
    #     expected: '$VAR1 = [
    #           "chinese zhongwen \x{4e2d}\x{6587}"
    #         ];
    # '
    # Looks like you failed 2 tests of 113.

However, if I replace them with

    qq[german umlaute \xc3\xa4\xc3\xb6\xc3\xbc]

and

    qq[chinese zhongwen \xe4\xb8\xad\xe6\x96\x87]

the test passes.

>
> Cheers, Roderich
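The difference between the failing and the passing inputs comes down to whether the raw bytes form well-formed UTF-8. A small sketch, assuming the core Encode module and a hypothetical helper named is_valid_utf8, classifies the byte strings discussed in this message:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes decodes as strict UTF-8 without errors.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my @cases = (
    [ 'Latin-1 bytes for the umlauts' => "\xE4\xF6\xFC" ],
    [ 'UTF-8 bytes for the umlauts'   => "\xC3\xA4\xC3\xB6\xC3\xBC" ],
    [ 'lone byte 0x85'                => "a\x85b" ],
    [ 'UTF-8 bytes for U+0085'        => "a\xC2\x85b" ],
);

for my $case (@cases) {
    my ($label, $bytes) = @$case;
    printf "%-32s %s\n", $label, is_valid_utf8($bytes) ? "valid" : "invalid";
}
```

Only the two UTF-8 encoded strings are reported as valid, which matches the observation that just those survive the round trip with the checkbox enabled.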

Thu Aug 16 07:48:58 2018 RSCHUPP [...] cpan.org - Correspondence added

On 2018-08-16 06:22:19, me@xenu.pl wrote:

> Of course U+0085 exists, but it's irrelevant because in this case we're
> talking about raw bytes.

You're right, I once again got fooled by the idiotic Unicode handling in Perl. Can you add

    use Encode qw( encode );

near the top of 90-rt122949.t and replace the "...\x{85}..." test case with

    [ encode("UTF-8", qq[qq[a\x{85}b]) ],
    [ encode("UTF-8", qq[qq[smiley \x{263A}]) ],
    [ encode("UTF-8", qq[german umlaute \x{E4}\x{F6}\x{FC}]) ],
    [ encode("UTF-8", qq[chinese zhongwen \x{4E2D}\x{6587}]) ],

and rerun the failing test both with and without "Use Unicode..."?

Cheers, Roderich

Thu Aug 16 08:05:30 2018 me [...] xenu.pl - Correspondence added

Subject:	Re: [rt.cpan.org #126280] 90-rt122949.t fails when "Use Unicode UTF-8 for worldwide language support" is enabled
Date:	Thu, 16 Aug 2018 14:05:09 +0200
To:	bug-PAR-Packer [...] rt.cpan.org
From:	Tomasz Konojacki <me [...] xenu.pl>

On Thu, 16 Aug 2018 07:48:59 -0400 "Roderich Schupp via RT" <bug-PAR-Packer@rt.cpan.org> wrote:

> You're right, I once again got fooled by the idiotic Unicode handling in Perl.

Yeah, I hate it too.

> Can you add
>
>     use Encode qw( encode );
>
> near the top of 90-rt122949.t and replace the "...\x{85}..." test case with
>
>     [ encode("UTF-8", qq[qq[a\x{85}b]) ],
>     [ encode("UTF-8", qq[qq[smiley \x{263A}]) ],
>     [ encode("UTF-8", qq[german umlaute \x{E4}\x{F6}\x{FC}]) ],
>     [ encode("UTF-8", qq[chinese zhongwen \x{4E2D}\x{6587}]) ],
>
> and rerun the failing test both with and without "Use Unicode..."?

You made a small typo (doubled 'qq['), but after fixing it, it passes both with checkbox enabled and disabled.

>
> Cheers, Roderich
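For reference, the proposed test cases with the doubled 'qq[' removed might look like the sketch below. It assumes only the core Encode module; the surrounding loop is illustrative and is not taken from 90-rt122949.t. It prints each argument as hex so the UTF-8 byte sequences are visible.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Test cases as proposed in the ticket, with the doubled "qq[" removed.
# encode() turns each character string into a UTF-8 byte string.
my @cases = (
    [ encode("UTF-8", qq[a\x{85}b]) ],
    [ encode("UTF-8", qq[smiley \x{263A}]) ],
    [ encode("UTF-8", qq[german umlaute \x{E4}\x{F6}\x{FC}]) ],
    [ encode("UTF-8", qq[chinese zhongwen \x{4E2D}\x{6587}]) ],
);

# Show every argument as space-separated hex bytes.
for my $case (@cases) {
    my ($arg) = @$case;
    print join(" ", map { sprintf "%02X", $_ } unpack("C*", $arg)), "\n";
}
```

Because every argument is now a well-formed UTF-8 byte string, it is unaffected by the replacement behaviour seen with codepage 65001, and it is still an ordinary byte string when the checkbox is disabled.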

Thu Aug 16 08:13:13 2018 RSCHUPP [...] cpan.org - Correspondence added

On 2018-08-16 08:05:30, me@xenu.pl wrote:

> You made a small typo (doubled 'qq['), but after fixing it, it passes
> both with checkbox enabled and disabled.

Thanks for testing, the fix will be in the next release of PAR::Packer.

Cheers, Roderich

Thu Aug 16 08:14:45 2018 RSCHUPP [...] cpan.org - Status changed from 'open' to 'resolved'