
This queue is for tickets about the XML-PYX CPAN distribution.

Report information
The Basics
Id: 77505
Status: new
Priority: 0/
Queue: XML-PYX

People
Owner: Nobody in particular
Requestors: mmirate [...] gmx.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Encoding mixture problem with "pyx" program
Date: Mon, 28 May 2012 10:37:32 -0400
To: bug-XML-PYX [...] rt.cpan.org
From: "Milo Mirate" <mmirate [...] gmx.com>
When using the "pyx" program from XML-PYX-0.07 on perl v5.14.2 and Linux 3.3.7-1-ARCH on i686, with input piped in containing a variety of entities but no literal non-ASCII characters, each entity is output as UTF-8 only when that entity is not representable using Latin-1. This is a problem because:

    bash$ pyx 2>/dev/null <<<'<root>&#x2668;</root>' | file -
    /dev/stdin: UTF-8 Unicode text
    bash$ pyx 2>/dev/null <<<'<root>&#xaa;</root>' | file -
    /dev/stdin: ISO-8859 text
    bash$ pyx 2>/dev/null <<<'<root>&#x2668;&#xaa;</root>' | file -
    /dev/stdin: Non-ISO extended-ASCII text
    bash$ pyx 2>/dev/null <<<'<root>&#x2668;&#xaa;</root>' | grep ^- | head -c -1 | xxd
    0000000: 2de2 99a8 0a2d aa                        -....-.
                            ^^
                            |
     _______________________/
    /
    | ... this byte should have a <0xC2> byte before it because:

    bash$ unicode U+00AA | fgrep UTF-8: | sed -re 's/  UTF-16.*//'
    UTF-8: c2 aa

Yet it does not (instead, it is preceded by the dash at the beginning of the line). Thus, as `file` points out, the result is neither valid UTF-8 nor valid Latin-1, even though the presence of the properly encoded UTF-8 character means that the entire text should be valid UTF-8.
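
The byte-level claim above can be checked independently of pyx. The following is a minimal sketch in Python (not the Perl of the distribution), assuming the xxd dump in the report is the complete output of the last command. The names `observed`, `expected`, and `is_valid_utf8` are illustrative only and do not come from XML-PYX.

```python
# Bytes copied from the xxd dump in the report:
# "-" + U+2668 encoded as UTF-8, a newline, then "-" + a lone 0xAA byte.
observed = bytes.fromhex("2de299a80a2daa")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What the same two lines look like when both characters are UTF-8 encoded.
expected = "-\u2668\n-\u00aa".encode("utf-8")

print(is_valid_utf8(observed))  # False: 0xAA appears without its 0xC2 lead byte
print(expected.hex())           # 2de299a80a2dc2aa
```

The only difference between the two byte strings is the missing 0xC2 before 0xAA, which is consistent with U+00AA having been written as a single Latin-1 byte while U+2668 was written as UTF-8.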