Subject: Encoding mixture problem with "pyx" program
Date: Mon, 28 May 2012 10:37:32 -0400
To: bug-XML-PYX [...] rt.cpan.org
From: "Milo Mirate" <mmirate [...] gmx.com>
When using the "pyx" program from XML-PYX-0.07 on perl v5.14.2 and Linux 3.3.7-1-ARCH on i686, with piped-in input that contains a variety of entities but no literal non-ASCII characters, each entity is output as UTF-8 only when it is not representable in Latin-1; an entity that is representable comes out as a single Latin-1 byte. This is a problem because the two encodings end up mixed in the same output:
bash$ pyx 2>/dev/null <<<'<root>♨</root>' | file -
/dev/stdin: UTF-8 Unicode text
bash$ pyx 2>/dev/null <<<'<root>ª</root>' | file -
/dev/stdin: ISO-8859 text
bash$ pyx 2>/dev/null <<<'<root>♨ª</root>' | file -
/dev/stdin: Non-ISO extended-ASCII text
bash$ pyx 2>/dev/null <<<'<root>♨ª</root>' | grep ^- | head -c -1 | xxd
0000000: 2de2 99a8 0a2d aa                        -....-.
                        ^^
... this byte should have a <0xC2> byte before it because:
bash$ unicode U+00AA | fgrep UTF-8: | sed -re 's/ UTF-16.*//'
UTF-8: c2 aa
Yet it does not: the byte immediately before it is the dash at the beginning of the line. Thus, as `file` points out, the result is neither valid UTF-8 nor valid Latin-1, even though the presence of a properly encoded UTF-8 character means that the entire text should be valid UTF-8.
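
For anyone triaging this without the `unicode` utility installed, the claim can be checked independently of `file`. The following is only a minimal sketch: it assumes an `iconv` that exits non-zero on an ill-formed input sequence (glibc and libiconv both do), and the names `is_valid_utf8`, `observed` and `expected` are hypothetical helpers made up for this illustration, not part of XML-PYX. The byte sequences are the ones from the xxd dump above, written as octal escapes so that the script does not depend on bash.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# Assumes iconv returns a non-zero status on an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Bytes actually produced: 2d e2 99 a8 0a 2d aa
# ("-", U+2668 as UTF-8, newline, "-", then a bare 0xAA).
observed() { printf '\055\342\231\250\n\055\252'; }

# Bytes one would expect: the same, but with U+00AA encoded as c2 aa.
expected() { printf '\055\342\231\250\n\055\302\252'; }

if observed | is_valid_utf8; then
    echo "observed: valid UTF-8"
else
    echo "observed: NOT valid UTF-8"
fi

if expected | is_valid_utf8; then
    echo "expected: valid UTF-8"
else
    echo "expected: NOT valid UTF-8"
fi
```

The first check should report the observed output as not valid UTF-8, because 0xAA on its own is a continuation byte with no lead byte; the second should pass. The same `is_valid_utf8` check could be applied directly to the output of `pyx` once a fix is in place.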