Bug #77505 for XML-PYX: Encoding mixture problem with "pyx" program

Subject:	Encoding mixture problem with "pyx" program
Date:	Mon, 28 May 2012 10:37:32 -0400
To:	bug-XML-PYX [...] rt.cpan.org
From:	"Milo Mirate" <mmirate [...] gmx.com>

When using the "pyx" program from XML-PYX-0.07 on perl v5.14.2 and Linux 3.3.7-1-ARCH on i686, with input piped in containing a variety of entities but no literal non-ASCII characters, each entity is output as UTF-8 only when that entity is not representable using Latin-1. This is a problem because:

bash$ pyx 2>/dev/null <<<'<root>♨</root>' | file -
/dev/stdin: UTF-8 Unicode text
bash$ pyx 2>/dev/null <<<'<root>ª</root>' | file -
/dev/stdin: ISO-8859 text
bash$ pyx 2>/dev/null <<<'<root>♨ª</root>' | file -
/dev/stdin: Non-ISO extended-ASCII text
bash$ pyx 2>/dev/null <<<'<root>♨ª</root>' | grep ^- | head -c -1 | xxd
0000000: 2de2 99a8 0a2d aa                        -....-.
                        ^^
                        |
   _____________________/
  /
  |
... this byte should have a <0xC2> byte before it because:

bash$ unicode U+00AA | fgrep UTF-8: | sed -re 's/ UTF-16.*//'
UTF-8: c2 aa

Yet it does not (the byte before it is instead the dash at the beginning of the line). Thus, as `file` points out, the result is neither valid UTF-8 nor valid Latin-1, even though the presence of the properly encoded UTF-8 character means that the entire text should be valid UTF-8.
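
The byte-level claim can be checked without pyx installed. The following is only a sketch: it replays the seven bytes from the xxd dump and compares them with what the output would presumably look like if U+00AA were emitted as `c2 aa`. The helper names `reported`, `expected` and `is_utf8` are made up for this illustration, and the check assumes an `iconv` that exits non-zero on invalid input (as glibc's does).

```shell
#!/bin/sh
# Bytes copied from the xxd dump in this report: "-" e2 99 a8 "\n" "-" aa
# (written as octal escapes so that plain POSIX printf can emit them).
reported() { printf '\055\342\231\250\012\055\252'; }

# Assumed correct output: identical, except U+00AA is encoded as c2 aa.
expected() { printf '\055\342\231\250\012\055\302\252'; }

# Succeeds only if stdin is well-formed UTF-8; iconv is used purely as a
# validator here, its converted output is discarded.
is_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

for sample in reported expected; do
    if "$sample" | is_utf8; then
        echo "$sample: valid UTF-8"
    else
        echo "$sample: not valid UTF-8"
    fi
done
```

With such an iconv, the first sample should be rejected (a lone 0xAA is a continuation byte with no lead byte) and the second accepted, which matches what `file` reports above.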