Subject: | Transcoding problems (ISO-2022-JP to UTF-8) |
Dan,
I've encountered an issue whereby Microsoft and other tools working in a
Japanese encoding (JIS X 0208) use the code point 0x2d21 to represent a
circled digit 1 (0x2d22 is circled 2, and so on) and save the result as
ISO-2022-JP.
This is not officially defined in the standards, but according to
Japanese colleagues it is in widespread use.
When transcoding to UTF-8, 0x2d21 is converted to U+2460 (which seems
like a sane thing to do), but on round-tripping back to ISO-2022-JP the
character is lost and is left as \x{2460}.
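To make the mapping I'm expecting concrete, here is a minimal sketch. It
is written in Python rather than Perl purely to show the arithmetic, and
it does not use or imitate Encode's API. It assumes the vendor extension
assigns 0x2d21 through 0x2d34 to the circled digits 1 to 20 (U+2460
through U+2473); the function names are hypothetical.

```python
# Assumption: the vendor-extended row 13 assigns 0x2d21..0x2d34 to the
# circled digits 1..20, i.e. U+2460..U+2473. Names here are hypothetical.
ROW13_FIRST = 0x2D21
CIRCLED_FIRST = 0x2460
CIRCLED_COUNT = 20


def row13_to_unicode(code):
    """Map a two-byte code such as 0x2d21 to its circled-digit character."""
    offset = code - ROW13_FIRST
    if not 0 <= offset < CIRCLED_COUNT:
        raise ValueError("not a row-13 circled digit: %#06x" % code)
    return chr(CIRCLED_FIRST + offset)


def unicode_to_row13(ch):
    """Inverse mapping: circled-digit character back to the two raw bytes."""
    offset = ord(ch) - CIRCLED_FIRST
    if not 0 <= offset < CIRCLED_COUNT:
        raise ValueError("not a circled digit 1..20: %r" % ch)
    code = ROW13_FIRST + offset
    return bytes([code >> 8, code & 0xFF])


if __name__ == "__main__":
    # Round-trip a few codes: two-byte code -> Unicode -> two raw bytes.
    for code in (0x2D21, 0x2D22, 0x2D34):
        ch = row13_to_unicode(code)
        back = unicode_to_row13(ch)
        print("%#06x -> U+%04X -> %s" % (code, ord(ch), back.hex()))
```

The two bytes for 0x2d21 are the ASCII characters "-!", which is the
pair that appears in the Japanese-mode part of the attached sample.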
I've attached a sample file of Japanese text. The roundtrip code should
be trivial.
(The use case here, for reference, is that I have text/html in
ISO-2022-JP and need to convert it to UTF-8 prior to parsing, else the
parser tends to fall over on any 0x3c ('<') bytes in the stream that
come from Japanese mode. I then need to convert it back to ISO-2022-JP
prior to display.)
Is there any way of getting Encode to use a 'quirks' mode or similar to
allow the non-standard code points to be converted back to Microsoft's
version of JIS X 0208?
Oh, this is the Encode that comes as standard with both perl 5.8.4 and
5.8.8 - it's a custom build running on RHEL 3.0.
Thanks,
Ben
Subject: | hy-new-2006-06-13-1.txt |
test1$B-!-!%F%9%H(B
test2$B- -!-"-#-$-%-&-'-(-)-*(B