This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 19867
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: Ben.Evans [...] morganstanley
com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: Transcoding problems (ISO-2022-JP to UTF-8)
Dan,

I've encountered an issue whereby Microsoft and other tools working in a Japanese encoding (JIS X 0208) use the code point 0x2d21 to represent a circled digit 1 (0x2d22 is circled 2, etc.) and save the result as ISO-2022-JP.

This is not officially defined in the standards, but according to Japanese colleagues it is in widespread usage.

When transcoding to UTF-8, 0x2d21 is converted to U+2460 (which seems like a sane thing to do), but on round-tripping back to ISO-2022-JP this code point is lost and is left as the literal escape \x{2460}.

I've attached a sample file of Japanese text. The roundtrip code should be trivial.

(The use case here, for reference, is that I have text/html in ISO-2022-JP and need to convert it to UTF-8 prior to parsing, else the parser tends to fall over any 0x3c bytes in the stream that come from Japanese mode. I then need to get it back to ISO-2022-JP prior to display.)

Is there any way of getting Encode to use a 'quirks' mode or similar to allow the non-standard code points to be converted back to MSFT's version of JIS X 0208?

Oh, this is the Encode which comes as standard with both perl 5.8.4 and 5.8.8. It's a custom build running on RHEL 3.0.

Thanks,

Ben
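For reference, the round trip described above can be sketched roughly as follows. This is a minimal sketch and not the code actually used: the `roundtrip` helper and the sample string are illustrative assumptions (the sample assumes the attachment's ESC bytes were dropped when it was displayed), and it relies only on the core Encode calls `decode`, `encode`, `FB_PERLQQ` and `LEAVE_SRC`. What it prints for 0x2d21 depends on the Encode version installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode ISO-2022-JP octets to a Perl character
# string, then encode that string back to ISO-2022-JP.
# FB_PERLQQ renders unmappable data as \xHH / \x{HHHH} escapes so the
# loss is visible; LEAVE_SRC keeps the input arguments unmodified.
sub roundtrip {
    my ($octets) = @_;
    my $check = Encode::FB_PERLQQ() | Encode::LEAVE_SRC();
    my $chars = decode('iso-2022-jp', $octets, $check);
    my $back  = encode('iso-2022-jp', $chars,  $check);
    return ($chars, $back);
}

# ESC $ B switches to JIS X 0208 and ESC ( B switches back to ASCII.
# In JIS X 0208 mode the byte pair "-!" is the code point 0x2d21.
my $sample = "test1\e\$B-!\e(B";
my ($chars, $back) = roundtrip($sample);

# Print code points and a visible ESC marker rather than raw bytes.
printf "decoded:   %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
(my $visible = $back) =~ s/\e/<ESC>/g;
printf "reencoded: %s\n", $visible;
printf "lossless:  %s\n", ($back eq $sample ? 'yes' : 'no');
```

If the final line reports "no", the re-encoded octets differ from the input, which is the loss this report is about.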
Subject: hy-new-2006-06-13-1.txt
test1$B-!-!%F%9%H(B test2$B- -!-"-#-$-%-&-'-(-)-*(B
On Tue Jun 13 07:04:16 2006, guest wrote:
> Dan,
>
> I've encountered an issue whereby Microsoft and other tools working in a
> Japanese encoding (JIS X 0208) will use the code point 0x2d21 to
> represent a character comprising a circled digit 1 (and 0x2d22 is
> circle-2, etc) and save the result as ISO-2022-JP.
>
> This is not officially defined in the standards, but according to
> Japanese colleagues is in widespread usage.
Not official, therefore not supported, period. This is especially true when it comes to ISO-2022-JP.
> When transcoding to UTF-8 0x2d21 is converted to U+2460 (which seems like
> a sane thing to do), but on round-tripping back to ISO-2022-JP, this
> code point is lost, and is left as \x{2460}.
It's considered malformed in ISO-2022-JP so Encode is acting correctly.
> I've attached a sample file of Japanese text. The roundtrip code should
> be trivial.
>
> (The use case here, for reference, is that I have text/html in
> ISO-2022-JP and need to convert it to UTF-8 prior to parsing, else the
> parser tends to fall over any 0x3c bytes in the stream which are coming
> from Japanese mode, and then need to get it back to ISO-2022-JP prior to
> display).
>
> Is there any way of getting Encode to use a 'quirks' mode or similar to
> allow the non-standard code points to be converted back to MSFT's
> version of JIS X 0208?
Instead of a quirks mode, Encode offers various ways to add more encodings. See Encode::Encodings, enc2xs and Encode::EUCJPMS for details. I also recommend that you talk to naruse-san, who maintains Encode::EUCJPMS, about this problem.

My position is that I cannot change Encode based upon your rationale, since there are an equal or even greater number of people AGAINST M$-ing ISO-2022-JP. So write an Encode extension that suits your needs, perhaps with naruse-san, and share it with the rest via CPAN.

Case dismissed.
> Oh, this is the Encode which comes as standard with both perl 5.8.4 and
> 5.8.8 - it's a custom build running on RHEL 3.0
>
> Thanks,
>
> Ben
Dan the Encode Maintainer