Bug #19867 for Encode: Transcoding problems (ISO-2022-JP to UTF-8)

Tue Jun 13 07:04:16 2006 Guest - Ticket created

Subject:

Transcoding problems (ISO-2022-JP to UTF-8)

Dan,

I've encountered an issue whereby Microsoft and other tools working in a Japanese encoding (JIS X 0208) use the code point 0x2d21 to represent a circled digit 1 (0x2d22 is circled 2, etc.) and save the result as ISO-2022-JP.

This is not officially defined in the standards, but according to Japanese colleagues it is in widespread use.

When transcoding to UTF-8, 0x2d21 is converted to U+2460 (which seems like a sane thing to do), but on round-tripping back to ISO-2022-JP this code point is lost and is left as \x{2460}.

I've attached a sample file of Japanese text. The round-trip code should be trivial.

(The use case here, for reference, is that I have text/html in ISO-2022-JP and need to convert it to UTF-8 prior to parsing, else the parser tends to fall over on any 0x3c bytes in the stream which come from Japanese mode; I then need to get it back to ISO-2022-JP prior to display.)

Is there any way of getting Encode to use a 'quirks' mode or similar to allow the non-standard code points to be converted back to MSFT's version of JIS X 0208?

Oh, this is the Encode which comes as standard with both perl 5.8.4 and 5.8.8 - it's a custom build running on RHEL 3.0.

Thanks,

Ben

Subject:

hy-new-2006-06-13-1.txt

test1$B-!-!%F%9%H(B test2$B- -!-"-#-$-%-&-'-(-)-*(B
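The round trip described in the report can be sketched roughly as follows. This is a minimal reproduction, not the reporter's actual code: the input string is an assumption modelled on the sample file, with the ESC bytes (which do not survive in the listing above) written out explicitly. The script only prints what it finds, since the exact result depends on the installed Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
# The two bytes 0x2d 0x21 ("-!") are the vendor-specific circled digit 1.
my $jis = "test1\e\$B-!\e(B";

# Decode to Perl characters, then encode straight back.
my $text = decode('iso-2022-jp', $jis);
my $back = encode('iso-2022-jp', $text);

# Show the decoded code points and a printable form of the re-encoded bytes.
printf "decoded:    %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } split //, $text;
(my $printable = $back) =~ s/\e/<ESC>/g;
print "re-encoded: $printable\n";
print $back eq $jis ? "round trip ok\n" : "round trip changed the data\n";
```

According to the report, the decode step yields U+2460 while the encode step leaves a literal \x{2460} in the output.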

Tue Jun 13 07:18:39 2006 DANKOGAI [...] cpan.org - Correspondence added

On Tue Jun 13 07:04:16 2006, guest wrote:

> Dan,
>
> I've encountered an issue whereby Microsoft and other tools working in a
> Japanese encoding (JIS X 0208) use the code point 0x2d21 to
> represent a circled digit 1 (0x2d22 is circled 2, etc.) and save the
> result as ISO-2022-JP.
>
> This is not officially defined in the standards, but according to
> Japanese colleagues it is in widespread use.

Not official therefore not supported, period. This is true especially when it comes to ISO-2022-JP.

> When transcoding to UTF-8, 0x2d21 is converted to U+2460 (which seems like
> a sane thing to do), but on round-tripping back to ISO-2022-JP this
> code point is lost and is left as \x{2460}.

It's considered malformed in ISO-2022-JP so Encode is acting correctly.

> I've attached a sample file of Japanese text. The round-trip code should
> be trivial.
>
> (The use case here, for reference, is that I have text/html in
> ISO-2022-JP and need to convert it to UTF-8 prior to parsing, else the
> parser tends to fall over on any 0x3c bytes in the stream which come
> from Japanese mode; I then need to get it back to ISO-2022-JP prior to
> display.)
>
> Is there any way of getting Encode to use a 'quirks' mode or similar to
> allow the non-standard code points to be converted back to MSFT's
> version of JIS X 0208?

Instead of a quirks mode, Encode offers various ways to add more encodings. See Encode::Encodings, enc2xs and Encode::EUCJPMS for details. I also recommend that you talk to naruse-san, who maintains Encode::EUCJPMS, about this problem. My position is that I cannot change Encode based upon your rationale, since there is an equal or even greater number of people AGAINST M$-ing ISO-2022-JP. So write an Encode extension that suits your needs, perhaps with naruse-san, and share it with the rest via CPAN. Case dismissed.
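Short of a full enc2xs-built extension, a caller-side workaround might look like the sketch below. It is not an Encode extension and not something proposed in this ticket: `encode_jis_with_circled` is a hypothetical helper, and the mapping of U+2460..U+2473 onto bytes 0x2d21..0x2d34 is an assumption extrapolated from the "0x2d21 is circle-1, 0x2d22 is circle-2" description in the report. Everything else is passed to the stock iso-2022-jp encoder.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode to ISO-2022-JP, but emit the non-standard
# row-13 byte pairs for circled numbers by hand.
# ASSUMPTION: U+2460..U+2473 correspond to 0x2d21..0x2d34.
sub encode_jis_with_circled {
    my ($text) = @_;
    my $out = '';

    # The capturing group makes split return the circled-number runs too.
    for my $run (split /([\x{2460}-\x{2473}]+)/, $text) {
        next if $run eq '';
        if ($run =~ /^[\x{2460}-\x{2473}]/) {
            # Build the two-byte codes and wrap them in JIS X 0208 escapes.
            my $bytes = join '',
                map { pack 'CC', 0x2d, 0x21 + ord($_) - 0x2460 } split //, $run;
            $out .= "\e\$B" . $bytes . "\e(B";
        }
        else {
            # Ordinary text goes through the standard encoder unchanged.
            $out .= encode('iso-2022-jp', $run);
        }
    }
    return $out;
}
```

Because each piece is encoded separately and returns to ASCII at its end, the output can contain redundant escape sequences wherever a circled number sits next to other Japanese text. That costs a few bytes but should still be readable by tools that accept the vendor code points.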

> Oh, this is the Encode which comes as standard with both perl 5.8.4 and
> 5.8.8 - it's a custom build running on RHEL 3.0
>
> Thanks,
>
> Ben

Dan the Encode Maintainer

Tue Jun 13 07:18:40 2006 The RT System itself - Status changed from 'new' to 'open'

Tue Jun 13 07:18:46 2006 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'