Subject: bug in RTF::Parser
Date: Fri, 13 Feb 2009 14:51:04 -0700
To: <bug-RTF-Parser [...] rt.cpan.org>
From: <jferguson [...] micron.com>
I'm calling RTF::TEXT::Converter, which in turn calls RTF::Parser. When the RTF text contains Japanese Shift-JIS (SJIS) characters, some of the bytes are corrupted because they are translated by the ansi.pm module.
As an example, I can pass in the following RTF string:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fscript\fprq2\fcharset0 Comic Sans MS;}{\f1\froman\fprq1\fcharset128 MS PGothic;}}{\colortbl ;\red0\green0\blue128;}{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 test \cf0\lang1041\f1\fs20\'83\'65\'83\'58\'83\'67\cf1\lang1033\f0\fs20\par}
The results I get back are:
test fefXfg
0x00000000 (00000) 74657374 20666566 5866670a test fefXfg.
Each 'f' (0x66) in the output should be 0x83. For whatever reason, the ansi.pm file translates 0x83 to an 'f', which corrupts the resulting SJIS string.
As a test, I removed the entry for 83 from ansi.pm, and the resulting string then contains the correct 0x83 byte:
test âeâXâg
0x00000000 (00000) 74657374 20836583 5883670a test .e.X.g.
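To illustrate why the 0x83 byte matters, here is a small standalone sketch in Python. It is not code from RTF::Parser, and the variable names are mine. It decodes the six escaped bytes from the example once as Windows-1252 and once as Shift-JIS. In Windows-1252, 0x83 is the character U+0192 (small f with hook), which may be why ansi.pm maps it to a plain 'f'. In Shift-JIS, the same byte is the lead byte of a two-byte character.

```python
# The six bytes written as \'83\'65\'83\'58\'83\'67 in the RTF example.
raw = bytes([0x83, 0x65, 0x83, 0x58, 0x83, 0x67])

# Read as Windows-1252, every byte is its own character;
# 0x83 becomes U+0192 (LATIN SMALL LETTER F WITH HOOK).
as_cp1252 = raw.decode("cp1252")

# Read as Shift-JIS, each 0x83 is a lead byte and combines with the
# byte that follows it to form a single katakana character.
as_sjis = raw.decode("shift_jis")

print(as_cp1252)
print(as_sjis)
```

Replacing 0x83 with 0x66 therefore destroys the lead byte of every two-byte character in the run, so the string can no longer be decoded as Shift-JIS.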
There needs to be some sort of check so that when \lang1041 is in effect, the parser does not attempt to translate these bytes to something else.
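The kind of check I have in mind could look roughly like the sketch below. It is written in Python purely as an illustration and is not a patch for RTF::Parser. The helper decode_hex_escapes and the LANG_CODECS table are hypothetical names, and the table only covers the two language IDs that appear in the example. A real fix might instead key off the \fcharset128 value of the active font, which also identifies Shift-JIS.

```python
import re

# Hypothetical mapping from RTF \langN values to Python codec names.
# Only the two language IDs from the example are listed (an assumption).
LANG_CODECS = {1033: "cp1252", 1041: "shift_jis"}

# Matches a run of one or more consecutive \'hh hex escapes.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(fragment, lang, default="cp1252"):
    """Decode each run of \\'hh escapes using the codec chosen for `lang`.

    The whole run is decoded at once so that multi-byte characters
    are not split into separately translated single bytes.
    """
    codec = LANG_CODECS.get(lang, default)

    def decode_run(match):
        hex_digits = match.group(0).replace("\\'", "")
        return bytes.fromhex(hex_digits).decode(codec)

    return HEX_RUN.sub(decode_run, fragment)


escaped = r"\'83\'65\'83\'58\'83\'67"
print(decode_hex_escapes(escaped, 1041))  # decoded as Shift-JIS
print(decode_hex_escapes(escaped, 1033))  # decoded as Windows-1252
```

The important design point is that the bytes are collected first and decoded as a group according to the active language or character set, rather than being mapped one byte at a time through a single ANSI table.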