Bug #48322 for Encode: encode/decode fails with certain constructs

Thu Jul 30 08:54:15 2009 mr [...] cvt.dk - Ticket created

Subject: encode/decode fails with certain constructs

I am building a .ucm file for converting entities (&...;) into Unicode.

But after running enc2xs and compiling it, the resulting encoding module won't convert all of my entities to Unicode.
Multiple Unicode values aren't converted correctly into an entity if there is a shorter form available as well.
E.g. U0030+U0304 -> &0macr; doesn't work if there is a U0030 -> \x30 mapping.

I have attached a minimal test case of the problems, including a test module that will fail validation on the module. To make the test pass, you can delete the line that maps "&acedil;" and the line mapping U0030 -> \x30.

I am testing this on Debian Lenny, which contains Perl 5.10 and Encode 2.23.

Attachment: encode_fail.tar.gz (application/x-gzip, 4.2k)

Thu Jul 30 09:46:20 2009 DANKOGAI [...] cpan.org - Correspondence added

On Thu Jul 30 08:54:15 2009, MortenRoenne wrote:

> I am building a .ucm file for converting entities (&...;) into Unicode.
>
> But after running enc2xs and compiling it, the resulting encoding module
> won't convert all of my entities to Unicode.
> Multiple Unicode values aren't converted correctly into an entity if there
> is a shorter form available as well.
> E.g. U0030+U0304 -> &0macr; doesn't work if there is a U0030 -> \x30
> mapping.

That is no wonder. See perldoc enc2xs:

> Coping with duplicate mappings
>
> When you create a map, you SHOULD make your mappings round-trip safe.
> That is, "encode('your-encoding', decode('your-encoding', $data)) eq
> $data" stands for all characters that are marked as "|0". Here is how
> to make sure:
>
> * Sort your map in Unicode order.
> * When you have a duplicate entry, mark either one with '|1' or '|3'.
> * And make sure the '|1' or '|3' entry FOLLOWS the '|0' entry.
>
> Here is an example from big5-eten.
>
>   <U2550> \xF9\xF9 |0
>   <U2550> \xA2\xA4 |3
>
> Internally Encoding -> Unicode and Unicode -> Encoding Map looks like
> this;
>
>   E to U               U to E
>   --------------------------------------
>   \xF9\xF9 => U2550    U2550 => \xF9\xF9
>   \xA2\xA4 => U2550
>
> So it is round-trip safe for \xF9\xF9. But if the line above is upside
> down, here is what happens.
>
>   E to U               U to E
>   --------------------------------------
>   \xA2\xA4 => U2550    U2550 => \xF9\xF9
>   (\xF9\xF9 => U2550 is now overwritten!)
>
> The Encode package comes with ucmlint, a crude but sufficient utility
> to check the integrity of a UCM file. Check under the Encode/bin
> directory for this.
>
> When in doubt, you can use ucmsort, yet another utility under
> Encode/bin directory.
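The round-trip rule quoted above can be illustrated with a toy model in Python. This is only a sketch, not how enc2xs builds its tables: `build_tables` and `round_trip_failures` are hypothetical helper names, and the model assumes the usual UCM flag meanings (|0 is used in both directions, |1 only for Unicode -> encoding, |3 only for encoding -> Unicode) and that a later entry for the same key replaces an earlier one.

```python
def build_tables(entries):
    """Build toy E-to-U and U-to-E tables.

    entries: list of (codepoint, encoded_bytes, flag) in file order.
    Assumed flag meanings: 0 = round-trip, 1 = encode only, 3 = decode only.
    """
    e_to_u, u_to_e = {}, {}
    for cp, enc, flag in entries:
        if flag in (0, 3):
            e_to_u[enc] = cp
        if flag in (0, 1):
            u_to_e[cp] = enc  # a later entry replaces an earlier one
    return e_to_u, u_to_e


def round_trip_failures(entries):
    """Return the |0 byte sequences for which encode(decode(x)) != x."""
    e_to_u, u_to_e = build_tables(entries)
    return [enc for cp, enc, flag in entries
            if flag == 0 and u_to_e.get(e_to_u.get(enc)) != enc]


# The big5-eten example from the quoted documentation.
good = [(0x2550, b"\xF9\xF9", 0), (0x2550, b"\xA2\xA4", 3)]
# The same duplicate with both lines marked |0.
bad = [(0x2550, b"\xF9\xF9", 0), (0x2550, b"\xA2\xA4", 0)]

print(round_trip_failures(good))  # []
print(round_trip_failures(bad))   # [b'\xf9\xf9']
```

In this model the second list fails because both byte sequences claim U2550 in the U-to-E direction and only one of them can win.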

UCM-based encodings use what is called the 'code page' technique, and your case does not get along very well with it. That is one of the reasons why some encodings like gsm0338 are not UCM-based even though they look simple.

Dan the Encode Maintainer
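For comparison, a hand-written converter can handle the reported case by matching the longest code point sequence first. The Python sketch below only illustrates that idea; it is not Encode's implementation or API. `encode_entities` and its two-entry table are hypothetical, with the table built from the two mappings named in the report.

```python
def encode_entities(text, table):
    """Greedy longest-match encoder over code point sequences.

    table maps strings of one or more code points to output strings.
    """
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first so that U+0030 U+0304 wins
        # over the shorter U+0030 mapping.
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X}")
    return "".join(out)


# Hypothetical table holding only the two mappings from the report.
table = {"0": "0", "0\u0304": "&0macr;"}

print(encode_entities("0\u0304", table))  # &0macr;
print(encode_entities("00", table))       # 00
```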

Thu Jul 30 09:46:21 2009 The RT System itself - Status changed from 'new' to 'open'

Thu Jul 30 09:46:21 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'