On Thu Jul 30 08:54:15 2009, MortenRoenne wrote:
> I am building a .ucm file for converting entities (&...;) into unicode.
>
> But after running enc2xs and compiling it, the resulting encoding module
> won't convert all of my entities to unicode.
> Multiple unicode values aren't converted correct into an entity if there
> is a shorter form available as well.
> E.g. U0030+U0304 -> &0macr; doesn't work if there is a U0030 -> \x30
> mapping.
That is no surprise. From
perldoc enc2xs:
> Coping with duplicate mappings
> When you create a map, you SHOULD make your mappings round-trip safe.
> That is, "encode('your-encoding', decode('your-encoding', $data)) eq
> $data" stands for all characters that are marked as "|0". Here is how
> to make sure:
>
> * Sort your map in Unicode order.
> * When you have a duplicate entry, mark either one with '|1' or '|3'.
> * And make sure the '|1' or '|3' entry FOLLOWS the '|0' entry.
>
> Here is an example from big5-eten.
>
> <U2550> \xF9\xF9 |0
> <U2550> \xA2\xA4 |3
>
> Internally Encoding -> Unicode and Unicode -> Encoding Map looks like
> this;
>
>   E to U               U to E
>   --------------------------------------
>   \xF9\xF9 => U2550    U2550 => \xF9\xF9
>   \xA2\xA4 => U2550
>
> So it is round-trip safe for \xF9\xF9. But if the line above is upside
> down, here is what happens.
>
>   E to U               U to E
>   --------------------------------------
>   \xA2\xA4 => U2550    U2550 => \xF9\xF9
>   (\xF9\xF9 => U2550 is now overwritten!)
>
> The Encode package comes with ucmlint, a crude but sufficient utility
> to check the integrity of a UCM file. Check under the Encode/bin
> directory for this.
>
> When in doubt, you can use ucmsort, yet another utility under
> Encode/bin directory.
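
The round-trip condition quoted above can be checked mechanically. Here is a minimal sketch in Python, not a replacement for ucmlint. It assumes the fallback flags mean what perldoc enc2xs says (|0 round trip, |1 Unicode -> encoding only, |3 encoding -> Unicode only), it handles only single-code-point lines, and it models each direction as a plain last-write-wins dictionary, which is a simplification and not how enc2xs builds its tables. The names parse_ucm and round_trip_failures are made up for this example.

```python
import re

# One UCM mapping line: <Uxxxx> \xHH\xHH |flag
UCM_LINE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])"
)


def parse_ucm(text):
    """Return (char, bytes, flag) tuples for every mapping line found."""
    entries = []
    for line in text.splitlines():
        m = UCM_LINE.search(line)
        if m:
            char = chr(int(m.group(1), 16))
            hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
            raw = bytes(int(h, 16) for h in hex_pairs)
            entries.append((char, raw, int(m.group(3))))
    return entries


def round_trip_failures(entries):
    """List the |0 byte sequences that do not survive decode-then-encode."""
    e2u = {}  # encoding -> Unicode: built from |0 and |3 entries
    u2e = {}  # Unicode -> encoding: built from |0 and |1 entries
    for char, raw, flag in entries:
        if flag in (0, 3):
            e2u[raw] = char
        if flag in (0, 1):
            u2e[char] = raw  # simplified model: a later entry replaces an earlier one
    return [raw for char, raw, flag in entries
            if flag == 0 and u2e.get(e2u[raw]) != raw]


# The big5-eten pair from the documentation: the duplicate is marked |3.
good = r"""
<U2550> \xF9\xF9 |0
<U2550> \xA2\xA4 |3
"""

# The same pair with both entries marked |0: one of them cannot round-trip.
both_zero = r"""
<U2550> \xF9\xF9 |0
<U2550> \xA2\xA4 |0
"""

print(round_trip_failures(parse_ucm(good)))
print(round_trip_failures(parse_ucm(both_zero)))
```

With the duplicate marked |3 nothing is reported. With both entries marked |0, this model reports \xF9\xF9 as not round-trip safe.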
UCM-based encodings use what is called the 'code page' technique, and your case does not get
along with it very well. That is one of the reasons why some encodings, like gsm0338,
are not UCM-based even though they look simple.
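
To illustrate the kind of hand-written logic a case like yours needs, here is a minimal sketch in Python of a longest-match encoder next to a naive one-character-at-a-time lookup. It is only an illustration: the table holds just the two mappings from your report, the function names are made up, and the per-character function is a simplified stand-in for table-driven conversion, not a description of what enc2xs generates.

```python
# Hypothetical entity table holding only the two mappings from the report.
ENTITIES = {
    "0\u0304": "&0macr;",  # U+0030 U+0304 -> &0macr;
    "0": "0",              # U+0030 -> \x30
}
MAX_KEY = max(len(key) for key in ENTITIES)


def encode_longest_match(text):
    """Replace the longest known sequence at each position."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in ENTITIES:
                out.append(ENTITIES[chunk])
                i += size
                break
        else:
            # No mapping starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def encode_per_character(text):
    """Simplified model of a lookup that sees one character at a time."""
    return "".join(ENTITIES.get(ch, ch) for ch in text)


sample = "0\u0304 0"
print(encode_longest_match(sample))
print(ascii(encode_per_character(sample)))
```

The longest-match version turns U+0030 U+0304 into &0macr; and still maps a lone U+0030 to \x30. The per-character version never sees the two-character sequence, so the combining macron passes through untouched.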
Dan the Encode Maintainer