This queue is for tickets about the Unicode-Map CPAN distribution.

Report information
The Basics
Id: 16734
Status: new
Priority: 0/
Queue: Unicode-Map

People
Owner: Nobody in particular
Requestors: gian [...] csoft.co.uk
ntyni [...] iki.fi
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: to_utf8 cannot convert 1 byte characters from Big5
I am running Perl v5.8.2 built for i686-linux on Redhat 9.0.

I am trying to convert Big5 characters to utf8 using the to_utf8 function in Unicode-MapUTF8. This works fine for 2-byte characters. If I pass a 1-byte character such as the digit 0, 1 or 2, it does not return a utf8 character. I believe that some characters in Big5 are represented as 1 byte.

### FOR EXAMPLE

# string consisting of three Big5 characters 0xA540, 0xA541, 0x30
$STR = "\xA5\x40\xA5\x41\x30";
$NEW_STR = to_utf8({ -string=>$STR,-charset=>'Big5'});

# The above returns the utf8 representations of only the first two chinese
# characters, but fails to convert the third.

---------------

From http://www.fifi.org/cgi-bin/man2html/usr/share/man/man7/charsets.7.gz

Big5 is a popular character set in Taiwan to express traditional Chinese. (Big5 is both a character set and an encoding.) It is a superset of US ASCII. Non-ASCII characters are expressed in two bytes. Bytes 0xa1-0xfe are used as leading bytes for two-byte characters. Big5 and its extension is widely used in Taiwan and Hong Kong. It is not ISO 2022-compliant.
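The byte-length rule quoted from charsets(7) can be illustrated with a small sketch. It only splits a Big5 byte string into per-character byte sequences and does no conversion to Unicode. The helper name split_big5 is hypothetical and is not part of Unicode::Map or Unicode::MapUTF8. Treating every byte outside 0xA1-0xFE as a one-byte character is an assumption taken from the quoted text.

```perl
use strict;
use warnings;

# Split a Big5 byte string into per-character byte sequences.
# Assumption (from the charsets(7) text quoted above): a byte in 0xA1-0xFE
# starts a two-byte character; any other byte is a one-byte character.
# split_big5 is a hypothetical helper, not an API of any CPAN module.
sub split_big5 {
    my ($bytes) = @_;
    my @chars;
    my $i = 0;
    while ($i < length $bytes) {
        my $lead = ord substr($bytes, $i, 1);
        # Take two bytes only if this is a lead byte and a trail byte exists.
        my $len = ($lead >= 0xA1 && $lead <= 0xFE && $i + 1 < length $bytes) ? 2 : 1;
        push @chars, substr($bytes, $i, $len);
        $i += $len;
    }
    return @chars;
}

# The string from the report: two two-byte characters followed by ASCII "0".
my @chars = split_big5("\xA5\x40\xA5\x41\x30");
print join(' ', map { uc unpack('H*', $_) } @chars), "\n";   # prints "A540 A541 30"
```

Under that rule the sample string contains three characters, so a complete Big5 map needs an entry for the one-byte 0x30 as well as for the two-byte sequences.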
Subject: BIG5 map is missing one-byte ASCII characters 0-127
Hi,

the BIG5 map distributed in Unicode::Map 0.112 (Map/EASTASIA/BIG5.map) is missing the characters 0-127, which are the same as the respective ASCII characters.

The actual error is in the original input file, currently at <ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT>. Since that file is now considered obsolete by the Unicode Consortium, I suppose they are not interested in updating it.

Please consider updating the mapping in Unicode::Map.

I'm attaching a patch for t/map.t which exhibits the behaviour. It makes the tests fail if the ASCII characters are not included in the Big5 map.

I have solved the problem for Debian by inserting the following hex strings in the binary map at offset 12:

"\x0\x8\x0"          # partial key-value mappings
"\x8\x1\x10\x1"      # input 1 char of 8-bits at a time, output 1 char of 16 bits
"\x80\x0\x80\x0\x0"  # 128 characters starting at 0x00 -> 128 chars starting at 0x0000
"\x0\x0\x0"          # end of submap

[ This bug was originally reported as CPAN bug 5385, <http://rt.cpan.org/NoAuth/Bug.html?id=5385>, against Unicode-MapUTF8, and Debian bug #320406 <http://bugs.debian.org/320406>.]

Regards,
-- 
Niko Tyni (on behalf of the Debian Perl Group)
ntyni@iki.fi
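For readers who want to see what the inserted bytes amount to, here is a minimal sketch that concatenates the four hex strings from the message above and inserts them at offset 12 of a buffer. It runs against a dummy buffer, not a real BIG5.map. The helper name insert_at is hypothetical, and reading "inserting" as a pure insertion (nothing overwritten, no other header field adjusted) is an assumption based on the wording of the report.

```perl
use strict;
use warnings;

# The four byte strings from the report, concatenated in order.
# "\x00" is the same byte as the "\x0" spelling used in the report.
my $submap =
      "\x00\x08\x00"            # partial key-value mappings
    . "\x08\x01\x10\x01"        # input 1 char of 8 bits, output 1 char of 16 bits
    . "\x80\x00\x80\x00\x00"    # 128 chars from 0x00 -> 128 chars from 0x0000
    . "\x00\x00\x00";           # end of submap

# Return a copy of $data with $patch inserted at $offset.
# A zero-length 4-argument substr inserts without overwriting.
# insert_at is a hypothetical helper for illustration only.
sub insert_at {
    my ($data, $offset, $patch) = @_;
    substr($data, $offset, 0, $patch);
    return $data;
}

# Demonstration on a dummy 16-byte buffer, NOT on a real map file.
my $dummy   = "M" x 16;
my $patched = insert_at($dummy, 12, $submap);
printf "%d bytes inserted, %d -> %d total\n",
    length $submap, length $dummy, length $patched;
```

Applying this to an actual map file would mean reading and writing it in binary mode and keeping a backup; whether the format needs anything beyond these bytes is not stated in the report.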
diff -urN libunicode-map-perl-0.112/t/map.t libunicode-map-perl-0.112-big5/t/map.t
--- libunicode-map-perl-0.112/t/map.t	2001-01-07 23:51:18.000000000 +0200
+++ libunicode-map-perl-0.112-big5/t/map.t	2005-12-26 19:33:20.566730033 +0200
@@ -6,7 +6,7 @@
 # Change 1..1 below to 1..last_test_to_print .
 # (It may become useful if the test is moved to ./t subdirectory.)
 
-BEGIN { $| = 1; print "1..5\n"; }
+BEGIN { $| = 1; print "1..6\n"; }
 END {print "not ok 1\n" unless $loaded;}
 use Unicode::Map;
 $loaded = 1;
@@ -27,6 +27,7 @@
    ["GB2312",   "n->m: GB2312 (GB2312-80^8080 + ISO8859-1)"],
    ["DEVANAGA", "n->m: DEVANAGA"],
    ["EUC_JP",   "n->m: EUC-JP"],
+   ["BIG5",     "n->m: BIG5"],
 );
 
 {
@@ -133,6 +134,21 @@
    return testMapping ( "APPLE-DEVANAGA", $_locale, $_unicode );
 }
 
+sub BIG5 {
+   my $_locale =
+      "\xA5\x40"
+      ."\xA5\x41"
+      ."\x30"
+      ."  "
+   ;
+   my $_unicode =
+      "\x4E\x16"
+      ."\x4E\x15"
+      ."\x00\x30\x00\x20\x00\x20"
+   ;
+   return testMapping ( "BIG5", $_locale, $_unicode );
+}
+
 sub testMapping {
    my ( $charsetId, $txtLocale, $txtUnicode ) = @_;
    return 0 if ! ( my $Map = new Unicode::Map($charsetId) );
From: ntyni [...] iki.fi
> I am trying to convert Big5 characters to utf8 using the to_utf8 function
> in Unicode-MapUTF8. This seems to work fine for 2 byte characters.
> If I try to pass a 1 byte character like the digit 0 or 1 or 2 it
> does not return a utf8 character. I believe that some characters in
> Big5 are represented as 1 byte.

Hi,

this bug is actually in the Unicode::Map module. I have re-reported it as CPAN bug 16734, <http://rt.cpan.org/NoAuth/Bug.html?id=16734>.

Regards,
-- 
Niko Tyni
ntyni@iki.fi