Bug #46701 for Encode: Incorrect character mapping in Encode::GSM0338

Sat Jun 06 06:03:09 2009 bhawkeslewis [...] googlemail.com - Ticket created

Subject:

Incorrect character mapping in Encode::GSM0338

As you can see from the source code:

http://cpansearch.perl.org/src/DANKOGAI/Encode-2.33/lib/Encode/GSM0338.pm

Encode::GSM maps 0x09 in GSM to lowercase c cedilla in Unicode (U+00E7).

"\x{00E7}" => "\x09", # LATIN SMALL LETTER C WITH CEDILLA

But I think this is wrong.

GSM 03.38 maps the same character to /uppercase/ c cedilla (U+00C7).

See ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications system (Phase 2+); Alphabets and language-specific information (GSM 03.38 version 7.2.0 Release 1998), Section 6.2.1 ("Default Alphabet"):

http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf

So this line needs changing to:

"\x{00C7}" => "\x09", # LATIN CAPITAL LETTER C WITH CEDILLA

Wed Jul 08 09:16:19 2009 DANKOGAI [...] cpan.org - Correspondence added

On Sat Jun 06 06:03:09 2009, benjaminhawkeslewis wrote:

> As you can see from the source code:
>
> http://cpansearch.perl.org/src/DANKOGAI/Encode-2.33/lib/Encode/GSM0338.pm
>
> Encode::GSM maps 0x09 in GSM to lowercase c cedilla in Unicode (U+00E7).
>
> "\x{00E7}" => "\x09", # LATIN SMALL LETTER C WITH CEDILLA
>
> But I think this is wrong.
>
> GSM 03.38 maps the same character to /uppercase/ c cedilla (U+00C7).
>
> See ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications
> system (Phase 2+); Alphabets and language-specific information (GSM
> 03.38 version 7.2.0 Release 1998), Section 6.2.1 ("Default Alphabet"):
>
> http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf
>
> So this line needs changing to:
>
> "\x{00C7}" => "\x09", # LATIN CAPITAL LETTER C WITH CEDILLA

But that conflicts with what http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf says. Section 6.2.1 just shows the glyph. No Unicode code point.

Dan the Encode Maintainer

Wed Jul 08 09:16:19 2009 The RT System itself - Status changed from 'new' to 'open'

Wed Sep 18 20:37:34 2013 wiml [...] hhhh.org - Correspondence added

The Unicode Consortium's mapping table for GSM 03.38 has this to say on the matter:

# The ETSI GSM 03.38 specification shows an uppercase C-cedilla
# glyph at 0x09. This may be the result of limited display
# capabilities for handling characters with descenders. However, the
# language coverage intent is clearly for the lowercase c-cedilla, as shown
# in the mapping below. The mapping for uppercase C-cedilla is shown
# in a commented line in the mapping table.

The other accented characters in column 0000 of the table are mostly lowercase with no uppercase equivalents elsewhere in the mapping, so who knows.

Sat Jul 25 09:07:34 2020 pali [...] cpan.org - Cc PALI added

Sat Jul 25 09:09:00 2020 pali [...] cpan.org - Correspondence added

Just to note that since Encode version 2.47, GSM0338 byte 0x09 is mapped to Unicode code point U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA).
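For anyone who wants to confirm which mapping their installed Encode uses, here is a minimal sketch. It relies only on Encode's documented encode/decode functions and the "gsm0338" encoding name; the helper name gsm_0x09_codepoint is made up for this example. The result depends on the installed Encode version, so no particular output is claimed here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode GSM 03.38 byte 0x09 and report the
# Unicode code point that the installed Encode produces for it.
sub gsm_0x09_codepoint {
    my $octets = "\x09";
    my $char   = decode('gsm0338', $octets);
    return sprintf('U+%04X', ord $char);
}

printf "Encode %s decodes GSM 0x09 as %s\n",
    Encode->VERSION, gsm_0x09_codepoint();

# Reverse direction: which GSM byte(s) does capital C with cedilla get?
# On a version without that mapping, Encode substitutes a replacement.
my $capital = "\x{00C7}";
printf "U+00C7 encodes to 0x%s\n",
    uc unpack('H*', encode('gsm0338', $capital));
```

Per the note above, versions from 2.47 onward should report U+00C7, and older ones U+00E7.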

Sat Jul 25 09:13:53 2020 DANKOGAI [...] cpan.org - Correspondence added

https://github.com/dankogai/p5-encode/pull/149 should resolve all that.

Sat Jul 25 09:13:54 2020 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sat Jul 25 16:00:43 2020 pali [...] cpan.org - Correspondence added

On Wed Sep 18 20:37:34 2013, WIML wrote:

> The Unicode Consortium's mapping table for GSM 03.38 has this to say
> on the matter:
>
> # The ETSI GSM 03.38 specification shows an uppercase C-cedilla
> # glyph at 0x09. This may be the result of limited display
> # capabilities for handling characters with descenders. However, the
> # language coverage intent is clearly for the lowercase c-cedilla, as shown
> # in the mapping below. The mapping for uppercase C-cedilla is shown
> # in a commented line in the mapping table.
>
> The other accented characters in column 0000 of the table are mostly
> lowercase with no uppercase equivalents elsewhere in the mapping, so
> who knows.

I'm afraid the Unicode Consortium's mapping table is incorrect here. Maybe older GSM specifications were not clear about this issue (and the Unicode Consortium came up with that incorrect interpretation), but the latest GSM 03.38 specification, ETSI TS 123 038 V16.0.0 (2020-07), available at https://www.etsi.org/deliver/etsi_ts/123000_123099/123038/16.00.00_60/ts_123038v160000p.pdf, is clear that upper case C-cedilla is at position 0x09 of the GSM 7 bit Default Alphabet and that lower case c-cedilla is available in some National Single Shift Alphabets. National Single Shift Alphabets are used when requested via the escape byte 0x1B.

So GSM 03.38 supports both lower case and upper case C-cedilla. The Unicode Consortium's mapping table above supports only lower case c-cedilla, which is a limitation caused by the incorrect interpretation.

Just to note that Encode::GSM0338 currently does not provide National Single Shift Alphabets, and therefore it does not support lower case c-cedilla. The indication of which National Single Shift Alphabet is in use is out-of-band, so these alphabets cannot be implemented directly in the Encode::GSM0338 module, as the Encode API does not provide out-of-band settings when encoding/decoding strings.

So the best choice for a future implementation of National Single Shift Alphabets and National Locking Shift Alphabets in the Encode API would be to provide a new encoding module for every alphabet, e.g. GSM0338-Turkish, GSM0338-Spanish, ...
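To illustrate why the alphabet choice has to come from outside the byte stream, here is a rough sketch of a decoder that takes the language as an explicit argument. Everything in it is an assumption made for illustration: the function name decode_gsm_with_shift, the %SINGLE_SHIFT hash, and the two table entries (Turkish 0x63 and Spanish 0x09 for lower case c-cedilla), which are given from memory of the single shift tables and should be checked against the specification. The tables are deliberately incomplete, and falling back to Encode's own "gsm0338" decoding for unlisted escape sequences is a simplification, not what the specification prescribes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical, deliberately partial National Single Shift tables.
# Key: the byte following the 0x1B escape. Value: the Unicode character.
my %SINGLE_SHIFT = (
    turkish => { 0x63 => "\x{00E7}" },    # assumed: c with cedilla
    spanish => { 0x09 => "\x{00E7}" },    # assumed: c with cedilla
);

# Decode GSM 7 bit octets (already unpacked, one septet per byte),
# with the language supplied out-of-band by the caller.
sub decode_gsm_with_shift {
    my ($octets, $language) = @_;
    my $table = $SINGLE_SHIFT{$language} || {};
    my @bytes = unpack 'C*', $octets;
    my $out   = '';
    while (@bytes) {
        my $byte = shift @bytes;
        if ($byte == 0x1B && @bytes) {
            my $next = shift @bytes;
            if (exists $table->{$next}) {
                $out .= $table->{$next};
            }
            else {
                # Simplification: let Encode handle the escape pair.
                my $pair = pack 'C2', 0x1B, $next;
                $out .= decode('gsm0338', $pair);
            }
            next;
        }
        # Default alphabet bytes are delegated to Encode::GSM0338.
        my $single = pack 'C', $byte;
        $out .= decode('gsm0338', $single);
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_gsm_with_shift("\x41\x1b\x09\x42", 'spanish'), "\n";
```

The same bytes decode differently depending on the $language argument, and that argument is exactly the out-of-band information the Encode API has no slot for. Hence the suggestion of one encoding name per alphabet.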

Sat Jul 25 16:00:45 2020 pali [...] cpan.org - Fixed in 2.47 added