
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 46701
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: bhawkeslewis [...] googlemail.com
Cc: pali [...] cpan.org
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.33
Fixed in: 2.47



Subject: Incorrect character mapping in Encode::GSM0338
As you can see from the source code:

http://cpansearch.perl.org/src/DANKOGAI/Encode-2.33/lib/Encode/GSM0338.pm

Encode::GSM maps 0x09 in GSM to lowercase c cedilla in Unicode (U+00E7).

    "\x{00E7}" => "\x09", # LATIN SMALL LETTER C WITH CEDILLA

But I think this is wrong.

GSM 03.38 maps the same character to /uppercase/ c cedilla (U+00C7).

See ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications system (Phase 2+); Alphabets and language-specific information (GSM 03.38 version 7.2.0 Release 1998), Section 6.2.1 ("Default Alphabet"):

http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf

So this line needs changing to:

    "\x{00C7}" => "\x09", # LATIN CAPITAL LETTER C WITH CEDILLA
On Sat Jun 06 06:03:09 2009, benjaminhawkeslewis wrote:
> As you can see from the source code:
>
> http://cpansearch.perl.org/src/DANKOGAI/Encode-2.33/lib/Encode/GSM0338.pm
>
> Encode::GSM maps 0x09 in GSM to lowercase c cedilla in Unicode (U+00E7).
>
> "\x{00E7}" => "\x09", # LATIN SMALL LETTER C WITH CEDILLA
>
> But I think this is wrong.
>
> GSM 03.38 maps the same character to /uppercase/ c cedilla (U+00C7).
>
> See ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications
> system (Phase 2+); Alphabets and language-specific information (GSM
> 03.38 version 7.2.0 Release 1998), Section 6.2.1 ("Default Alphabet"):
>
> http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf
>
> So this line needs changing to:
>
> "\x{00C7}" => "\x09", # LATIN CAPITAL LETTER C WITH CEDILLA
But that conflicts with what http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf says. Section 6.2.1 just shows the glyph, with no Unicode code point.

Dan the Encode Maintainer
The Unicode Consortium's mapping table for GSM 03.38 has this to say on the matter:

# The ETSI GSM 03.38 specification shows an uppercase C-cedilla
# glyph at 0x09. This may be the result of limited display
# capabilities for handling characters with descenders. However, the
# language coverage intent is clearly for the lowercase c-cedilla, as shown
# in the mapping below. The mapping for uppercase C-cedilla is shown
# in a commented line in the mapping table.

The other accented characters in column 0000 of the table are mostly lowercase with no uppercase equivalents elsewhere in the mapping, so who knows.
Just to note: since Encode version 2.47, GSM0338 byte 0x09 is mapped to Unicode code point U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA).
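To see which mapping a given installation uses, a short check along these lines may help. It is only a sketch: it assumes the Encode distribution is installed and that the encoding is registered under the name "gsm0338". The result depends on the installed Encode version, so no expected output is shown.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Report the installed version, since the mapping differs between releases.
printf "Encode version: %s\n", Encode->VERSION;

# Decode the single GSM byte 0x09 and print the resulting code point.
# U+00E7 is lowercase c cedilla, U+00C7 is uppercase C cedilla.
my $char = decode('gsm0338', "\x09");
printf "GSM 0x09 decodes to U+%04X\n", ord $char;
```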
On Wed Sep 18 20:37:34 2013, WIML wrote:
> The Unicode Consortium's mapping table for GSM 03.38 has this to say
> on the matter:
>
> # The ETSI GSM 03.38 specification shows an uppercase C-cedilla
> # glyph at 0x09. This may be the result of limited display
> # capabilities for handling characters with descenders. However, the
> # language coverage intent is clearly for the lowercase c-cedilla, as shown
> # in the mapping below. The mapping for uppercase C-cedilla is shown
> # in a commented line in the mapping table.
>
> The other accented characters in column 0000 of the table are mostly
> lowercase with no uppercase equivalents elsewhere in the mapping, so
> who knows.
I'm afraid the Unicode Consortium's mapping table is incorrect here. Older GSM specifications may not have been clear on this point (which may be how the Unicode Consortium arrived at its interpretation), but the latest GSM 03.38 specification, ETSI TS 123 038 V16.0.0 (2020-07), available at https://www.etsi.org/deliver/etsi_ts/123000_123099/123038/16.00.00_60/ts_123038v160000p.pdf, clearly places upper case C-cedilla at position 0x09 of the GSM 7 bit Default Alphabet, while lower case c-cedilla is available in some National Single Shift Alphabets. National Single Shift Alphabets are used when requested via escape byte 0x1B.

So GSM 03.38 supports both lower case and upper case C-cedilla. The Unicode Consortium's mapping table above supports only lower case c-cedilla, a limitation that stems from the incorrect interpretation.

Note that Encode::GSM0338 currently does not provide National Single Shift Alphabets, and therefore does not support lower case c-cedilla. The choice of National Single Shift Alphabet is indicated out-of-band, so these alphabets cannot be implemented directly in the Encode::GSM0338 module: the Encode API has no way to pass out-of-band settings when encoding or decoding strings. The best option for a future implementation of National Single Shift Alphabets and National Locking Shift Alphabets in the Encode API would therefore be a separate encoding module for each alphabet, e.g. GSM0338-Turkish, GSM0338-Spanish, ...
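The out-of-band problem can be illustrated with a small stand-alone sketch. Nothing here is part of Encode: the function decode_gsm_sketch, the alphabet names "spanish" and "turkish", and the two tiny tables are hypothetical, and the single shift entries are given from memory of TS 123 038, so they should be checked against the specification. The point is only that the same byte sequence decodes differently depending on an extra argument that the Encode API has no slot for.

```perl
use strict;
use warnings;

# Partial default alphabet: only the entries needed for this illustration.
my %default = (
    0x09 => "\x{00C7}",    # LATIN CAPITAL LETTER C WITH CEDILLA
    0x63 => "c",
);

# Partial single shift tables, keyed by a hypothetical alphabet name.
# Assumption: entries should be verified against TS 123 038.
my %single_shift = (
    spanish => { 0x09 => "\x{00E7}" },    # LATIN SMALL LETTER C WITH CEDILLA
    turkish => { 0x63 => "\x{00E7}" },
);

# $alphabet stands in for the out-of-band language indication.
sub decode_gsm_sketch {
    my ($bytes, $alphabet) = @_;
    my $extension = defined $alphabet ? ($single_shift{$alphabet} // {}) : {};
    my @codes = unpack 'C*', $bytes;
    my $out = '';
    while (@codes) {
        my $code = shift @codes;
        if ($code == 0x1B && @codes) {
            # The escape byte selects the single shift table for the next code only.
            # In this sketch, unknown escaped codes fall back to the default table.
            my $next = shift @codes;
            $out .= $extension->{$next} // $default{$next} // "\x{FFFD}";
        }
        else {
            $out .= $default{$code} // "\x{FFFD}";
        }
    }
    return $out;
}

printf "U+%04X\n", ord decode_gsm_sketch("\x09");                   # U+00C7
printf "U+%04X\n", ord decode_gsm_sketch("\x1B\x09", 'spanish');    # U+00E7
printf "U+%04X\n", ord decode_gsm_sketch("\x1B\x63", 'turkish');    # U+00E7
```

Because the second argument cannot be passed through encode() or decode(), giving each alphabet its own encoding name would effectively move that argument into the name itself.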