Bug #18567 for Encode: gsm0338 encode malfunction

Thu Apr 06 11:02:08 2006 Guest - Ticket created

Subject:

gsm0338 encode malfunction

decode gsm0338 correctly translates Arabic alef (0B5F) to (D8A7), but
encode gsm0338 translates (D8A7) to (0B).
----
Can you add support for Greek capitals?
----
Where can I find the spec that says which characters can be sent in
gsm0338? Does this list change based upon language selection?
----
This is perl, v5.8.7 built for i486-linux-gnu-thread-multi
(with 1 registered patch, see perl -V for more detail)
Linux 2.6.12-10-386 #1 Mon Feb 13 12:13:15 UTC 2006 i686 GNU/Linux

Thu Apr 06 11:34:07 2006 DANKOGAI [...] cpan.org - Correspondence added

Strange. Arabic alef does not even exist in gsm0338. The one used in Encode.pm is based upon

http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT

Would you explain in more detail?

Dan the Encode Maintainer

Thu Apr 06 11:34:08 2006 The RT System itself - Status changed from 'new' to 'open'

Fri Apr 07 05:42:32 2006 michael [...] email4all.org - Correspondence added

Subject:	Re: [rt.cpan.org #18567] gsm0338 encode malfunction
Date:	Fri, 07 Apr 2006 12:43:11 +0300
To:	bug-Encode [...] rt.cpan.org
From:	Michael Virgo <michael [...] email4all.org>

Hi Dan,

It gets more puzzling the more I investigate, but first to isolate the bug specifically:

$msgtxt = chr(0xa7);
$msgtxt = Encode::encode ("gsm0338", $msgtxt);

$msgtxt is then "" not chr(0x5F) as I would expect from the table you sent:

0x5F 0x00A7 # SECTION SIGN

Now about the Arabic. My goal is to write an SMS gateway that will cope with as many languages as possible, and especially Arabic. So I got a friend to send me an Arabic GSM message. He started with one containing a single alef. This arrived in GSM 7 bit format as 0x0b5f. Decoding that with gsm0338 it becomes 0xd8a7. When printed into the terminal this displays as a vertical bar (the correct shape for the letter alef).

I spent ages trying to work out how that was an alef (not being familiar with utf8), and finally discovered that if I ran

$msgtxt = Encode::decode_utf8 ($msgtxt);

it was translated to \x0627, an Arabic alef.

Fine, but there was a bug in the version of Encode I was using so I installed version 2.14. Now the process above no longer works. It only works if the utf8 flag is off. I don't understand this. If perl's internal format is utf8, I would not necessarily expect Encode::decode_utf8 to do anything. But since it used to translate 0xd8a7 from utf8 to \x0627, which I think is correctly called ucs2, I would expect it to be equivalent to

$msgtxt = Encode::encode ("UCS2", $msgtxt);

but that doesn't make any change to my utf8 string 0xd8a7 whether or not the utf8 flag is set.

Please can you explain this? What I want is to be able to translate the GSM into utf8, then translate the utf8 to ucs2 (0x0b5f -> 0xd8a7 -> 0x0627). Shouldn't there be a perl way of doing this without having to adjust the utf8 flag? I would also like to be able to do the reverse translation.

Thanks for your help,
Michael
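
Note: one possible reading of the numbers in this message, offered as a sketch and not as a diagnosis. In the GSM0338.TXT mapping cited earlier in the ticket, 0x0B maps to U+00D8 and 0x5F maps to U+00A7, so decoding the two GSM bytes 0B 5F yields two Latin-1 range characters whose code points happen to spell D8 A7. Written to a UTF-8 terminal as raw bytes, D8 A7 is the UTF-8 encoding of U+0627 (Arabic alef). The helper names below are hypothetical; only decode() and encode() from Encode are real API.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret a byte string as UTF-8 and return the hex
# form of its big-endian UCS-2 encoding.
sub utf8_bytes_to_ucs2_hex {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);             # bytes -> Perl characters
    return unpack 'H*', encode('UCS-2BE', $chars);   # characters -> UCS-2 bytes
}

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

# The two bytes D8 A7, read as UTF-8, are the single character U+0627.
print utf8_bytes_to_ucs2_hex("\xD8\xA7"), "\n";      # prints 0627

# Decoding the GSM bytes 0B 5F gives characters, not UTF-8 bytes; inspect
# their code points instead of printing them raw to the terminal.
print code_points(decode('gsm0338', "\x0B\x5F")), "\n";
```

The point of the sketch is that decode() returns characters, and a second decode_utf8() only appears to "work" when those characters are later treated as bytes again.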


Sun Jun 04 00:02:33 2006 DANKOGAI [...] cpan.org - Correspondence added

On Fri Apr 07 05:42:32 2006, michael@email4all.org wrote:

> It gets more puzzling the more I investigate, but first to isolate the
> bug specifically:
>
> $msgtxt = chr(0xa7);
> $msgtxt = Encode::encode ("gsm0338", $msgtxt);

encode()? not decode()?
Unless you 'use utf8', chr(0xa7) will be treated as ISO-Latin, not UTF8.
I've got a feeling you misused Encode and perl unicode rather than found a bug.

I'll close this ticket for the time being. Please read perlunicode and perluniintro (and perlunitut if you have bleedperl handy). If you still encounter the bug, send me a mail BEFORE issuing a ticket via RT.

Dan the Encode Maintainer

Sun Jun 04 00:02:34 2006 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Thu Oct 05 06:26:36 2006 michael [...] email4all.org - Correspondence added

Subject:	Re: [rt.cpan.org #18567] gsm0338 encode malfunction
Date:	Thu, 5 Oct 2006 10:25:59 -0000 (GMT)
To:	bug-Encode [...] rt.cpan.org
From:	michael [...] email4all.org

Hi Dan,

Sorry for the delay, I've been away for a while.

Please try this code. Why doesn't encode do the reverse operation from decode for these 6 characters?

#!/usr/bin/perl
#
# Simple test program that executes encode and decode gsm338
#
# ******************************************
require Encode;

for ($i=0;$i<128;$i++)
{ $gsmtxt = chr($i);
  $msgtxt = Encode::decode("gsm0338", $gsmtxt);
  $msgord = ord($msgtxt);
  $ngsmtxt = Encode::encode("gsm0338", $msgtxt);
  $ngsmord = ord($ngsmtxt);
  if (($i != $ngsmord) and ($i != 0x1b))
  { printf "%4x%4x", $i,$msgord;
    #print " $msgtxt";
    printf "%4x\n", $ngsmord;
  }
}

Michael
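
Note: the same round-trip check can be sketched under "use strict" so that an empty encode() result is not confused with GSM code 0x00 (ord("") is 0). The function name roundtrip_failures is hypothetical, and which codes it reports, if any, depends on the installed Encode version, so no expected output is shown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: for every single-byte GSM code except the escape
# prefix 0x1B, decode it to Perl characters and encode it back. Returns one
# [gsm_hex, unicode_hex, back_hex] row per code that does not survive.
sub roundtrip_failures {
    my @failures;
    for my $code (0 .. 127) {
        next if $code == 0x1B;                      # escape prefix, not a character
        my $gsm  = chr $code;
        my $text = decode('gsm0338', $gsm);         # GSM byte -> characters
        my $back = encode('gsm0338', $text);        # characters -> GSM bytes
        next if $back eq $gsm;                      # compare bytes, not ord()
        push @failures, [
            unpack('H*', $gsm),
            join(' ', map { sprintf '%04X', ord } split //, $text),
            unpack('H*', $back),
        ];
    }
    return @failures;
}

printf "%s -> %s -> %s\n", @$_ for roundtrip_failures();
```

Comparing the byte strings directly also catches the case where encode() returns more than one byte.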


Thu Oct 05 06:26:48 2006 The RT System itself - Status changed from 'resolved' to 'open'

Fri Apr 06 07:50:46 2007 DANKOGAI [...] cpan.org - Correspondence added

On Thu Oct 05 06:26:36 2006, michael@email4all.org wrote:

> Hi Dan,
>
> Sorry for the delay, I've been away for a while.
>
> Please try this code. Why doesn't encode do the reverse operation from
> decode for these 6 characters?
>
> #!/usr/bin/perl
> #
> # Simple test program that executes encode and decode gsm338
> #
> # ******************************************
> require Encode;
>
> for ($i=0;$i<128;$i++)
> { $gsmtxt = chr($i);
>   $msgtxt = Encode::decode("gsm0338", $gsmtxt);

decode? not encode? Here you are treating $i as a GSM character, not a UTF-8 character. This code does not make sense to me.

>   $msgord = ord($msgtxt);
>   $ngsmtxt = Encode::encode("gsm0338", $msgtxt);
>   $ngsmord = ord($ngsmtxt);
>   if (($i != $ngsmord) and ($i != 0x1b))
>   { printf "%4x%4x", $i,$msgord;
>     #print " $msgtxt";
>     printf "%4x\n", $ngsmord;
>   }
> }
>
> Michael

Till you convince me it's Encode's bug, not your misunderstanding, I'll close this ticket. Please open a new ticket if you find new evidence.

Dan the Encode Maintainer

Fri Apr 06 07:50:52 2007 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sat Apr 07 04:18:41 2007 michael [...] email4all.org - Correspondence added

Subject:	Re: [rt.cpan.org #18567] gsm0338 encode malfunction
Date:	Sat, 7 Apr 2007 08:18:22 -0000 (GMT)
To:	bug-Encode [...] rt.cpan.org
From:	michael [...] email4all.org

Hi Dan,

Yes, I do want to start from incoming SMS messages, therefore GSM characters. If it's my misunderstanding that means I can't convert GSM characters into Perl format and back again without having to tweak internal flags in perl, I think it's time I found a more sane language to write in!

Seriously, please will you tell me what information I need to add to my GSM characters to tell decode to convert them in such a way that encode can reverse the process?

Thanks,
Michael
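
Note: a minimal sketch of the conversion asked about here, assuming each incoming GSM septet has already been unpacked into one byte. Nothing in it touches the utf8 flag: decode() turns GSM bytes into Perl characters, and encode() turns characters into whatever byte format the output side needs. The helper names are hypothetical; decode() and encode() are the documented Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers wrapping the two directions of the conversion.
sub gsm_to_text  { my ($gsm)  = @_; return decode('gsm0338', $gsm);  }
sub text_to_gsm  { my ($text) = @_; return encode('gsm0338', $text); }
sub text_to_utf8 { my ($text) = @_; return encode('UTF-8',   $text); }

# ASCII letters have the same codes in GSM 03.38, so "Hello" serves as a
# simple stand-in for an incoming message.
my $incoming = "Hello";
my $text     = gsm_to_text($incoming);   # characters, independent of any encoding
my $outgoing = text_to_gsm($text);       # back to GSM bytes for sending

print text_to_utf8($text), "\n";         # bytes suitable for a UTF-8 terminal
print $outgoing eq $incoming ? "round trip ok\n" : "round trip differs\n";
```

Keeping the text as characters between the two calls, and encoding only at the boundaries, is the usual pattern described in perlunitut.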


Sat Apr 07 04:18:42 2007 The RT System itself - Status changed from 'resolved' to 'open'

Mon Apr 23 14:43:01 2007 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'