Bug #71228 for Locale-SubCountry: utf8 "\xE4" does not map to Unicode

Mon Sep 26 00:47:53 2011 xenoterracide [...] gmail.com - Ticket created

Subject:	utf8 "\xE4" does not map to Unicode
Date:	Sun, 25 Sep 2011 23:47:44 -0500
To:	bugs-Locale-SubCountry [...] rt.cpan.org
From:	Caleb Cushing <xenoterracide [...] gmail.com>

I enabled use utf8::all in an application I'm building, and it causes these errors to occur:

    utf8 "\xE4" does not map to Unicode at /home/ccushing/perl5/perlbrew/perls/perl-5.14.1/lib/site_perl/5.14.1/Locale/SubCountry.pm line 251.

--
Caleb Cushing
http://xenoterracide.com

Wed Apr 18 18:36:21 2012 kimryan [...] cpan.org - Correspondence added

I was not able to identify which line of text contains this error. Are you able to locate it?

On Mon Sep 26 00:47:53 2011, XENO wrote:

> I enabled use utf8::all in an application I'm building, it causes
> these errors to occur.
>
> utf8 "\xE4" does not map to Unicode at
> /home/ccushing/perl5/perlbrew/perls/perl-5.14.1/lib/site_perl/5.14.1/Locale/SubCountry.pm
> line 251.

Wed Apr 18 18:36:22 2012 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 18 19:15:37 2012 xenoterracide [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #71228] utf8 "\xE4" does not map to Unicode
Date:	Wed, 18 Apr 2012 18:15:21 -0500
To:	bug-Locale-SubCountry [...] rt.cpan.org
From:	Caleb Cushing <xenoterracide [...] gmail.com>

On Wed, Apr 18, 2012 at 5:36 PM, Kim Ryan via RT <bug-Locale-SubCountry@rt.cpan.org> wrote:

> I was not able to identify which line of text contains this error. Are
> you able to locate it?

Lines 269 and 270 have an auml, which is Latin-1 character E4. I am not sure why this error is thrown when using utf8::all. Removing these should certainly fix it, but I'd think there'd be a way to make this work.

Also, please remove your bugs section ("none known"); POD does not update itself.

--
Caleb Cushing
http://xenoterracide.com

Wed Apr 18 19:17:17 2012 xenoterracide [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #71228] utf8 "\xE4" does not map to Unicode
Date:	Wed, 18 Apr 2012 18:17:03 -0500
To:	bug-Locale-SubCountry [...] rt.cpan.org
From:	Caleb Cushing <xenoterracide [...] gmail.com>

Also, please note that this error can be reproduced fatally on 5.14 by adding the following to the top:

    use warnings qw(FATAL utf8);

--
Caleb Cushing
http://xenoterracide.com

Tue May 08 04:02:06 2012 kimryan [...] cpan.org - Correspondence added

On Wed Apr 18 19:17:17 2012, XENO wrote:

> also please note that this error can be reproduced fatally on 5.14 by
> adding the following to the top
>
> use warnings qw(FATAL utf8);

I tried adding all the warnings, but still can't get them reported, using Perl 5.14.2 on Win32.

Anyway, I looked at the auml character problem. According to http://www.utf8-chartable.de/, a with umlaut is encoded as c3 a4:

    U+00E4    ä    c3 a4    LATIN SMALL LETTER A WITH DIAERESIS

I did a hex dump on line 269, with

    AZERBAIJAN : Länkäran;

and the ä character shows up correctly as c3 a4. So I'm still very puzzled.

I will also fix the other error you reported, with the known bugs section.
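The byte values discussed above can be checked directly. The following is a minimal sketch, assuming only the core Encode module; hex_dump is a hypothetical helper name introduced for illustration. It shows that U+00E4 is the single byte E4 in Latin-1 and the two bytes C3 A4 in UTF-8, and that a lone E4 byte is rejected by a strict UTF-8 decode. That would be consistent with the warning text if the installed file held the Latin-1 byte rather than the UTF-8 pair.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E4, LATIN SMALL LETTER A WITH DIAERESIS, as a Perl character string.
my $char = "\x{E4}";

# The same character serialised under two different encodings.
my $latin1_bytes = encode('ISO-8859-1', $char);   # one byte:  e4
my $utf8_bytes   = encode('UTF-8',      $char);   # two bytes: c3 a4

# Hypothetical helper: render a byte string as space-separated lowercase hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

printf "latin1: %s\n", hex_dump($latin1_bytes);
printf "utf-8:  %s\n", hex_dump($utf8_bytes);

# Strictly decoding the lone Latin-1 byte as UTF-8 fails, because E4 is a
# lead byte that must be followed by continuation bytes.
# LEAVE_SRC keeps decode() from modifying $latin1_bytes in place.
my $ok = eval {
    decode('UTF-8', $latin1_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "lone E4 decodes as UTF-8: ", ($ok ? "yes" : "no"), "\n";
```

A hex dump of the module as installed on the reporter's machine, rather than of a copy opened in an editor, would show which of the two byte sequences is actually present.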

Sat May 26 00:01:00 2012 kimryan [...] cpan.org - Correspondence added

Verified that the UTF-8 character for \xE4, 'a' with umlaut, is correctly mapped. The warning is also being reported for the comments section at line 251 of the code, not the actual code. I'm not sure of the precise problem, but I think that the highest warning level for the utf8 module may be over-reporting errors.

Sat May 26 00:01:02 2012 kimryan [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Feb 26 16:41:47 2013 davidp [...] preshweb.co.uk - Correspondence added

I can still reproduce this error with the latest CPAN version - the key appears to be having use open ":encoding(utf8)" enabled before loading Locale::SubCountry:

    [dave@gen:~]$ perl -w -e 'use open ":encoding(utf8)"; use Locale::SubCountry; my $lc = Locale::SubCountry->new("GB"); print $lc->country;'
    utf8 "\xE4" does not map to Unicode at /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm line 259.
    utf8 "\xE4" does not map to Unicode at /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm line 259.
    utf8 "\xE4" does not map to Unicode at /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm line 259.
    United Kingdom
    [dave@gen:~]$ cpanm Locale::SubCountry
    Locale::SubCountry is up to date. (1.61)

Of course, in response to "it throws warnings when I do $this", "don't do that, then" is a reasonable response, but I thought I'd re-open this ticket with a one-liner example of how to tickle this warning. It's certainly irritating to us in our codebase at work.

I think the line number reported by the warning is wrong, though - line 259 is an empty line for me. The actual source of the problem is the two AZERBAIJAN examples - the following diff silences the warning for us:

    --- SubCountry.pm 2013-02-26 21:27:14.000000000 +0000
    +++ /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm 2013-02-26 21:31:58.000000000 +0000
    @@ -277,8 +277,6 @@
     because the name represents two different types of sub country, such as a
     province and a geographical unit. Examples are:
     
    - AZERBAIJAN : L�nk�ran; LA (the City), LAN (the Rayon)
    - AZERBAIJAN : S�ki; SA,SAK
      AZERBAIJAN : Susa; SS,SUS
      AZERBAIJAN : Yevlax; YE,YEV
      INDONESIA : Kalimantan Timur; KI,KT

Note that whatever characters they're supposed to be, they don't display properly for me in my editor, here in this ticket, or on MetaCPAN.

Having worked out what characters they're supposed to be ("LATIN SMALL LETTER A WITH DIAERESIS"), I copied and pasted the names from Wikipedia containing the right characters, and all was well:

    [dave@gen:~]$ diff -u SubCountry.pm /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm
    --- SubCountry.pm 2013-02-26 21:27:14.000000000 +0000
    +++ /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Locale/SubCountry.pm 2013-02-26 21:39:21.000000000 +0000
    @@ -277,8 +277,8 @@
     because the name represents two different types of sub country, such as a
     province and a geographical unit. Examples are:
     
    - AZERBAIJAN : L�nk�ran; LA (the City), LAN (the Rayon)
    - AZERBAIJAN : S�ki; SA,SAK
    + AZERBAIJAN : Länkäran; LA (the City), LAN (the Rayon)
    + AZERBAIJAN : Shäki; SA,SAK
      AZERBAIJAN : Susa; SS,SUS
      AZERBAIJAN : Yevlax; YE,YEV
      INDONESIA : Kalimantan Timur; KI,KT

So, this looks to be the fix. Would you be willing to incorporate this change in the next version?
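Since the line number in the warning did not point at the offending text, one way to locate such bytes is to read the file as raw bytes and strictly decode each line as UTF-8. This is a rough sketch under stated assumptions: it uses only the core Encode module, and non_utf8_lines is a hypothetical helper that is not part of Locale::SubCountry.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the 1-based numbers of the lines in $path
# whose raw bytes are not valid UTF-8. The file is opened with the :raw
# layer so that no I/O layer decodes the bytes before we inspect them.
sub non_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        # FB_CROAK makes decode() die on malformed input instead of
        # silently substituting a replacement character.
        my $ok = eval { decode('UTF-8', $line, Encode::FB_CROAK()); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}

# Usage (assumed): pass the path of the file to check on the command line.
if (@ARGV) {
    print "$ARGV[0]: line $_ is not valid UTF-8\n"
        for non_utf8_lines($ARGV[0]);
}
```

Run against the installed SubCountry.pm, a scan like this would be expected to flag the two AZERBAIJAN lines if they contain Latin-1 bytes, and to report nothing once they are replaced with UTF-8 or ASCII text.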

Tue Feb 26 16:41:49 2013 The RT System itself - Status changed from 'resolved' to 'open'

Tue Feb 26 18:48:34 2013 kimryan [...] cpan.org - Taken

Tue Feb 26 19:07:38 2013 kimryan [...] cpan.org - Correspondence added

Thanks for the update. I copied the new data for Länkäran from your suggested patch above. When I paste it into a text editor, it still comes up as the multibyte hex sequence C3 A4. I'm using TextPad to view it, and I'm not sure if it is doing some conversion. Maybe we need CP1252 encoding?

I think part of the problem is that not all browsers or editors are UTF-8 aware, or they don't cope with some UTF-8 data mixed into an ASCII file. SubCountry.pm really contains all ASCII data except for this limitations section. I have noticed that in Chrome and IE8 the a with diaeresis does not display correctly when viewing the module documentation.

I'm thinking that a good workaround is to simply drop the A WITH DIAERESIS and use the spelling "Lankaran" in the LIMITATIONS section, with an explanatory note. The correct representation will of course remain in the Data.pm file, all of which is UTF-8 encoded.

This seems like fairly low priority, so I would prefer to combine it with the next data update.

Thu Aug 08 01:46:02 2013 kimryan [...] cpan.org - Correspondence added

All utf-8 characters in the POD section have been removed from the SubCountry.pm file. I converted them to ASCII with an explanatory note.

Thu Aug 08 01:46:03 2013 kimryan [...] cpan.org - Status changed from 'open' to 'resolved'

Thu Aug 08 01:46:03 2013 kimryan [...] cpan.org - Fixed in 1.62 added