This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 94287
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: victor [...] vsespb.ru
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in:
  • 2.55
  • 2.58
Fixed in: (no value)



Subject: strange behaviour in from_to with UTF-8 encoding and Latin1 range
use Devel::Peek;
use Encode qw/from_to/;
my $a = "\xC2\xB5";
die unless from_to($a, 'utf-8', 'cp1251');
Dump $a;

__END__

SV = PV(0x20e2c20) at 0x21104d8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x20fd0a0 "\265"\0
  CUR = 1
  LEN = 16

\xC2\xB5 is a Latin1 character (Unicode 0xB5) in UTF-8 encoding. It does not map to CP1251 (Windows 1251),
but from_to returns byte 0xB5 for it (i.e. the character with code 0xB5 in CP1251).

Why so? It should instead be replaced with the replacement character.

P.S. The behaviour is the same if the $check parameter is true.

Actually, the same happens with encode():

use Devel::Peek;
use Encode qw/encode/;
my $a = "\xB5";
Dump encode("cp1251", "$a");
utf8::upgrade($a);
Dump encode("cp1251", "$a");
__END__
SV = PV(0xbf9d20) at 0xb9f658
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0xc62600 "\265"\0
  CUR = 1
  LEN = 16
SV = PV(0xbe6370) at 0xb9f658
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0xbae0d0 "\265"\0
  CUR = 1
  LEN = 16

The result is the same as if I had specified "latin1" instead of "cp1251".

I am not an expert on CP1251/KOI8-U, but as far as I can see from the conversion table, this appears to be normal:

UTF-8("\xC2\xB5") = chr(0xB5) = cp1251(0xB5) = 'µ' = "\N{MICRO SIGN}"

0265(octal) = 0xB5(hex) = 181(dec)

http://en.wikipedia.org/wiki/Windows-1251

Dan the Maintainer Thereof
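
The mapping cited in the reply above can be checked with a short script. This is only a minimal sketch, not part of the original ticket: it decodes byte 0xB5 from cp1251, encodes U+00B5 back, and then, for contrast, tries a character that is assumed here to have no cp1251 mapping (U+00E9 is an illustrative choice, not one taken from the thread). The last step passes Encode::FB_CROAK as the check argument so that an unmappable character raises an error instead of being silently substituted.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decode the single byte 0xB5 as cp1251 and report the resulting code point.
my $char = decode('cp1251', "\xB5");
printf "cp1251 byte 0xB5 decodes to U+%04X\n", ord $char;

# Encode U+00B5 (MICRO SIGN) to cp1251 and report the resulting byte.
my $bytes = encode('cp1251', "\x{B5}");
printf "U+00B5 encodes to cp1251 byte 0x%02X\n", ord $bytes;

# Contrast case: U+00E9 is an assumed example of a character that cp1251
# does not contain. encode() with a check value may modify its input,
# so a variable is passed rather than a literal.
my $input = "\x{E9}";
my $ok = eval { encode('cp1251', $input, Encode::FB_CROAK); 1 };
print $ok ? "U+00E9 was encoded\n" : "U+00E9 was rejected: $@";
```

If the first two lines both report B5, the output in the ticket is the expected cp1251 mapping rather than a fallback to Latin1.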