Bug #87119 for Text-Unidecode: Characters converting to "a" character instead of representative character or empty string otherwise

Sun Jul 21 08:54:48 2013 harisekhon [...] gmail.com - Ticket created

Subject:	Characters converting to "a" character instead of representative character or empty string otherwise
Date:	Sun, 21 Jul 2013 13:53:54 +0100
To:	bug-Text-Unidecode [...] rt.cpan.org
From:	Hari Sekhon <harisekhon [...] gmail.com>

Hi Sean,

Some annoying characters seem to convert to the character "a" instead of the correct character or just nothing if they aren't representable in ASCII. For example:

”\”

which appears on a web page as "\" but gets converted to a\a. In this case double quote backslash double quote is representable in ASCII. I find this happening a lot with space dash space copied from websites as well.

Thanks

Hari Sekhon

Sun Jul 21 09:17:44 2013 sburke [...] cpan.org - Correspondence added


> Some annoying characters seem to convert to the character "a" instead
> of the correct character or just nothing if they aren't representable
> in ASCII. For example:
>
> ”\”
>
> which appears on a web page as "\"

Thank you for your bug report! But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem? I'm suspecting this is a problem to do with encodings.

Sun Jul 21 09:17:44 2013 The RT System itself - Status changed from 'new' to 'open'

Sun Jul 21 09:17:45 2013 sburke [...] cpan.org - Taken

Sun Jul 21 09:33:13 2013 harisekhon [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #87119] Characters converting to "a" character instead of representative character or empty string otherwise
Date:	Sun, 21 Jul 2013 14:32:22 +0100
To:	bug-Text-Unidecode [...] rt.cpan.org
From:	Hari Sekhon <harisekhon [...] gmail.com>

Hi Sean,

See attached unidecode_example.pl where I copy/pasted the string straight into a variable and called unidecode on the variable.

Thanks

Hari

On 21 July 2013 14:17, Sean M. Burke via RT <bug-Text-Unidecode@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=87119 >
>
>> Some annoying characters seem to convert to the character "a" instead
>> of the correct character or just nothing if they aren't representable
>> in ASCII. For example:
>>
>> ”\”
>>
>> which appears on a web page as "\"
>
> Thank you for your bug report!
> But hm, I can't reproduce the error. Can you give me a short Perl program that demonstrates the problem?
> I'm suspecting this is a problem to do with encodings.
>

Message body is not shown because sender requested not to inline it.

Thu Aug 01 08:48:37 2013 sburke [...] cpan.org - Correspondence added

Ah, this is a thing where the string looks like utf8 to you but is flat bytes to Perl.

Add this line to your code:

    print "It is ", length($str), " characters long.\n";

And it'll say

    It is 33 characters long.

But use this program:

    #!/usr/bin/perl
    use Text::Unidecode;
    # copied from a website where it appears as value="\"@timestamp\":\""
    my $str = 'value=”\”@timestamp\”:\”"';
    # turn the raw UTF-8 bytes into real characters, in place
    utf8::decode($str);
    binmode(STDOUT, ":utf8") || die "WHUT $!";
    # Read perldoc:
    #   perlunitut, perluniintro, perlrun, bytes, perlunicode perluni
    # where there's explanations of perl -CL and other fun stuff
    # that might, or might not, be more DWIM than having to
    # call utf8::decode as above.

    print 'string as it appears on website : value="\"@timestamp\":\""' . "\n";
    print "raw string as copy/pasted in Mac terminal: $str\n";
    print "It is ", length($str), " characters long.\n";
    print "string returned by unidecode() : " . unidecode($str) . "\n";

And that works, and it says:

    It is 25 characters long.
    string returned by unidecode() : value="\"@timestamp\":\""

The "a"s were coming from the fact that the byte values for the ” you have are e2 80 9d. Without the decode, Perl treats each byte as its own character. Now, 80 and 9d are no good in Unicode (they land on unprintable control characters) so each of them becomes the empty string, but e2 is "â" ...which Unidecode turns into "a", and that's why it looks like Unidecode is turning a ” character into an a character.

BTW, in mystery cases like this, I often throw in a thing like this to make sure that what I consider characters and what Perl considers characters are syncing up, or not:

    foreach my $char (split '', $str) {
        printf "\tChar %0x : \"%s\" => u:\"%s\"\n",
            ord($char), $char, unidecode($char);
    }

Am I making sense? I often explain things poorly and can't tell. And "perldoc utf8" sometimes leaves me more confused than before I read it! I often just go thru the various functions and call one or the other until I get whichever one does the job... and then I see that its documentation *now* (in 20/20 hindsight) makes perfect sense.

OH UNICODE!
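The bytes-versus-characters point above can be checked without Text::Unidecode at all. What follows is only a minimal sketch using core Perl: the variable names are made up for illustration, and the three bytes are written as escapes so the result does not depend on how the script file itself is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three bytes of one UTF-8 encoded right double quotation mark,
# the e2 80 9d sequence discussed above. $bytes is an illustrative name.
my $bytes = "\xE2\x80\x9D";

# Undecoded, Perl sees three separate characters: e2, 80 and 9d.
my @before = map { sprintf "%02x", ord($_) } split //, $bytes;

# utf8::decode works in place and returns true if the bytes were valid UTF-8.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";
my @after = map { sprintf "%04x", ord($_) } split //, $chars;

printf "before decode: %d character(s): %s\n", length($bytes), "@before";
printf "after decode:  %d character(s): %s\n", length($chars), "@after";
```

Run as written, this should report three characters (e2 80 9d) before the decode and a single character (201d) after it, which is the same shrinkage as the 33 versus 25 lengths in the program above.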