Bug #15152 for Petal: Decoding Entities

Thu Oct 20 02:38:25 2005 Guest - Ticket created

Subject:

Decoding Entities

Hi Bruno,

As you know, I've struggled quite a bit with entities and character encodings. It seems that I've got many clients whose editors like to use &nbsp; entities. At any rate, I finally realized tonight that Petal does not appear to have used the Petal::Entities module since v1.06. Instead, the MKDoc::XML::Decode classes are being used. I suppose you may be leaving it in for backwards compatibility.

Despite my efforts, I still have some pages which insist on displaying the FFFD character in place of the chr() characters that MKDoc::XML::Decode::XHTML is outputting (some pages work and some don't, although both files appear the same). I have found that adding the following line after $decode->process is called in Petal.pm fixes the problem:

    $res = Encode::encode('utf8', $res);

I'm not sure this is a permanent solution, but it seems to be working in all situations.

Furthermore, on the pages which do not work, if I dump the page to the logs, I see the nbsp entities being output as \x{a0}. In a newsgroup posting[1], I found the following code which shows these \x chars are not being treated as Unicode:

    perl -MDevel::Peek -we "print Dump qq/\xa0/,Dump qq/\x{a0}/"

I'm guessing that the Encode::encode function that I'm using is converting these back to utf8 so that the browser displays them correctly. This doesn't explain why some pages come through OK, but I'm happier with having the extra code when decoding the page than having the bad characters showing.

Thoughts?

Thanks,
William

[1] http://groups.google.com/group/perl.perl5.porters/browse_thread/thread/b847e3300a91f71d/1415aa675c860c53?lnk=st&q=%22\x%7Ba0%7D%22&rnum=2&hl=en#1415aa675c860c53
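Note: the following is a minimal sketch of the behaviour described in the message above, not Petal's or MKDoc::XML's actual code. It assumes the decoded entity is produced with chr(160) and that the browser reads the response as UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# chr(160) is the non-breaking space that &nbsp; stands for.
# For code points 128..255, Perl stores the result as a single byte by default.
my $nbsp = chr(0xA0);

# Written to a raw filehandle, this is the lone byte 0xA0, which is not a
# valid UTF-8 sequence; a browser decoding the page as UTF-8 shows U+FFFD.
my $raw_hex = unpack('H*', $nbsp);

# Encoding explicitly yields the two-byte UTF-8 sequence for U+00A0.
my $utf8_hex = unpack('H*', Encode::encode('utf8', $nbsp));

print "raw bytes:     $raw_hex\n";     # a0
print "encoded bytes: $utf8_hex\n";    # c2a0
```

This is consistent with the Encode::encode('utf8', $res) line making the affected pages render correctly, although it does not by itself show where in the processing chain the encoding should happen.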

Thu Oct 20 09:07:25 2005 Guest - Correspondence added

> Furthermore, on the pages which do not work, if I dump the page to the
> logs, I see the nbsp entities being output as \x{a0}. In a
> newsgroup posting[1], I found the following code which shows these
> \x chars are not being treated as Unicode:

Actually, both the good and bad pages were outputting this character. I do not know why Perl would sometimes show it as a real nbsp and sometimes substitute it. I wonder if something on the page is causing the processing of the document to interfere with these characters. I'll try to do some more testing tonight.

Thanks,
William

Fri Oct 21 00:55:29 2005 Guest - Correspondence added

From:	william [...] knowmad.com

Bruno,

I think that I've finally tracked down this issue. It turns out that the documents which do not display the FFFD character have a high level character (greater than 255) such as —. Those which display it only have the   or other latin1 type entities. I have attached an example script which shows the FFFD being output when the page is viewed in a browser (tested with perl 5.8.4 on FreeBSD and Linux).

I believe that the documentation for chr() explains why this is happening:

    "Note that characters from 128 to 255 (inclusive) are by default not encoded in UTF-8 Unicode for backward compatibility reasons (but see encoding)."

So, I think it's going to be necessary to run the Encode::encode('utf8', $res) line after the decoding in order to force the encoding back to utf8.

BTW, I tried the 'encoding' pragma in the script, in Petal.pm and in MKDoc::XML::Decode. None of these worked for me.

Thoughts?

Thanks,
William

Download petal_bug-fffd.cgi
application/x-cgi 512b
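
Note: the attached script is not reproduced in this ticket. The following is a separate, minimal sketch (with hypothetical strings) of the difference described in the message above: a string holding only code points below 256 stays in Perl's single-byte form, while one that also contains a character above 255 is stored internally as UTF-8. When such strings are printed to a filehandle with no encoding layer, the first emits the bare byte 0xA0 and the second emits its UTF-8 bytes (with a "Wide character" warning), which would explain why only the latin1-only pages show FFFD.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical page fragments after entity decoding.
my $latin1_only = "a" . chr(0xA0) . "b";          # only code points < 256
my $with_mdash  = "a" . chr(0xA0) . chr(0x2014);  # also contains U+2014

# utf8::is_utf8 reports whether Perl stores the string internally as UTF-8.
my $flag_latin1 = utf8::is_utf8($latin1_only) ? 1 : 0;
my $flag_mdash  = utf8::is_utf8($with_mdash)  ? 1 : 0;

# Encoding explicitly gives valid UTF-8 regardless of the internal storage.
my $hex_latin1 = unpack('H*', Encode::encode('utf8', $latin1_only));
my $hex_mdash  = unpack('H*', Encode::encode('utf8', $with_mdash));

print "latin1 only: flag=$flag_latin1 bytes=$hex_latin1\n";
print "with mdash:  flag=$flag_mdash bytes=$hex_mdash\n";
```

In both cases U+00A0 comes out as c2 a0 once the string is encoded explicitly, so the result no longer depends on what else happens to be on the page.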


Fri Oct 21 00:57:12 2005 Guest - Correspondence added

So the characters that I put into that last message are not being escaped. The em dash should have been the &mdash; entity. The nbsp should have been &nbsp;. Hopefully that makes sense.

William

Fri Oct 21 01:08:15 2005 Guest - Correspondence added

OK, once again. It turns out that I was editing the wrong Petal.pm file. Setting "use encoding 'utf8'" does in fact work inside of Petal.pm. So does the Encode::encode() function. I'm not sure which is preferable, though the docs for encoding[1] seemed to indicate that it's not nice to use the pragma in a module.

William

[1] http://perldoc.perl.org/encoding.html

Tue Oct 25 08:16:02 2005 bruno [...] postle.net - Correspondence added

Date:	Tue, 25 Oct 2005 13:15:26 +0100
From:	Bruno Postle <bruno [...] postle.net>
To:	bug-Petal [...] rt.cpan.org
Subject:	Re: [cpan #15152] Decoding Entities
RT-Send-Cc:

Guest via RT wrote:

> I have attached an
> example script which shows the FFFD being output when the page is viewed
> in a browser (tested with perl 5.8.4 on FreeBSD and Linux).

Hopefully I'll get to look at this soon. I can't find this script anywhere - I would love to be able to write a test for this and pin it down for good.

Tue Oct 25 09:13:23 2005 Guest - Correspondence added

[bruno@postle.net - Tue Oct 25 08:16:02 2005]:

> Hopefully I'll get to look at this soon. I can't find this script
> anywhere - I would love to be able to write a test for this and pin
> it down for good.

Hey Bruno,

The attachment is in there. On my system, I needed to scroll horizontally to see the Download button. If you still don't see it, let me know and I'll send it separately. I too would love to have this issue pinned down for good.

Thanks,
William

Fri Jan 27 17:02:36 2006 Guest - Correspondence added

From:	william [...] knowmad.com

On Tue Oct 25 08:16:02 2005, bruno@postle.net wrote:

> Guest via RT wrote:
>
> > I have attached an
> > example script which shows the FFFD being output when the page is viewed
> > in a browser (tested with perl 5.8.4 on FreeBSD and Linux).
>
> Hopefully I'll get to look at this soon. I can't find this script
> anywhere - I would love to be able to write a test for this and pin
> it down for good.

Hi Bruno,

I see you released a new version. Any chance you can address this issue now?

Thanks,
William

Fri Jan 27 17:02:37 2006 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 28 17:16:53 2006 Guest - Correspondence added

From:	william [...] knowmad.com

Bruno,

I'm really confused about character set handling in Perl. In a recent project which has brought me back to addressing this issue, I'm finding that I need to use the following code at the end of sub process in Petal.pm:

    ($] > 5.007) and do { $res = Encode::decode('iso-8859-1', $res); };
    $@ and warn $@;
    return $res;

If I don't, the page is output in the utf-8 charset but the contents contain iso-8859-1 characters. If I use Encode::encode('utf8', $res), I get a different character. When doing this, it appears that the template was read in as utf8, converted to latin1 (which produces the 0xFFFD char), then converted back to utf8 (which produces the 0xC2 character).

I'm not sure if that decode line is safe for systems that run non-latin1 character sets. It's probably safer to patch MKDoc::XML::Decode::XHTML to decode the results of the chr() function which, I think, is how those entities are getting converted from utf8 back to latin1 (or whatever the native format is for the system).

Actually, we may be able to use the decode_charset option to figure out how to do the decoding. Well, that doesn't work. When I used that option with the code above, it failed with the following error if the charset is utf8 (which is the default):

    Cannot decode string with wide characters at /usr/local/lib/perl5/5.8.3/i686-linux/Encode.pm line 164.

I don't really understand the error, but I guess it's trying to decode some utf8 back to Perl's internal format (which I think is utf8 as of 5.8).

All of this is making my head hurt. Hopefully the code sample will help...

Talk to you later,
William
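Note: a minimal sketch of how the "Cannot decode string with wide characters" error quoted above can arise, assuming (this is not confirmed anywhere in the ticket) that the string handed to Encode::decode already contains a character above 255. Encode::decode expects bytes as input, so a string that already holds such a character cannot be treated as bytes. The strings and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# A character string that already contains a code point above 255.
my $chars = "dash: " . chr(0x2014);

# decode() turns bytes into characters; given wide characters it dies.
my $error = '';
eval { Encode::decode('utf8', $chars); 1 } or $error = $@;

# encode() is the direction that turns characters into bytes for output.
my $bytes = Encode::encode('utf8', $chars);

print "decode error: $error";
print "encoded length: ", length($bytes), " bytes\n";
```

The exact wording and line number of the error come from the installed Encode version, so they may differ from the message quoted in the ticket.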

Tue Feb 27 02:22:58 2007 WMCKEE [...] cpan.org - Correspondence added

I was continuing to run into Unicode errors on a recent project and dove back into the fray. I found a good posting about Unicode by Ivan Kurmanov at http://ahinea.com/en/tech/perl-unicode-struggle.html. Between his comments and a reading of perluniintro, I think I have a patch for MKDoc::XML::Decode::XHTML that fixes my encoding issues. I've posted the patch to MKDoc::XML as #25166 - http://rt.cpan.org//Ticket/Display.html?id=25166

I hope this will finally put this issue to sleep!