This queue is for tickets about the Pod-Simple CPAN distribution.

Report information
The Basics
Id: 29587
Status: resolved
Priority: 0/
Queue: Pod-Simple

People
Owner: dwheeler [...] cpan.org
Requestors: agentzh [...] yahoo.cn
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: 3.16

Attachments


Subject: The HTML emitter escapes all my UTF-8 Chinese chars
Hi, Allison

The HTML emitter blindly escapes every Chinese UTF-8 char into something like this:

<li>&#32534;&#20889;&#20195;&#34920;&#29992;&#25143;&#35843;&#29992;
&#20854;&#20182;&#27169;&#22359;&#30340;&#8220;&#31574;&#30053;&#20803;
&#27169;&#22359;&#8221;&#20063;&#24517;&#39035;&#26159;&#21487;&#33021;
&#30340;&#12290;</li>

I don't know how you would feel on getting an HTML page full of crap like this... and furthermore, such stuff makes my HTML pages unfriendly to search engine crawlers :(

For example, my Chinese transcript of Perl 6's Synopsis 1 won't get indexed by either Google or Yahoo:

http://perlcabal.org/syn/zh-cn/S01.html

Hopefully Pod::Simple::HTML will at least provide an option to disable such annoying behavior.

Thanks,
agentz
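For readers unfamiliar with the numbers in that output: each `&#NNNNN;` is the decimal Unicode code point of one character. The following is a rough sketch of that kind of blanket escaping, not Pod::Simple::HTML's actual code; the helper name `escape_all_non_ascii` is made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are character strings

# Hypothetical helper (not part of Pod::Simple): replace every non-ASCII
# character with a decimal numeric character reference.
sub escape_all_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# U+4F5C and U+8005 are 20316 and 32773 in decimal.
print escape_all_non_ascii('作者'), "\n";    # prints &#20316;&#32773;
```

The result is valid HTML and renders correctly in a browser, but the source no longer contains the Chinese text itself, which is the complaint in this report.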
CC: yichun.zhang [...] alibaba-inc.com
Subject: Re: [rt.cpan.org #29587] The HTML emitter escapes all my UTF-8 Chinese chars
Date: Mon, 24 Sep 2007 11:56:54 -0700
To: bug-Pod-Simple [...] rt.cpan.org
From: Allison Randal <allison [...] perl.org>
Agent Zhang (章亦春) via RT wrote:
> The HTML emitter blindly escapes every Chinese UTF-8 char into something
> like this:
>
> <li>&#32534;&#20889;&#20195;&#34920;&#29992;&#25143;&#35843;&#29992;
> &#20854;&#20182;&#27169;&#22359;&#30340;&#8220;&#31574;&#30053;&#20803;
> &#27169;&#22359;&#8221;&#20063;&#24517;&#39035;&#26159;&#21487;&#33021;
> &#30340;&#12290;</li>

Thanks for the report. Could you send me a very short Pod example and the HTML it should produce? I can use it as a test case.

Sean was usually quite careful about alternate character sets, but Pod::Simple was developed before UTF-8 support was well integrated into Perl, so there may be a bug from the interaction between the two.

Thanks,
Allison
CC: bug-Pod-Simple [...] rt.cpan.org
Subject: Re: [rt.cpan.org #29587] The HTML emitter escapes all my UTF-8 Chinese chars
Date: Tue, 25 Sep 2007 10:58:35 +0800
To: "Allison Randal" <allison [...] perl.org>
From: "Agent Zhang" <agentzh [...] gmail.com>
On 9/25/07, Allison Randal <allison@perl.org> wrote:
> Thanks for the report. Could you send me a very short Pod example and
> the HTML it should produce? I can use it as a test case.

I've attached a minimized test case to this mail :)

The sample HTML file test.html was generated by a temporarily patched Pod::Simple::HTML and verified by "eyes".

Hope this helps :)

Thanks!
agentz
Attachment: test.tar.gz (application/x-gzip, 948b)

> I've attached a minimized test case to this mail :)
>
> The sample HTML file test.html was generated by a temporarily patched
> Pod::Simple::HTML and verified by "eyes".
>
> Hope this helps :)

Thanks. Can you send in the patch you applied, as well? That'd help us to figure out where to apply a formal fix.

--Theory
Subject: Re: [rt.cpan.org #29587] The HTML emitter escapes all my UTF-8 Chinese chars
Date: Tue, 27 Oct 2009 17:59:56 +0800
To: bug-Pod-Simple [...] rt.cpan.org
From: agentzh <agentzh [...] gmail.com>
On Tue, Oct 27, 2009 at 3:08 AM, David Wheeler via RT <bug-Pod-Simple@rt.cpan.org> wrote:
> Thanks. Can you send in the patch you applied, as well? That'd help us to figure out where to
> apply a formal fix.

It's already too late for me to find the original patched code :P More than two years have passed already ;)

I'll take another look at the issue again when I have some cycles :)

Cheers,
-agentzh
On Tue Oct 27 06:00:14 2009, agentzh@gmail.com wrote:
> It's already too late for me to find the original patched code :P More
> than two years have passed already ;)
>
> I'll take another look at the issue again when I have some cycles :)

Thanks, appreciated.

Best,
David
On Wed Oct 28 13:50:56 2009, DWHEELER wrote:
> On Tue Oct 27 06:00:14 2009, agentzh@gmail.com wrote:
> > It's already too late for me to find the original patched code :P More
> > than two years have passed already ;)
> >
> > I'll take another look at the issue again when I have some cycles :)
>
> Thanks, appreciated.

Hey there agentzh, any luck finding this code?

David
On Fri Nov 12 15:10:56 2010, DWHEELER wrote:
> Hey there agentzh, any luck finding this code?

I got annoyed by this today, too, and so used your test case to write a quick test. There are a few problems with this, though:

* It disables the use of HTML::Entities. We could instead pass a second argument to it, `'<>&"'`, but that would probably be best to do *only* if the string being encoded has the utf8 flag set.

* Even if we were to do that, the default content-type meta header declares the charset as ISO-8859-1. We should probably consider changing that, anyway.

* But we may *not* want to do that if any string is encoded that neither has the utf8 flag set nor contains only ASCII characters. Then, well, who knows what its encoding would be? We really want to discourage mystery encodings, of course, and have people put the `=encoding` tag in their Pod, but there will be cases where that's not done, and then what?

* Even if we do decide to go whole hog and force everything to be UTF-8 (which I could easily get on board with), we run into the problem of supporting older Perls. Pod::Simple requires Perl 5.0.0. Not 5.6, not 5.4, not 5.8, but 5.0! For proper encoding support, we'd have to seriously consider dumping everything before 5.6, and, ideally perhaps, even before 5.8.1. Maybe we could consider doing that just for the XHTML generator? Then we'd have to be pretty careful to write the tests in a compatible way.

* This ignores the HTML output. Maybe that's a good thing, too?

Honestly, Pod::Simple::XHTML generates pretty good HTML now. One can get around all these problems in any one usage by doing something like this:

    use Pod::Simple::XHTML;
    $Pod::Simple::XHTML::HAS_HTML_ENTITIES = 0;
    my $psx = Pod::Simple::XHTML->new;
    $psx->html_header_tags('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />');

And then making sure it applies only to Pod files that properly declare their encoding. Since I think it'd be quite a bit of work to rework Pod::Simple for smarter encoding, maybe that's good enough for now?

Comments?

Best,
David
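The idea in the first bullet, escaping only `<>&"` and leaving everything else as-is, can be sketched in plain Perl without HTML::Entities. This is an illustration only, not the code in Pod::Simple::XHTML; `escape_minimal` and `%ENTITY_FOR` are hypothetical names chosen for this sketch.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are character strings

# Hypothetical lookup table: the only four characters this sketch escapes.
my %ENTITY_FOR = (
    '&' => 'amp',
    '<' => 'lt',
    '>' => 'gt',
    '"' => 'quot',
);

# Hypothetical helper: escape markup-significant characters and leave all
# other characters, including non-ASCII ones, untouched.
sub escape_minimal {
    my ($text) = @_;
    $text =~ s/([&<>"])/'&' . $ENTITY_FOR{$1} . ';'/ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print escape_minimal('你回家了么?&<Yes>;我回家了哈:)'), "\n";
```

Output produced this way is only displayed correctly if the page also declares a matching charset, which is the concern raised in the second bullet.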
Subject: 0001-Add-test-for-not-encoding-high-bit-UTF-8-characters.patch
From 797f79a25de33218dc7c51b7df3134e0b3406fd9 Mon Sep 17 00:00:00 2001
From: David E. Wheeler <david@justatheory.com>
Date: Mon, 21 Feb 2011 19:51:59 -0800
Subject: [PATCH] Add test for not encoding high-bit UTF-8 characters.

---
 lib/Pod/Simple/XHTML.pm   |    2 +-
 t/corpus/chinese_utf8.pod |    6 ++++++
 t/corpus/chinese_utf8.xml |   10 ++++++++++
 t/xhtml10.t               |   22 ++++++++++++++++++++--
 4 files changed, 37 insertions(+), 3 deletions(-)
 create mode 100644 t/corpus/chinese_utf8.pod
 create mode 100644 t/corpus/chinese_utf8.xml

diff --git a/lib/Pod/Simple/XHTML.pm b/lib/Pod/Simple/XHTML.pm
index d388db9..27fb52a 100644
--- a/lib/Pod/Simple/XHTML.pm
+++ b/lib/Pod/Simple/XHTML.pm
@@ -46,7 +46,7 @@ my %entities = (
 );

 sub encode_entities {
-    return HTML::Entities::encode_entities( $_[0] ) if $HAS_HTML_ENTITIES;
+#    return HTML::Entities::encode_entities( $_[0] ) if $HAS_HTML_ENTITIES;
     my $str = $_[0];
     my $ents = join '', keys %entities;
     $str =~ s/([$ents])/'&' . $entities{$1} . ';'/ge;
diff --git a/t/corpus/chinese_utf8.pod b/t/corpus/chinese_utf8.pod
new file mode 100644
index 0000000..e0d5cb6
--- /dev/null
+++ b/t/corpus/chinese_utf8.pod
@@ -0,0 +1,6 @@
+=encoding utf8
+
+=head1 作者
+
+你回家了么?&<Yes>;我回家了哈:)
+
diff --git a/t/corpus/chinese_utf8.xml b/t/corpus/chinese_utf8.xml
new file mode 100644
index 0000000..66348b1
--- /dev/null
+++ b/t/corpus/chinese_utf8.xml
@@ -0,0 +1,10 @@
+<Document start_line="1">
+  <head1 start_line="3">
+    &#20316;&#32773;
+  </head1>
+  <Para start_line="5">
+    &#20320;&#22238;&#23478;&#20102;&#20040;&#65311;&#38;&#60;Yes
+    &#62;
+    ;&#25105;&#22238;&#23478;&#20102;&#21704;:)
+  </Para>
+</Document>
diff --git a/t/xhtml10.t b/t/xhtml10.t
index c3ec202..39ee637 100644
--- a/t/xhtml10.t
+++ b/t/xhtml10.t
@@ -8,8 +8,9 @@ BEGIN {

 use strict;
 use lib '../lib';
-use Test::More tests => 44;
-#use Test::More 'no_plan';
+#use Test::More tests => 44;
+use Test::More 'no_plan';
+use File::Spec;

 use_ok('Pod::Simple::XHTML') or exit;

@@ -397,6 +398,23 @@ is $results, <<'EOF', 'And it should work!';

 EOF

+$ENV{FOO} = 1;
+use utf8;
+initialize($parser, $results);
+ok $parser->parse_file(File::Spec->catfile(qw(corpus chinese_utf8.pod))),
+    'Parse chinese UTF-8 Pod document';
+is $results, <<'EOF', 'Should have unencoded Unicode characters';
+<ul id="index">
+  <li><a href="#pod-">作者</a></li>
+</ul>
+
+<h1 id="pod-">作者</h1>
+
+<p>你回家了么?&amp;&lt;Yes&gt;;我回家了哈:)</p>
+
+EOF
+
+
 sub initialize {
     $_[0] = Pod::Simple::XHTML->new;
     $_[0]->html_header('');
--
1.7.1
On Mon Feb 21 23:02:01 2011, DWHEELER wrote:
> Honestly, Pod::Simple::XHTML generates pretty good HTML now. One can
> get around all these problems in any one usage by doing something like this:
>
>     use Pod::Simple::XHTML;
>     $Pod::Simple::XHTML::HAS_HTML_ENTITIES = 0;
>     my $psx = Pod::Simple::XHTML->new;
>     $psx->html_header_tags('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />');

Okay, thinking about this some more, I added some attributes to allow you to take a bit more control. So as of these commits:

    https://github.com/theory/pod-simple/commit/f9530442e1fff9749c7b11ec363238d74c989314
    https://github.com/theory/pod-simple/commit/0f6dae598c703f2160b9f13b9c9c3d655bf7979a

You can now do this:

    $psx->html_charset('UTF-8');
    $psx->html_encode_chars('&<>"');

And then, assuming that your source document properly uses the `=encoding` command, you should get HTML properly encoded as UTF-8 and with only those four characters encoded. That work for everybody?

I'll give it a day or so before I package this up for release (I'd like to get it into 5.14).

--Theory
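Put together, the two new accessors might be used roughly as follows. This is a sketch that assumes a Pod::Simple release containing the commits above (the ticket lists the fix as 3.16); the sample Pod is taken from the test corpus in the attached patch, and `render_pod` is a hypothetical wrapper written for this example.

```perl
use strict;
use warnings;
use utf8;                  # the Pod literals below are character strings
use Encode qw(encode);
use Pod::Simple::XHTML;

# Build the sample Pod and turn it into UTF-8 bytes, as it would look
# when read from a file on disk.
my $pod = encode('UTF-8', join("\n",
    '=encoding utf8',
    '',
    '=head1 作者',
    '',
    '你回家了么?&<Yes>;我回家了哈:)',
    '',
));

# Hypothetical wrapper: render Pod source to XHTML with a UTF-8 charset
# declaration and only the four markup characters entity-encoded.
sub render_pod {
    my ($source) = @_;
    my $psx = Pod::Simple::XHTML->new;
    $psx->html_charset('UTF-8');
    $psx->html_encode_chars('&<>"');
    $psx->output_string(\my $html);
    $psx->parse_string_document($source);
    return $html;
}

my $html = render_pod($pod);
print "charset declared as UTF-8\n" if $html =~ /charset=UTF-8/;
print "no numeric entities found\n" if $html !~ /&#/;
```

The sketch only inspects the result rather than printing it, since whether further encoding is needed before writing `$html` to a file may depend on the Pod::Simple version in use.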