This queue is for tickets about the HTML-Format CPAN distribution.

Report information
The Basics
Id: 9700
Status: open
Priority: 0/
Queue: HTML-Format

People
Owner: nigel.metheringham [...] gmail.com
Requestors: lulu [...] lululand.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.04
Fixed in: (no value)



Subject: FormatText.pm corrupts multi-byte Unicode characters
An HTML file containing multi-byte Unicode text will have some of the text corrupted. I have attached a sample HTML file that demonstrates the problem. I am running Perl 5.8.5, Linux FC3, i686, using HTML-Format 2.0.4.

Here's another Unicode test.
Spanish:  ¿Dónde Está la Unicode?
 
French:  Il a été affligé par une maladie grave à 13 ans.
 
German:  Bleigießen, Wörterbuch über
 
Norwegian:  Från och med 1/1 2005 är det fri entré till museets utställningar.
 
Swedish:  atomvåpen vært større
 
 
Chinese:  十峰中文学校
 
Vietnamese:  giá sản phẩm TV kỹ thuật số 
 
 Arabic: بيانات صحفية حكومي
 
 
 
 
From: lulu [...] lululand.com
The problem comes from a tr that, according to comments in Formatter.pm, is there to handle soft hyphens. Commenting out that line fixes the problem. I think it is better to leave multi-byte characters intact than to translate soft hyphens to spaces. I have attached a patch; it applies on top of the patch I previously submitted for bug #9602.

[guest - Thu Jan 13 22:41:18 2005]:
> An HTML file containing multi-byte Unicode text will have some of the
> text corrupted.
>
> I have attached a sample HTML file that demonstrates the problem.
>
> I am running Perl 5.8.5, Linux FC3, i686, using HTML-Format 2.0.4.
--- FormatText.pm 2005-01-13 19:33:09.000000000 -0800
+++ FormatText.pm.sav 2005-01-13 19:35:21.000000000 -0800
@@ -188,10 +188,7 @@
     my $self = shift;
     my $text = shift;
 
-    # uncomment the following if you want soft-hyphen translation.
-    # (according to Formatter.pm)
-    # however, it will corrupt multi-byte unicode characters.
-# $text =~ tr/\xA0\xAD/ /d;
+    $text =~ tr/\xA0\xAD/ /d;
 
     if (defined $self->{vspace}) {
         if ($self->{out}) {
From: lulu [...] lululand.com
I created the previous patch incorrectly. Attached is the corrected version. My sincere apologies.
--- FormatText.pm.sav 2005-01-13 19:35:21.000000000 -0800
+++ FormatText.pm 2005-01-13 19:33:09.000000000 -0800
@@ -188,7 +188,10 @@
     my $self = shift;
     my $text = shift;
 
-    $text =~ tr/\xA0\xAD/ /d;
+    # uncomment the following if you want soft-hyphen translation.
+    # (according to Formatter.pm)
+    # however, it will corrupt multi-byte unicode characters.
+# $text =~ tr/\xA0\xAD/ /d;
 
     if (defined $self->{vspace}) {
         if ($self->{out}) {
From: martin.ferrari [...] gmail.com
On Sat Jan 15 01:45:03 2005, guest wrote:
>
> I created the previous patch incorrectly. Attached is the corrected
> version. My sincere apologies.
From what I understand, this is a bug in HTML::TreeBuilder, which doesn't set the utf8 flag when reading utf8 content. See this example:

$ perl -Iblib/lib -e '
use encoding "utf-8", STDOUT => "utf-8";
use utf8;
use HTML::Element;
use HTML::FormatText;
$e = new HTML::Element("p");
$e->push_content("fóo");
print utf8::is_utf8($e->as_XML) ? "is" : "is not"," UTF-8\n";
print HTML::FormatText->format_string($e->as_XML);'
is UTF-8
fóo
Is this still an issue with current perls and/or current HTML::TreeBuilder? [a failing test would be really useful here]

If I don't hear anything back on this I'll close it down - I've just taken on maintenance of this module and am trying to clear the RT queue.
Subject: [rt.cpan.org #9700] Problem still exists in 2.11
Date: Thu, 13 Nov 2014 17:55:04 +0700
To: bug-HTML-Format [...] rt.cpan.org
From: Pongtawat Chippimolchai <pongtawat.c [...] gmail.com>
I just ran into the problem described by this bug in HTML-Format 2.11. FormatText still corrupts Thai UTF-8 content because the tr line is still there. It could easily be solved by commenting out that tr line. HTML-Format 2.11, Perl 5.14.2
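
For reference, the corruption reported in this ticket can be reproduced without HTML::FormatText. The following is a minimal sketch, not code from the module: it applies the same tr/\xA0\xAD/ /d expression first to undecoded UTF-8 octets and then to a decoded character string. The variable names and the hex_dump helper are made up for illustration, and the sketch assumes the text reaching the formatter is an undecoded byte string, which is what the utf8-flag comment above suggests.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a string as space-separated hex codes.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

# "à" (U+00E0) is the octets C3 A0 in UTF-8; "í" (U+00ED) is C3 AD.
# Both contain an octet that the tr in FormatText.pm targets.
my $bytes = "\xC3\xA0\x20\xC3\xAD";    # undecoded UTF-8 octets

# Case 1: tr on raw octets. A0 becomes a space and AD is deleted,
# leaving truncated (invalid) UTF-8 sequences.
(my $as_bytes = $bytes) =~ tr/\xA0\xAD/ /d;

# Case 2: decode first, so tr sees characters. The string holds
# U+00E0, U+0020, U+00ED; neither U+00A0 nor U+00AD is present,
# so nothing changes.
my $chars = decode('UTF-8', $bytes);
(my $as_chars = $chars) =~ tr/\xA0\xAD/ /d;
my $round_trip = encode('UTF-8', $as_chars);

print "tr on octets:     ", hex_dump($as_bytes),   "\n";  # C3 20 20 C3
print "tr on characters: ", hex_dump($round_trip), "\n";  # C3 A0 20 C3 AD
```

The same reasoning applies to the Thai case: U+0E2D is E0 B8 AD in UTF-8, so the tr removes its final octet when it runs on undecoded input. If the input were decoded before formatting, the tr line would only touch real non-breaking spaces and soft hyphens.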