This queue is for tickets about the HTML-Extract CPAN distribution.

Report information
The Basics
Id: 47128
Status: new
Priority: 0/
Queue: HTML-Extract

People
Owner: Nobody in particular
Requestors: metabaron [...] metabaron.net
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug when dealing with UTF-8 web pages
Date: Thu, 18 Jun 2009 23:17:47 +0200
To: bug-HTML-Extract [...] rt.cpan.org
From: Mathieu Feulvarc'h <metabaron [...] metabaron.net>
Environment:

    Perl v5.8.8 built for i486-linux-gnu-thread-multi
    Debian: Linux serveur 2.6.18-4-686 #1 SMP Wed May 9 23:03:12 UTC 2007 i686 GNU/Linux

I am, of course, using the latest version of HTML::Extract.

First, please excuse my English; I am French.

I am retrieving a web page from a French website (ratp.fr) that declares:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

and I use this code to strip the HTML from it:

    my $extractor = new HTML::Extract;
    my $page_nohtml = $extractor->gethtml($page_result, "tagname=body", "returntype=text");
    print $page_nohtml."\n";

As you can see, nothing fancy. Here are the error messages I received:

    Malformed UTF-8 character (unexpected end of string) in subroutine entry at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
    Malformed UTF-8 character (unexpected end of string) in length at /usr/share/perl5/HTML/TreeBuilder.pm line 988.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
    Malformed UTF-8 character (unexpected end of string) in subroutine entry at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
    Malformed UTF-8 character (unexpected end of string) in length at /usr/share/perl5/HTML/TreeBuilder.pm line 988.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
    Malformed UTF-8 character (unexpected end of string) in subroutine entry at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
    Malformed UTF-8 character (unexpected end of string) in length at /usr/share/perl5/HTML/TreeBuilder.pm line 988.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
    Malformed UTF-8 character (unexpected end of string) in subroutine entry at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
    Malformed UTF-8 character (unexpected end of string) in length at /usr/share/perl5/HTML/TreeBuilder.pm line 988.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.

The first message points to a UTF-8 encoding problem at line 127 of HTML/Extract.pm. The line generating the error is:

    Encode::_utf8_on($content2);

To stop the error messages and display the web page correctly, it should be changed to:

    utf8::decode($content2);

Hope this helps...
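
To illustrate why the two calls behave differently, here is a minimal sketch. It is not code from HTML::Extract: the sample strings and the helper decode_page_bytes are hypothetical, and the ISO-8859-1 fallback is an assumption based on the charset the page declares. Encode::_utf8_on only sets Perl's internal UTF-8 flag without checking the bytes, whereas utf8::decode validates them and returns false when they are not well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample data, not taken from the ticket: the word "café"
# encoded as UTF-8 (two bytes for the accented letter) and as
# ISO-8859-1 (a single byte, 0xE9).
my $utf8_bytes   = "caf\xC3\xA9";
my $latin1_bytes = "caf\xE9";

# Encode::_utf8_on sets the internal UTF-8 flag and does not validate the
# bytes. On ISO-8859-1 input this leaves a string that claims to be UTF-8
# but is not; utf8::valid reports whether the string is consistent.
my $flagged = $latin1_bytes;
Encode::_utf8_on($flagged);
my $flag_only_ok = utf8::valid($flagged) ? 1 : 0;

# utf8::decode validates before converting and returns false on bytes
# that are not well-formed UTF-8.
my $checked   = $latin1_bytes;
my $decode_ok = utf8::decode($checked) ? 1 : 0;

# Hypothetical helper: try UTF-8 first, otherwise treat the bytes as
# ISO-8859-1 (an assumption; a real fix might read the declared charset).
sub decode_page_bytes {
    my ($bytes) = @_;
    my $text = $bytes;
    return $text if utf8::decode($text);
    return Encode::decode('ISO-8859-1', $bytes);
}

printf "flag-only string consistent: %d\n", $flag_only_ok;
printf "utf8::decode accepted ISO-8859-1 bytes: %d\n", $decode_ok;
printf "characters after decoding UTF-8 input: %d\n",
    length decode_page_bytes($utf8_bytes);
printf "characters after decoding ISO-8859-1 input: %d\n",
    length decode_page_bytes($latin1_bytes);
```

Checking the return value of utf8::decode matters here: for a page that really is ISO-8859-1, the proposed one-line change would stop the warnings, but the bytes would still need to be decoded from their actual charset to become proper character data.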