Subject: Bug when dealing with UTF-8 web pages
Date: Thu, 18 Jun 2009 23:17:47 +0200
To: bug-HTML-Extract [...] rt.cpan.org
From: Mathieu Feulvarc'h <metabaron [...] metabaron.net>
Perl v5.8.8 built for i486-linux-gnu-thread-multi
Debian: Linux serveur 2.6.18-4-686 #1 SMP Wed May 9 23:03:12 UTC 2007
i686 GNU/Linux
Of course, I am using the latest version of HTML::Extract.
First, please excuse my English; I am French.
I am retrieving a web page from a French website (ratp.fr):
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
and using this code to strip the HTML from it:
my $extractor = HTML::Extract->new;
my $page_nohtml = $extractor->gethtml($page_result, "tagname=body", "returntype=text");
print $page_nohtml . "\n";
As you can see, nothing fancy.
Here are the error messages I received:
Malformed UTF-8 character (unexpected end of string) in subroutine entry
at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
Malformed UTF-8 character (unexpected end of string) in length at
/usr/share/perl5/HTML/TreeBuilder.pm line 988.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
Malformed UTF-8 character (unexpected end of string) in subroutine entry
at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
Malformed UTF-8 character (unexpected end of string) in length at
/usr/share/perl5/HTML/TreeBuilder.pm line 988.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
Malformed UTF-8 character (unexpected end of string) in subroutine entry
at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
Malformed UTF-8 character (unexpected end of string) in length at
/usr/share/perl5/HTML/TreeBuilder.pm line 988.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
Malformed UTF-8 character (unexpected end of string) in subroutine entry
at /usr/local/share/perl/5.8.8/HTML/Extract.pm line 127.
Malformed UTF-8 character (unexpected end of string) in length at
/usr/share/perl5/HTML/TreeBuilder.pm line 988.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at /usr/share/perl5/HTML/TreeBuilder.pm line 1106.
Reading the first line, we can see that the problem is related to UTF-8
encoding.
The line generating the error is:
Encode::_utf8_on($content2);
To stop the error messages and display the web page correctly, it
should be changed to:
utf8::decode($content2);
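A minimal sketch of why the two calls behave differently, assuming the page content arrives as raw ISO-8859-1 bytes. The sample strings and variable names are made up for illustration and are not taken from Extract.pm. Encode::_utf8_on only sets the internal UTF-8 flag without looking at the bytes, while utf8::decode checks them first and leaves the string unchanged when they are not well-formed UTF-8:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample data: "cafe" with an acute e, once as ISO-8859-1
# bytes (0xE9) and once as UTF-8 bytes (0xC3 0xA9).
my $latin1_bytes = "caf\xE9";
my $utf8_bytes   = "caf\xC3\xA9";

# Encode::_utf8_on() only flips the internal UTF-8 flag; it never validates.
my $blind = $latin1_bytes;
Encode::_utf8_on($blind);
my $blind_flagged = utf8::is_utf8($blind) ? 1 : 0;   # the flag is now set
my $blind_valid   = utf8::valid($blind)   ? 1 : 0;   # but the data is malformed

# utf8::decode() validates first: on ISO-8859-1 bytes it returns false
# and leaves the string as it was.
my $checked        = $latin1_bytes;
my $latin1_decoded = utf8::decode($checked) ? 1 : 0;

# On genuine UTF-8 bytes it returns true and decodes in place.
my $real         = $utf8_bytes;
my $utf8_decoded = utf8::decode($real) ? 1 : 0;

printf "_utf8_on: flagged=%d consistent=%d\n", $blind_flagged, $blind_valid;
printf "utf8::decode on ISO-8859-1 bytes: ok=%d\n", $latin1_decoded;
printf "utf8::decode on UTF-8 bytes: ok=%d length=%d\n",
    $utf8_decoded, length($real);
```

With ISO-8859-1 input, utf8::decode simply declines to touch the bytes, which is consistent with the warnings disappearing once the call is swapped.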
Hope this helps.