
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 62223
Status: resolved
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: step.aleksey [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.70
Fixed in: (no value)



Subject: HTML parser error : htmlParseEntityRef: expecting ';'
Hello. I use XML-LibXML for parsing HTML documents. With version 1.69 everything was fine, but since I started using 1.70 I get warnings, although the data I want to extract is still correct.

Example test.pl:

    use strict;
    use warnings;
    use XXX::Sender;
    use XXX::Parser;

    my $url = 'http://video.mail.ru/mail/kadaj.ff/16/18.html';
    my $sender = XXX::Sender->new();
    my $page_content = $sender->get($url);
    return unless $page_content;
    my $parser = XXX::Parser->new();
    $parser->parse_html($page_content);
    my $favicon_uri = $parser->get_favicon_src();
    if($favicon_uri){
        print "Favicon_uri:".$favicon_uri."\n";
    }

Required modules:

    package XXX::Sender;
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;
    use base qw( Class::Accessor::Fast );
    __PACKAGE__->mk_ro_accessors( qw( parser root ) );

    sub new{
        my $class = shift;
        $class = ref $class || $class;
        my $self = bless {}, $class;
        $self = $self->init;
        return $self;
    }

    sub init{
        my $self = shift;
        $self->{ua} = LWP::UserAgent->new();
        $self->{ua}->timeout(40);
        return $self;
    }

    sub get{
        my $self = shift;
        my $uri = shift;
        my $req = HTTP::Request->new(GET =>$uri);
        my $res = $self->{ua}->request($req);
        if($res->is_success){
            return $res->content;
        }
        return undef;
    }

    1;

    package XXX::Parser;
    use strict;
    use XML::LibXML;
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::skipDTD = 1;
    use base qw( Class::Accessor::Fast );
    __PACKAGE__->mk_ro_accessors( qw( parser root ) );

    sub new{
        my $class = shift;
        $class = ref $class || $class;
        my $self = bless {}, $class;
        $self = $self->_init;
        return $self;
    }

    sub _init{
        my $self = shift;
        my $parser = XML::LibXML->new();
        $parser->expand_entities(0);
        $parser->validation(0);
        $parser->no_network(1);
        $parser->recover_silently(1);
        $self->{parser} = $parser;
        return $self;
    }

    sub parse_html{
        my $self = shift;
        my $html = shift;
        my $dom = $self->parser->parse_html_string($html);
        $self->{root} = $dom->documentElement();
        return $self->{root};
    }

    sub get_favicon_src{
        my $self = shift;
        foreach my $node (@{$self->root->findnodes('//link[@rel="shortcut icon"]')}){
            return $node->getAttribute("href");
        }
        return undef;
    }

    1;

And this is the output I get when running it:

    HTML parser error : htmlParseEntityRef: expecting ';'
    rel="video_src" href="http://img.mail.ru/r/video2/player_v2.swf?orig=2&movieSrc
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    tp://img.mail.ru/r/video2/player_v2.swf?orig=2&movieSrc=mail/kadaj.ff/16/18&host
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    player_v2.swf?orig=2&movieSrc=mail/kadaj.ff/16/18&host=video.mail.ru&contentHost
    ^
    validity error : ID cln6259 already defined
    <a class="lw lw-mail" href="http://mail.ru" name="cln6259"><i></i></a>
    ^
    validity error : ID cln6259 already defined
    <a class="lw lw-video" href="http://video.mail.ru" name="cln6259"><i></i></a>
    ^
    validity error : ID cln4880 already defined
    A"><a href="https://money.mail.ru/" name="cln4880" class="shAaa" target="_blank"
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <a href="http://www.mail.ru/agent?message&to=kadaj.ff@mail.ru" title="Щелк
    ^
    HTML parser error : Element script embeds close tag
    '<i class="mf_spIco" onclick="return Captcha.hide();"></i>' +
    ^
    HTML parser error : Element script embeds close tag
    '</form>' +
    ^
    HTML parser error : Element script embeds close tag
    '</div>' +
    ^
    HTML parser error : Element script embeds close tag
    '</div>';
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <param name="flashvars" value="orig=2&movieSrc=mail/kadaj.ff/16/18" /
    ^
    HTML parser error : Element script embeds close tag
    de").value = fotoNothing; gebi("lj-code").value = "<lj-embed>" + fotoNothing + '
    ^
    validity error : ID goleft_listId already defined
    <div id="goleft_listId"><a href="#" class="u"><img height="22" width="100%
    ^
    validity error : ID goright_listId already defined
    <div id="goright_listId"><a href="#" class="d"><img height="22" width="100
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <img src="http://rs.mail.ru/d275994.gif?rnd=203403146&ts=1287384542" width="1" h
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <img src="http://rs.mail.ru/d288730.gif?rnd=157728719&ts=1287384542" width="1" h
    ^
    HTML parser error : Element script embeds close tag
    ww.macromedia.com/shockwave/download/index.cgiP1_Prod_Version=ShockwaveFlash" />
    ^
    HTML parser error : Element script embeds close tag
    edia.com/shockwave/download/index.cgiP1_Prod_Version=ShockwaveFlash" /></object>
    ^
    HTML parser error : Element script embeds close tag
    rs.mail.ru/b11675457.jpg" width="200" height="300" border="0" alt="" title="" />
    ^
    HTML parser error : Element script embeds close tag
    ail.ru/b11675457.jpg" width="200" height="300" border="0" alt="" title="" /></a>
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    "rb_banner"><a name="clb288730" href="http://1link.mail.ru/c.php?site_id=49118&p
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    ><a name="clb288730" href="http://1link.mail.ru/c.php?site_id=49118&p=231&sub_id
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    сайте</a> <a type="mrim-status-9" href="http://www.mail.ru/agent?message&to
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    сайте</a> <a type="mrim-status-9" href="http://www.mail.ru/agent?message&to
    ^
    HTML parser error : Tag wbr invalid
    y.mail.ru/mail/kolpakova-d/" class="booster-sc">Дарья Колпаков<wbr
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    сайте</a> <a type="mrim-status-9" href="http://www.mail.ru/agent?message&to
    ^
    HTML parser error : Tag wbr invalid
    .mail.ru/mail/green_apple-/" class="booster-sc">Мария Сергеевн<wbr
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    сайте</a> <a type="mrim-status-9" href="http://www.mail.ru/agent?message&to
    ^
    HTML parser error : Tag wbr invalid
    href="http://my.mail.ru/mail/roni-74/" class="booster-sc">Вероника<wbr
    ^
    HTML parser error : Tag wbr invalid
    /mail/roni-74/" class="booster-sc">Вероника<wbr /> Свиридов<wbr
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <!-- start slot 1879 --><img src="http://rs.mail.ru/d225277.gif?rnd=174129342&ts
    ^
    HTML parser error : htmlParseEntityRef: expecting ';'
    <!-- Start slot 3 --><img src="http://rs.mail.ru/d292152.gif?rnd=991533156&ts=12
    ^
    HTML parser error : Element script embeds close tag
    ('<sc'+'ript type="text/javascript" src="http://an.yandex.ru/system/context.js">
    ^
    HTML parser error : Element script embeds close tag
    +'ript type="text/javascript" src="http://autocontext.begun.ru/autocontext2.js">
    ^

After that I get the correct answer:

    Favicon_uri:http://video.mail.ru/favicon.ico

Is this a bug in the library, or am I doing something wrong?
Hi Aleksey,
thanks for your report and sorry it took us so long to get to you. Can you demonstrate the problem using a self-contained Test::More script (see http://metacpan.org/module/Test::Tutorial ) with a local file (or preferably one contained within the test)? We cannot really afford to test the output of live web services, because doing that will:
1. Be subject to the returned content changing.
2. Overload the web service, and may not fall under its terms of use.
3. Make us depend on LWP::UserAgent.
4. Be prone to networking problems.
Regards,
-- Shlomi Fish
On Mon Oct 18 03:14:57 2010, step.aleksey@gmail.com wrote:
> Hello.
> I used XML-LIBXML for parsing HTML documents. In version 1.69
> everything
> was ok, but when I started to use 1.70, I took a warnings. But data,
> what I wanted to get is right.
[SNIPPED]
From: tim [...] tim-landscheidt.de
Attached is a test case that fails for me with 1.74 and
Subject: xmlparser-test.pl
#!/usr/bin/perl -w

use strict;
use warnings;

use Test::More tests => 2;
use Test::Output;
use XML::LibXML;

my $XMLParser = new XML::LibXML (recover => 2, suppress_errors => 1, suppress_warnings => 1) or die ($!);
ok (defined ($XMLParser), 'Parser created');
stderr_is (sub { $XMLParser->parse_html_string ('<html><body><a href="http://host/script?a=1&b=2"></body></html>') }, '', 'Incorrect HTML parsed without warnings');
From: tim [...] tim-landscheidt.de
"libxml2-2.7.8-6.fc16.i686" got cut off :-).
Hi Tim,
On Sat May 12 15:04:44 2012, tim@tim-landscheidt.de wrote:
> Attached is a test case that fails for me with 1.74 and
> "libxml2-2.7.8-6.fc16.i686
Here is the result of running this on Mageia Linux 2:
<SHELL>
shlomif@lap:~$ rpm -q lib64xml2-devel
lib64xml2-devel-2.7.8-14.20120229.1.mga2
shlomif@lap:~$ perl -MXML::LibXML\ 9999
XML::LibXML version 9999 required--this is only version 1.95 at /usr/lib/perl5/vendor_perl/5.14.2/x86_64-linux-thread-multi/XML/LibXML.pm line 52.
BEGIN failed--compilation aborted.
shlomif@lap:~$ prove xmlparser-test.pl
xmlparser-test.pl .. ok
All tests successful.
Files=1, Tests=2, 1 wallclock secs ( 0.03 usr 0.00 sys + 0.08 cusr 0.01 csys = 0.12 CPU)
Result: PASS
shlomif@lap:~$ perl xml
xmlparser-test.pl xmlparser-test.pl~
shlomif@lap:~$ perl xmlparser-test.pl
1..2
ok 1 - Parser created
ok 2 - Incorrect HTML parsed without warnings
</SHELL>
I see that libxml2’s version is identical (modulo some different vendor patches), but XML::LibXML's version is incredibly out of date: there's already 1.97 and XML-LibXML-1.74 was released 11 months ago according to https://metacpan.org/release/SHLOMIF/XML-LibXML-1.74/ . Can you test it with XML-LibXML-1.97? You can do "perl Makefile.PL; make; make test" and then run your script using "perl -Mblib".
Regards,
-- Shlomi Fish
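Since the exchange above hinges on which XML::LibXML and libxml2 versions are actually being loaded, a small check along the following lines may help when comparing two setups. It is only a sketch: the helper name libxml_versions is made up for illustration, and it assumes an XML::LibXML recent enough to provide LIBXML_DOTTED_VERSION and LIBXML_RUNTIME_VERSION.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): collect the version
# details that matter when comparing two installations.
sub libxml_versions {
    return (
        module   => XML::LibXML->VERSION,                   # version of the Perl binding
        path     => $INC{'XML/LibXML.pm'},                  # file that was actually loaded
        compiled => XML::LibXML::LIBXML_DOTTED_VERSION(),   # libxml2 the binding was built against
        runtime  => XML::LibXML::LIBXML_RUNTIME_VERSION(),  # libxml2 loaded at run time
    );
}

my %v = libxml_versions();
printf "%-9s %s\n", "$_:", $v{$_} for qw(module path compiled runtime);
```

When this is run with "perl -Mblib" from an unpacked and built distribution directory, the path line should point into blib/ rather than at the system-wide installation, which confirms that the freshly built copy is the one under test.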
From: tim [...] tim-landscheidt.de
After an update to Fedora Rawhide's perl-XML-LibXML-1.97-1.fc18.i686
Hi Tim,
On Sat May 12 20:50:00 2012, tim@tim-landscheidt.de wrote:
> After an update to Fedora Rawhide's perl-XML-LibXML-1.97-1.fc18.i686
It seems your message got cut off again.
Regards,
-- Shlomi Fish
From: tim [...] tim-landscheidt.de
> > After an update to Fedora Rawhide's perl-XML-LibXML-1.97-1.fc18.i686
From: tim [...] tim-landscheidt.de
Okay, adding some newlines apparently didn't work, so let's try with another browser: After an update to Fedora Rawhide's perl-XML-LibXML-1.97-1.fc18.i686 with no update of libxml2, the test case succeeds. So I'd propose to close this bug as "fixed in 1.97". Thanks!
On Sun May 13 13:07:03 2012, tim@tim-landscheidt.de wrote:
> Okay, adding some newlines apparently didn't work, so let's
> try with another browser: After an update to Fedora Rawhide's
> perl-XML-LibXML-1.97-1.fc18.i686 with no update of libxml2,
> the test case succeeds. So I'd propose to close this bug
> as "fixed in 1.97". Thanks!
Thanks! I have resolved this bug. Aleksey: if you're interested in providing a better test case that still fails with XML-LibXML-1.97, please comment on this bug (which will re-open it). Otherwise, the bug will remain resolved due to lack of responsiveness.
Regards,
-- Shlomi Fish