Bug #38666 for XML-LibXML: URI Option Does Not Work

Fri Aug 22 14:24:44 2008 dwheeler [...] cpan.org - Ticket created

CC:	Eric Glover <eric [...] searchme.com>
Subject:	URI Option Does Not Work
Date:	Fri, 22 Aug 2008 11:24:22 -0700
To:	bug-xml-libxml [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

Howdy,

This prints an undef:

    #!/usr/local/bin/perl -w

    use strict;
    use warnings;
    use feature ':5.10';
    use XML::LibXML;

    my $html = '<html><body>foo</body></html>';

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, { URI => 'http://foo.com/' });
    say $doc->baseURI;

Shouldn't baseURI return 'http://foo.com/'? Or am I mis-reading the docs?

Thanks,

David

Sat Aug 23 04:04:14 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #38666] URI Option Does Not Work
Date:	Sat, 23 Aug 2008 10:03:55 +0200
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

Hi David,

This appears to be a documentation bug.

The synopsis suggests a hash reference passed to parse_*string() functions. However, if you look at the actual documentation you find that the function expects a string as the optional second parameter.

In this case the synopsis is wrong and the function description is correct. I tested it with your code and it works nicely.

Another remark: if you know that your input is XHTML (rather than HTML strict) I suggest that you use the normal parse_string() function instead of its html sibling.

Cheers
Christian

-- Christian Glahn <christian.glahn@lo-f.at>

Sat Aug 23 04:04:16 2008 The RT System itself - Status changed from 'new' to 'open'

Sat Aug 23 09:48:53 2008 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #38666] URI Option Does Not Work
Date:	Sat, 23 Aug 2008 06:48:25 -0700
To:	bug-XML-LibXML [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Aug 23, 2008, at 01:04, Christian Glahn via RT wrote:

> This appears to be a documentation bug.
>
> The synopsis suggests a hash reference passed to parse_*string()
> functions. However, if you look at the actual documentation you find
> that the function expects a string as the optional second parameter.
>
> In this case the synopsis is wrong and the function description is
> correct. I tested it with your code and it works nicely.

I just did this:

    my $html = '<html><body>foo</body></html>';

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, 'http://foo.com/');
    say $doc->baseURI;

And it still printed an undef.

> Another remark: if you know that your input is XHTML (rather than HTML
> strict) I suggest that you use the normal parse_string() function
> instead of its html sibling.

This is why I'm passing a hash. I'm parsing arbitrary Web pages that will have god knows what kind of HTML in them. So my code actually looks like this:

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, {
        suppress_errors   => 1, # Suppress errors
        suppress_warnings => 1, # Suppress warnings
        no_network        => 1, # Don't make network requests.
        recover           => 1, # Relaxed parsing for bad HTML.
        URI               => 'http://foo.com/',
    });
    say $doc->baseURI;

Which also, BTW, outputs undef. And so does this:

    my $doc = $parser->parse_html_string($html, 'http://foo.com/', {
        suppress_errors   => 1, # Suppress errors
        suppress_warnings => 1, # Suppress warnings
        no_network        => 1, # Don't make network requests.
        recover           => 1, # Relaxed parsing for bad HTML.
    });
    say $doc->baseURI;

IOW, there is no way I can see to properly set baseURI.

David

Sun Aug 24 13:38:31 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #38666] URI Option Does Not Work
Date:	Sun, 24 Aug 2008 19:38:09 +0200
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

Hi David,

I dived into the code and found two issues, and one of them explains your problem.

You use the baseURI function. baseURI() uses libxml2's xmlNodeGetBase() function, which determines the base URL for HTML documents from the base tag in the document's header. Your document has no header and no base tag. Hence, the result is correctly undef.

But there is good news for you: on the document node of your DOM tree, and ONLY for this node, you can call the URI function, which returns the internal URL that has been set by the parse function.

Therefore, in line 5 instead of saying $doc->baseURI; you should say $doc->URI;.

Cheers and thanks for the report
Christian

-- Christian Glahn <christian.glahn@lo-f.at>

Sun Aug 24 13:45:45 2008 phish [...] cpan.org - Correspondence added 90 min

The problem was that baseURI() works slightly differently for XML and for HTML documents. To access the URI that was set at parse time in a consistent way, one should call the URI() function on the document root.
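
A minimal sketch of the distinction summarized above, reusing the HTML string and the options-hash call from the original report. The comments describe what this thread reports for each call; treat them as assumptions for other XML::LibXML or libxml2 versions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Same input as in the original report: no <head>, no <base> tag.
my $html = '<html><body>foo</body></html>';

my $parser = XML::LibXML->new;
my $doc = $parser->parse_html_string($html, { URI => 'http://foo.com/' });

# URI() on the document node returns the URL given at parse time.
my $uri = $doc->URI;

# baseURI() consults the document content (for HTML, a <base> tag);
# the thread reports undef here because the input has none.
my $base = $doc->baseURI;

print 'URI:     ', (defined $uri  ? $uri  : 'undef'), "\n";
print 'baseURI: ', (defined $base ? $base : 'undef'), "\n";
```

The `defined` checks are only there so that printing an undef value does not trigger a warning.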

Sun Aug 24 13:45:47 2008 phish [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Aug 24 13:45:48 2008 phish [...] cpan.org - Given to PHISH

Mon Aug 25 18:53:56 2008 dwheeler [...] cpan.org - Correspondence added

CC:	Eric Glover <eric [...] searchme.com>
Subject:	Re: [rt.cpan.org #38666] URI Option Does Not Work
Date:	Mon, 25 Aug 2008 15:53:42 -0700
To:	bug-XML-LibXML [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Aug 24, 2008, at 10:38, Christian Glahn via RT wrote:

> I dived into the code and found two issues and one of them explains
> your problem.

Thank you, Christian.

> You use the baseURI function. baseURI() uses libxml2's
> xmlNodeGetBase() function, which determines the base URL for HTML
> documents from the base tag in the document's header. Your document
> has no header and no base tag. Hence, the result is correctly undef.

Ah, okay, that makes sense.

> But there are good news for you: on the document node of your DOM tree
> and ONLY for this node, you can call the URI function, which returns
> the internal URL that has been set by the parse function.
>
> Therefore, in line 5 instead of saying $doc->baseURI; you should say
> $doc->URI;.

Great, this works:

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, { URI => 'http://foo.com/' });
    say $doc->URI;

Good, that's exactly what I need. Any chance of the docs being updated to reflect this?

Note that this does not, however (not that I care, but since it's what the docs seem to indicate):

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, 'http://foo.com/');
    say $doc->URI;

Best,

David

Mon Aug 25 18:53:58 2008 The RT System itself - Status changed from 'resolved' to 'open'

Sun Nov 02 15:51:41 2008 pajas [...] matfyz.cz - Correspondence added

I believe the current documentation does not indicate that parse_html_string($html,$uri) should do something useful (unlike parse_html_string($html,{URI=>$uri}), which works as expected).

I have added documentation of $doc->URI, added a $doc->setURI method, and added documentation of $node->baseURI and $node->setBaseURI. The changes are in SVN and will appear in 1.67 (to be released soon).

With this, I'm closing this ticket. Please do not reopen it, unless you want to complain about the changes made in SVN.

-- Petr
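
The $doc->setURI method mentioned above might be used roughly as follows. This is a sketch, assuming XML::LibXML 1.67 or later (the release this reply says will carry the method); the URL is the placeholder from the original report.

```perl
use strict;
use warnings;
use XML::LibXML;

my $html = '<html><body>foo</body></html>';

# Parse without supplying a URI up front.
my $parser = XML::LibXML->new;
my $doc    = $parser->parse_html_string($html);

# Attach (or change) the document URI after parsing.
$doc->setURI('http://foo.com/');

# URI() on the document node now reports the value just set.
print $doc->URI, "\n";
```

This is an alternative to passing { URI => ... } at parse time, for cases where the URL is only known after the document has been parsed.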

Sun Nov 02 15:51:43 2008 pajas [...] matfyz.cz - Status changed from 'open' to 'resolved'