
This queue is for tickets about the HTML-ExtractMain CPAN distribution.

Report information
The Basics
Id: 82671
Status: open
Priority: 0/
Queue: HTML-ExtractMain

People
Owner: Nobody in particular
Requestors: user42 [...] zip.com.au
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: return plain html not xhtml
Date: Sun, 13 Jan 2013 10:06:23 +1100
To: bug-HTML-ExtractMain [...] rt.cpan.org
From: Kevin Ryde <user42 [...] zip.com.au>
It'd be good if there was an option to get back plain html rather than xhtml. The differences are small but for example xhtml has &apos; which is not a html entity (though some browsers allow it).

Perhaps something like

    extract_main_html($html,
                      output_type => 'html');

which could default to "xhtml", and allow "html". Maybe even allow "treebuilder" to return a crunched HTML::TreeBuilder object, which the caller can then ask for any of its various output styles. (Or is "tree" for a HTML::Tree object better?) Key/value for the options might help with future expansion if for instance having to tune the main-ness of some inputs etc.
On Sat Jan 12 18:06:53 2013, user42@zip.com.au wrote:
> It'd be good if there was an option to get back plain html rather than
> xhtml. The differences are small but for example xhtml has &apos; which
> is not a html entity (though some browsers allow it).
>
> Perhaps something like
>
>     extract_main_html($html,
>                       output_type => 'html');
>
> which could default to "xhtml", and allow "html". Maybe even allow
> "treebuilder" to return a crunched HTML::TreeBuilder object, which the
> caller can then ask for any of its various output styles. (Or is "tree"
> for a HTML::Tree object better?) Key/value for the options might help
> with future expansion if for instance having to tune the main-ness of
> some inputs etc.
Hi,

I'm a bit busy right now, but I think this is a good idea. Would you like to build this functionality? The code's up on GitHub at https://github.com/anirvan/html-extractmain

Thanks!
Subject: Re: [rt.cpan.org #82671] return plain html not xhtml
Date: Sat, 19 Jan 2013 09:33:49 +1100
To: bug-HTML-ExtractMain [...] rt.cpan.org
From: Kevin Ryde <user42 [...] zip.com.au>
"Anirvan Chatterjee via RT" <bug-HTML-ExtractMain@rt.cpan.org> writes:
> Would you like to build this functionality?
I'll have a go. I was also contemplating retaining the <head> part and the enclosing <html>, to have the output a complete html document, just chopped down to its "main" parts.