Bug #82671 for HTML-ExtractMain: return plain html not xhtml

Sat Jan 12 18:06:53 2013 user42 [...] zip.com.au - Ticket created

Subject:	return plain html not xhtml
Date:	Sun, 13 Jan 2013 10:06:23 +1100
To:	bug-HTML-ExtractMain [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

It'd be good if there was an option to get back plain html rather than xhtml. The differences are small but for example xhtml has &apos; which is not a html entity (though some browsers allow it).

Perhaps something like

    extract_main_html($html,
                      output_type => 'html');

which could default to "xhtml", and allow "html". Maybe even allow "treebuilder" to return a crunched HTML::TreeBuilder object, which the caller can then ask for any of its various output styles. (Or is "tree" for a HTML::Tree object better?)

Key/value for the options might help with future expansion if for instance having to tune the main-ness of some inputs etc.
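As a rough illustration of the key/value interface suggested above, the sketch below dispatches on an output_type option. Everything here is hypothetical: serialize_main is an invented helper name, output_type and its values are only the ticket's proposal, and this is not how HTML::ExtractMain is implemented. The only real calls assumed are the as_XML and as_HTML methods that HTML::Element (and so HTML::TreeBuilder) provides.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical mapping from the proposed output_type values to a serializer.
# as_XML and as_HTML are existing HTML::Element methods; the option names
# are only the ticket's suggestion.
my %SERIALIZER = (
    xhtml => sub { $_[0]->as_XML },     # proposed default
    html  => sub { $_[0]->as_HTML },    # plain html output
    tree  => sub { $_[0] },             # hand the tree object back to the caller
);

# serialize_main($tree, output_type => 'html')
# $tree is expected to be an HTML::TreeBuilder (or HTML::Element) object
# already pruned down to the "main" content.
sub serialize_main {
    my ($tree, %opt) = @_;

    my $type = delete $opt{output_type} // 'xhtml';

    # Reject unknown keys so that options added later can't be misspelled silently.
    croak "unknown option(s): @{[ sort keys %opt ]}" if %opt;

    my $code = $SERIALIZER{$type}
        or croak "unknown output_type '$type'";
    return $code->($tree);
}
```

With HTML::TreeBuilder installed, a caller could build a tree with HTML::TreeBuilder->new_from_content($html) and then call serialize_main($tree, output_type => 'html'). Taking options as key/value pairs means later tuning knobs can be added without changing the positional arguments.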

Tue Jan 15 03:50:22 2013 ANIRVAN [...] cpan.org - Correspondence added

On Sat Jan 12 18:06:53 2013, user42@zip.com.au wrote:

> It'd be good if there was an option to get back plain html rather than
> xhtml. The differences are small but for example xhtml has &apos; which
> is not a html entity (though some browsers allow it).
>
> Perhaps something like
>
>     extract_main_html($html,
>                       output_type => 'html');
>
> which could default to "xhtml", and allow "html". Maybe even allow
> "treebuilder" to return a crunched HTML::TreeBuilder object, which the
> caller can then ask for any of its various output styles. (Or is "tree"
> for a HTML::Tree object better?) Key/value for the options might help
> with future expansion if for instance having to tune the main-ness of
> some inputs etc.

Hi, I'm a bit busy right now, but I think this is a good idea. Would you like to build this functionality? The code's up on GitHub at https://github.com/anirvan/html-extractmain

Thanks!

Tue Jan 15 03:50:24 2013 The RT System itself - Status changed from 'new' to 'open'

Fri Jan 18 17:33:58 2013 user42 [...] zip.com.au - Correspondence added

Subject:	Re: [rt.cpan.org #82671] return plain html not xhtml
Date:	Sat, 19 Jan 2013 09:33:49 +1100
To:	bug-HTML-ExtractMain [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

"Anirvan Chatterjee via RT" <bug-HTML-ExtractMain@rt.cpan.org> writes:

> > Would you like to build this functionality?

I'll have a go. I was also contemplating retaining the <head> part and the enclosing <html>, to have the output a complete html document, just chopped down to its "main" parts.