
This queue is for tickets about the HTML-Element-Extended CPAN distribution.

Report information
The Basics
Id: 48522
Status: resolved
Priority: 0/
Queue: HTML-Element-Extended

People
Owner: MSISK [...] cpan.org
Requestors: KWittrock [...] web.de
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Problem in HTML::ElementTable
Date: Fri, 07 Aug 2009 19:33:09 +0200
To: bug-HTML-Element-Extended [...] rt.cpan.org
From: "K. Wittrock" <KWittrock [...] web.de>
The attached script demonstrates two problems with the as_text method of HTML::ElementTable:

In the last column of the 2nd row, a blank is inserted in the middle of the text. Firefox and Internet Explorer display this text intact, so IMHO the HTML code of this cell should be considered unusual, but either correct or tolerable.

In the following rows, as_text() apparently ignores the <br> tags, thus losing the newlines of multiline text. This makes extraction of info unnecessarily complicated (and sometimes impossible).

Please contact me if you would like to look at the original web page. I will then send you a script to fetch this page with WWW::Mechanize.

I work with Windows XP SP3, perl v5.8.8 and HTML::ElementTable 1.17.

Kind regards
Klaus Wittrock
0 - 7 Uhr Ortsgespräch Ferngespräch Alle Mobilfu nknetze * 01078
Call by Call
Minute: 0,51 Ct.
Takt: 60/60
01078

01078
Call by Call
Minute: 0,51 Ct.
Takt: 60/60
01078

0900531
Call by Call
Minute: 6,80 Ct.
Takt: 60/60
0900531


01013
Call-by-Call
Minute: 0,97 Ct.
Takt: 60/60
01013

01073
01073
Minute: 0,60 Ct.
Takt: 60/60
01073

01073
01073
Minute: 6,90 Ct.
Takt: 60/60
01073

use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ElementTable;

my $file_name = 'demopage.html';
my $root  = HTML::TreeBuilder->new_from_file($file_name);
my $tbl   = $root->find('table');
my $eltbl = HTML::ElementTable->new_from_tree($tbl);

my @tbl_rows;
foreach (0 .. $eltbl->maxrow()) {
    push @tbl_rows, $eltbl->row($_);
}
printrow($_) foreach @tbl_rows;

sub printrow {
    my $zeil_ref = shift;    # Type is HTML::ElementTable::RowGlob
    my @zeile = map({ $_ || '' } $zeil_ref->as_text());
    print "\nRow: @zeile\n";
    print "Cells of this row:\n";
    print "  $_\n" foreach @zeile;
}

$root->delete();
Hi Klaus,

Technically this is a problem with HTML::Element. This class is a subclass of that, and is where the as_text() method is implemented.

If you take a look at the code for HTML::Element, you'll see that the implementation is pretty simple. Adding a clause that detects BR tags and inserts a newline would be trivial. Perhaps it could be made dependent on a parameter that is passed into the as_text() call.

I could override the as_text() method in my class, but the intention is for my extended classes to be fully compatible (therefore embeddable) with traditional HTML::Element objects. If a traverse of any sort is begun at a top-level element that is not aware of my modifications, my features are not invoked (the traverse, as_text(), etc. are not called on a per-element basis).

Cheers,
Matt
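The clause Matt describes might look roughly like the following. This is only a sketch written as a standalone function: text_with_breaks and its br_newline option are hypothetical names, not part of HTML::Element, and unlike the real as_text() the sketch does not skip script or style content. The sample HTML is modelled on the cells in the report.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not an HTML::Element method): collect the text below
# $node and, if br_newline is set, emit "\n" for every <br> element.
sub text_with_breaks {
    my ($node, %opt) = @_;
    my $text = '';
    foreach my $child ($node->content_list) {
        if (!ref $child) {
            $text .= $child;                       # plain text segment
        }
        elsif ($child->tag eq 'br') {
            $text .= "\n" if $opt{br_newline};     # the proposed extra clause
        }
        else {
            $text .= text_with_breaks($child, %opt);   # descend into child element
        }
    }
    return $text;
}

my $root = HTML::TreeBuilder->new_from_content(
    '<table><tr><td>01078<br>Call by Call<br>Minute: 0,51 Ct.</td></tr></table>'
);
my $cell = $root->find('td');

my $joined = text_with_breaks($cell);                    # <br> dropped
my $lines  = text_with_breaks($cell, br_newline => 1);   # <br> becomes "\n"
$root->delete;

print "Without option: $joined\n";
print "With option:\n$lines\n";
```

Making the behaviour opt-in, as suggested above, keeps existing callers unaffected.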
Subject: Re: [rt.cpan.org #48522] Problem in HTML::ElementTable
Date: Sat, 12 Jun 2010 10:50:07 +0200
To: bug-HTML-Element-Extended [...] rt.cpan.org
From: "K. Wittrock" <KWittrock [...] web.de>
Hello Matt,

thank you for your detailed explanation of the BR bug in as_text().

MSISK via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=48522 >
>
> Hi Klaus,
>
> Technically this is a problem with HTML::Element. This class is a
> subclass of that, and is where the as_text() method is implemented.

I think the first choice is always to fix a bug at the place where it is caused. So it will be best that I file a bug report against HTML::Element on this matter.

> If you take a look at the code for HTML::Element, you'll see that the
> implementation is pretty simple. Adding a clause that detects BR tags
> and inserts a newline would be trivial.

That's only half of the story. Newlines in the HTML code are cosmetics that enhance the readability of the code for the programmer. They don't affect the visual appearance of the text and hence should be replaced by blanks within as_text().

> Perhaps it could be made
> dependent on a parameter that is passed into the as_text() call.

Ok, some people may have a different view on the meaning of text in HTML documents.

Kind regards
Klaus
I thought this one was already closed. Klaus, feel free to contact me with your discoveries about HTML::Element.
Subject: Re: [rt.cpan.org #48522] Problem in HTML::ElementTable
Date: Thu, 01 Sep 2011 13:57:09 +0200
To: bug-HTML-Element-Extended [...] rt.cpan.org
From: "K. Wittrock" <KWittrock [...] web.de>
On 29.08.2011 at 18:19, MSISK via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=48522>
>
> thought this one was already closed; Klaus feel free to contact me with
> your discoveries with HTML::Element.
I didn't really get what you expect me to provide in my reply. I will try my very best.

The cells of the HTML table contain info about call-by-call telephone connections. These cells are organized as multiline text, with the lines separated by <br> tags. Line 1 of a cell (normally the 1st line, sometimes the 2nd, that's no problem) contains the CbC number, a series of digits. Since I get the text of the cell as a single-line string, I'm in trouble when line 2 starts with digits. When this happens, I can often get the CbC number from line 5. In rare cases I have to insert the CbC number manually.

Also, extracting all needed info at once from the single-line string makes the regex rather complex. A multiline string would make the extraction more robust against changes in the layout of the cells.

I looked at sub as_text in HTML::Element. My feeling is that there should be some code like "if the current $pile[0] is a <br> tag, shift it off, add a newline to $text and skip the rest of this iteration". Please tell me if you would like me to check this out. And also tell me if I didn't reply as you expected.

By the way, I don't understand the beginning of the loop:

if ( !defined( $pile[0] ) ) { # undef! # no-op }

Wouldn't this enter an infinite loop?

Kind regards
Klaus
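To illustrate why a multiline string simplifies the extraction described above, here is a minimal sketch that parses one cell line by line. The sample text follows the cells shown in the report. The helper name parse_cell and the keys number, price_ct and takt are made up for this example, and "first all-digit line is the CbC number" is an assumption taken from the description above.

```perl
use strict;
use warnings;

# Cell text as it would look if each <br> had become a newline
my $cell_text = "01078\nCall by Call\nMinute: 0,51 Ct.\nTakt: 60/60\n01078";

# Hypothetical helper: examine each line on its own instead of matching
# one complex regex against a single-line string.
sub parse_cell {
    my ($text) = @_;
    my %info;
    foreach my $raw (split /\n/, $text) {
        (my $line = $raw) =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $line;
        if ($line =~ /^Minute:\s*([\d,]+)\s*Ct\.?$/) {
            $info{price_ct} = $1;               # price per minute, German decimal comma
        }
        elsif ($line =~ m{^Takt:\s*(\d+/\d+)$}) {
            $info{takt} = $1;                   # billing interval
        }
        elsif ($line =~ /^\d+$/ && !defined $info{number}) {
            $info{number} = $line;              # first all-digit line: CbC number
        }
    }
    return \%info;
}

my $info = parse_cell($cell_text);
print "$_ = $info->{$_}\n" foreach sort keys %$info;
```

Because every line is matched separately, a provider name that starts with digits can no longer run together with the CbC number.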
Klaus,

I think you have two options if I'm understanding your goal correctly.

1) Use HTML::TableExtract to parse the HTML. By default, this module will convert <br> tags into newlines. (You can also use this module in 'TREE' mode in order to get an HTML::Element structure rather than raw text, if that's what you really want.)

2) If you have an HTML::Element structure, you can use the format() method rather than as_text(). By default, this uses HTML::FormatText to convert the HTML into text and "does the right thing", including the conversion of <br> tags into newlines. In the original script you provided, you can simply replace the call to as_text() with format().

Please let me know if either of these work for you.

Cheers,
Matt
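Both options can be tried on a small inline table, as in the sketch below. It assumes HTML::TableExtract and HTML::FormatText are installed, and the sample HTML is invented for the illustration. The exact whitespace that format() produces (margins, trailing newlines) depends on the formatter's defaults, so it is printed here without claiming a particular layout.

```perl
use strict;
use warnings;
use HTML::TableExtract;
use HTML::TreeBuilder;

my $html = '<table><tr>'
         . '<td>01078<br>Call by Call</td>'
         . '<td>01013<br>Call-by-Call</td>'
         . '</tr></table>';

# Option 1: HTML::TableExtract. With no constraints it picks up every table;
# <br> tags are translated to newlines by default.
my $te = HTML::TableExtract->new();
$te->parse($html);
$te->eof;
my @te_cells;
foreach my $table ($te->tables) {
    foreach my $row ($table->rows) {
        push @te_cells, map { defined $_ ? $_ : '' } @$row;
    }
}
print "TableExtract cell:\n$_\n" foreach @te_cells;

# Option 2: format() on HTML::Element nodes, which uses HTML::FormatText
# when no formatter object is passed in.
my $root = HTML::TreeBuilder->new_from_content($html);
my @formatted = map { $_->format } $root->find('td');
$root->delete;
print "format() cell:\n$_\n" foreach @formatted;
```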
Subject: Re: [rt.cpan.org #48522] Problem in HTML::ElementTable
Date: Fri, 02 Sep 2011 13:52:21 +0200
To: bug-HTML-Element-Extended [...] rt.cpan.org
From: "K. Wittrock" <KWittrock [...] web.de>
Matt,

ActivePerl doesn't offer HTML::FormatText in its ppm repository. So I installed HTML::FormatText::WithLinks, and it works. Method format() returns the multiline texts of the cells as multiline strings, exactly as I need.

Thank you very much for your kind help.

Klaus
Resolved with workaround, using format() rather than as_text()