
This queue is for tickets about the HTML-TableParser CPAN distribution.

Report information
The Basics
Id: 1490
Status: resolved
Priority: 0/
Queue: HTML-TableParser

People
Owner: Nobody in particular
Requestors: xma [...] arctur.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.31
Fixed in: 0.32



Subject: TableParser patch


diff -cr HTML-TableParser-0.31/TableParser.pm HTML-TableParser-0.31.orig/TableParser.pm
*** HTML-TableParser-0.31/TableParser.pm Sat Aug 31 10:11:10 2002
--- HTML-TableParser-0.31.orig/TableParser.pm Fri Apr 19 22:56:06 2002
***************
*** 921,935 ****
  my ( $self, $attr, $line ) = @_;
  my $otbl = $self->{Tables}[-1];
! my $oids;
! if (! $otbl) {
! # print STDERR "No old table\n";
! $oids = undef;
! }else{
! $oids = $otbl->ids;
! }
! my $tbl = HTML::TableParser::Table->new( $self,
! $oids,
  $self->{reqs}, $line );
  $self->process( $tbl->process );
--- 921,929 ----
  my ( $self, $attr, $line ) = @_;
  my $otbl = $self->{Tables}[-1];
!
! my $tbl = HTML::TableParser::Table->new( $self,
! $self->{Tables}[-1]->ids,
  $self->{reqs}, $line );
  $self->process( $tbl->process );
***************
*** 945,951 ****
  my $tbl = pop @{$self->{Tables}};
  undef $tbl;
! $self->process( $self->{Tables}[-1]->process )
! if defined $self->{Tables}[-1];
  }
--- 939,945 ----
  my $tbl = pop @{$self->{Tables}};
  undef $tbl;
! $self->process( $self->{Tables}[-1]->process );
  }

Thanks for your bug report. The "real" problem is that the input HTML is malformed. There is an extra </table> tag in line 1016 of the input which caused the error. It's this line:

1016:</TABLE></TD></TR></TABLE></TD></TR></TABLE><A NAME="_BOTTOM"></A>

I've modified the code to croak if there's an extra end table tag.

I'll upload it to CPAN shortly.

Diab

[guest - Sat Aug 31 13:32:54 2002]:
> The patch attached fixes the following errors, which show up only when
> turning on "use diagnostics":
> Uncaught exception from user code:
> Uncaught exception from user code:
> Modification of non-creatable array value attempted, subscript
> -1 at /usr/lib/perl5/site_perl/5.6.1/HTML/TableParser.pm line 942.
> HTML::TableParser::end_table('HTML::TableParser=HASH(0xa0210f4)',
> undef, 1016) called at
> /usr/lib/perl5/site_perl/5.6.1/HTML/TableParser.pm line 905
> HTML::TableParser::end('HTML::TableParser=HASH(0xa0210f4)',
> 'table', undef, 1016) called at
> /usr/lib/perl5/site_perl/5.6.1/cygwin-multi/HTML/Parser.pm line 104
> eval {...} called at
> /usr/lib/perl5/site_perl/5.6.1/cygwin-multi/HTML/Parser.pm line 104
> HTML::Parser::parse_file('HTML::TableParser=HASH(0xa0210f4)',
> 'EP0210645_1') called at
> /downloads/XiaoJunMa/SoftDev/Perl/scripts/parsePatent.pl line 33
> HTML::Parser::parse_file('HTML::TableParser=HASH(0xa0210f4)',
> 'EP0210645_1') called at
> /downloads/XiaoJunMa/SoftDev/Perl/scripts/parsePatent.pl line 33

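The failure in this trace comes from handling an end-table tag when no table is open. The sketch below is not HTML::TableParser's code; it assumes a hypothetical helper, check_table_nesting, that keeps open tables on a stack and croaks on an unmatched end tag, which is roughly the behavior described for the fix.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper (not part of HTML::TableParser): walk a list of
# [ event, line ] pairs and croak on an end tag that has no open table.
# Returns the deepest nesting level seen.
sub check_table_nesting {
    my @events = @_;
    my @open;             # line numbers of currently open <table> tags
    my $max_depth = 0;
    for my $ev (@events) {
        my ( $type, $line ) = @$ev;
        if ( $type eq 'start' ) {
            push @open, $line;
            $max_depth = @open if @open > $max_depth;
        }
        else {
            # Guard before popping, instead of indexing an empty stack.
            croak "extra </table> tag at line $line" unless @open;
            pop @open;
        }
    }
    return $max_depth;
}
```

Wrapping the call in eval { ... } lets a caller report the offending line number from $@ rather than dying inside the parser callbacks.
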
xiao-jun,

What's happening is that there's a row which contains data as well as starts a new table, i.e.:

<tr>
<td> data </td>
<td> <table> .... </table> </td>
</tr>

The parser has to finish off the embedded table before it can finish off the row. So, the data for the embedded table are returned before the data for the enclosing row. There's no way around this, unfortunately.

If you modify your row routine to print out the table id, you'll see that the out-of-order rows are due to embedded tables, as described above.

You should use the table id passed to row() to sort your data, or use the class constructor method to create a new object per table.

diab

[xma@arcturusag.com - Tue Sep 3 12:20:29 2002]:
> Hi Diab,
>
> Thanks for the clarification. I also noticed that the row handlers are fired
> out of line order, for example with the file I attached before:
>
> 828: [:IPC Code:::::C07K 15/00;....]
> 849: [Aug. 1, 1985:DE1985003527568]
> 862: []
> 844: [:Priority Number::::::]
>
> by this code:
> sub row {
> my ( $tbl_id, $line_no, $data, $udata ) = @_;
> print STDERR "$line_no: ", "[", join(":", @$data), "]\n";
> }
>
> This can cause problems for getting the right table cell. Do you have any
> suggestions?
> (I had to save all rows then sort by line number before I use them)?
>
> Thanks,
>
> xiao-jun
>
> -----Original Message-----
> From: via RT [mailto:comment-HTML-TableParser@rt.cpan.org]
> Sent: Tuesday, September 03, 2002 8:10 AM
> To: Xiao-Jun Ma
> Subject: [cpan #1490] TableParser patch
>
> This message about HTML-TableParser was sent to you by DJERIUS via
> rt.cpan.org
>
> Full context and any attached attachments can be found at:
> <URL: http://rt.cpan.org/NoAuth/Bug.html?id=1490 >
>
> Thanks for your bug report. The "real" problem is that the input HTML
> is malformed. There is an extra </table> tag in line 1016 of the input
> which caused the error. It's this line:
>
> 1016:</TABLE></TD></TR></TABLE></TD></TR></TABLE><A NAME="_BOTTOM"></A>
>
> I've modified the code to croak if there's an extra end table tag.
>
> I'll upload it to CPAN shortly.
>
> Diab
>
> [guest - Sat Aug 31 13:32:54 2002]:
>
> > The patch attached fixes the following errors, which show up only when
> > turning on "use diagnostics":
> > Uncaught exception from user code:
> > Uncaught exception from user code:
> > Modification of non-creatable array value attempted, subscript
> > -1 at /usr/lib/perl5/site_perl/5.6.1/HTML/TableParser.pm line 942.
> > HTML::TableParser::end_table('HTML::TableParser=HASH(0xa0210f4)',
> > undef, 1016) called at
> > /usr/lib/perl5/site_perl/5.6.1/HTML/TableParser.pm line 905
> > HTML::TableParser::end('HTML::TableParser=HASH(0xa0210f4)',
> > 'table', undef, 1016) called at
> > /usr/lib/perl5/site_perl/5.6.1/cygwin-multi/HTML/Parser.pm line 104
> > eval {...} called at
> > /usr/lib/perl5/site_perl/5.6.1/cygwin-multi/HTML/Parser.pm line 104
> > HTML::Parser::parse_file('HTML::TableParser=HASH(0xa0210f4)',
> > 'EP0210645_1') called at
> > /downloads/XiaoJunMa/SoftDev/Perl/scripts/parsePatent.pl line 33
> > HTML::Parser::parse_file('HTML::TableParser=HASH(0xa0210f4)',
> > 'EP0210645_1') called at
> > /downloads/XiaoJunMa/SoftDev/Perl/scripts/parsePatent.pl line 33
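
The advice in this thread to key rows on the table id passed to row() can be sketched as follows. This is only an illustration: the callback is invoked by hand rather than by the parser, the helper rows_for and the table ids '1' and '1.1' are made up, and the cell data is abbreviated from the output quoted above.

```perl
use strict;
use warnings;

my %rows_by_table;    # table id => list of [ line number, cell data ]

# Same argument order as the row() callback quoted in this thread.
sub row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    push @{ $rows_by_table{$tbl_id} }, [ $line_no, $data ];
}

# Hypothetical accessor: one table's rows ordered by input line number.
# Call it in list context.
sub rows_for {
    my ($tbl_id) = @_;
    return sort { $a->[0] <=> $b->[0] } @{ $rows_by_table{$tbl_id} || [] };
}

# Simulated callbacks in the reported firing order. The table ids are
# assumptions: line 844 is taken to be the enclosing row whose embedded
# table yields lines 849 and 862 first.
row( '1',   828, ['IPC Code'] );
row( '1.1', 849, [ 'Aug. 1, 1985', 'DE1985003527568' ] );
row( '1.1', 862, [] );
row( '1',   844, ['Priority Number'] );
```

Grouping by table id keeps an embedded table's rows from interleaving with those of the enclosing table, so sorting by line number is only ever applied within a single table.
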