
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 18965
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: code [...] yaakovnet.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.52
Fixed in: (no value)



Subject: <script/> leads to ignoring <script> events
First of all, thank you very much for integrating the fix for bug 18936 so quickly into release 3.53. This bug applies to both the 3.52 and the 3.53 releases (the input form just does not offer the new version number yet).

*** Problem ***

After declaring

  $p->empty_element_tags(1);
  $p->ignore_elements("script","x");

the tag <x/> works correctly, like <x></x>. However, the tag <script/> confuses the parser: a following <script> tag is ignored and left in the text event!

The attached test script runs a few sample strings through the parser with the above settings and prints the text, tag and event values. The first example demonstrates the bug. The following examples demonstrate that the <x/> and <y/> tags work correctly according to the documentation:

*** Tests with version 3.53 ****

================ Parse: <script/>A<script>B</script>C ================
''                    start_document
'A<script>B'          text
''                    end_document

================ Parse: <x/>A<x>B</x>C ================
''                    start_document
'A'                   text
'C'                   text
''                    end_document

================ Parse: <y/>A<y>B</y>C ================
''                    start_document
'<y/>'        <y>     start
''            </y>    end
'A'                   text
'<y>'         <y>     start
'B'                   text
'</y>'        </y>    end
'C'                   text
''                    end_document

================ Parse: </x>A ================
''                    start_document
''                    end_document

www@kranich:~/111$ perl test.pl

================ Parse: <script/>A<script>B</script>C ================
''                    start_document
'A<script>B'          text
'C'                   text
''                    end_document

================ Parse: <x/>A<x>B</x>C ================
''                    start_document
'A'                   text
'C'                   text
''                    end_document

================ Parse: <y/>A<y>B</y>C ================
''                    start_document
'<y/>'        <y>     start
''            </y>    end
'A'                   text
'<y>'         <y>     start
'B'                   text
'</y>'        </y>    end
'C'                   text
''                    end_document

================ Parse: </x>A ================
''                    start_document
'A'                   text
''                    end_document

For your reference, I ran the same script with version 3.52.
We find that the two bugs are not related: the output shows both the effects of this bug and the effects of bug 18936:

================ Parse: <script/>A<script>B</script>C ================
''                    start_document
'A<script>B'          text
''                    end_document

================ Parse: <x/>A<x>B</x>C ================
''                    start_document
'A'                   text
'C'                   text
''                    end_document

================ Parse: <y/>A<y>B</y>C ================
''                    start_document
'<y/>'        <y>     start
''            </y>    end
'A'                   text
'<y>'         <y>     start
'B'                   text
'</y>'        </y>    end
'C'                   text
''                    end_document

================ Parse: </x>A ================
''                    start_document
''                    end_document

www@kranich:~/111$ perl -Mblib=HTML-Parser-3.52/lib/ test.pl

================ Parse: <script/>A<script>B</script>C ================
''                    start_document
'A<script>B'          text
''                    end_document

================ Parse: <x/>A<x>B</x>C ================
''                    start_document
'A'                   text
'C'                   text
''                    end_document

================ Parse: <y/>A<y>B</y>C ================
''                    start_document
'<y/>'        <y>     start
''            </y>    end
'A'                   text
'<y>'         <y>     start
'B'                   text
'</y>'        </y>    end
'C'                   text
''                    end_document

================ Parse: </x>A ================
''                    start_document
''                    end_document

This time, I don't have a fix.

Best regards,
Yaakov Belch
Subject: test.pl
#!/usr/bin/perl -w

use HTML::Parser ();

my $p;
$p = HTML::Parser->new(api_version => 3);
$p->empty_element_tags(1);
$p->ignore_elements("script", "x");
$p->handler("default" => sub {
    my ($event, $text, $tag) = @_;
    $tag = $tag ? "<$tag>" : "";
    print "'$text'\t$tag\t$event\n";
}, "event,text,tag");

for my $text (
    '<script/>A<script>B</script>C',
    '<x/>A<x>B</x>C',
    '<y/>A<y>B</y>C',
    '</x>A'
) {
    print "\n================ Parse: $text ================\n";
    $p->parse($text)->eof;
}
Good catch! The empty_element_tags feature interacts badly with literal mode, but the fix was easy. See attached patch. I'll upload 3.54 today :)
Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.129
diff -u -p -r2.129 hparser.c
--- hparser.c 27 Apr 2006 11:44:00 -0000 2.129
+++ hparser.c 28 Apr 2006 07:47:37 -0000
@@ -1383,8 +1383,7 @@ parse_start(PSTATE* p_state, char *beg,
     report_event(p_state, E_START, beg, s, utf8, tokens, num_tokens, self);
     if (empty_tag)
         report_event(p_state, E_END, s, s, utf8, tokens, 1, self);
-
-    if (!p_state->xml_mode) {
+    else if (!p_state->xml_mode) {
         /* find out if this start tag should put us into literal_mode
          */
         int i;
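
For readers following the thread, here is a minimal sketch of why turning the second `if` into an `else if` matters. It is a toy model under stated assumptions, not code from hparser.c: every `toy_` name is hypothetical, the list of literal tags is an illustrative subset, and the real parser tracks far more state. The only point is that before the patch an empty `<script/>` still fell through to the literal-mode check, so the parser kept waiting for `</script>` and swallowed the next `<script>` as text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative subset of tags whose content is parsed as literal text
 * (an assumption for this sketch, not the parser's real table). */
static const char *const toy_literal_tags[] = { "script", "style" };

static bool toy_is_literal_tag(const char *name)
{
    size_t n = sizeof toy_literal_tags / sizeof toy_literal_tags[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, toy_literal_tags[i]) == 0)
            return true;
    return false;
}

/*
 * Returns true if the toy parser is left in literal mode after a start tag.
 * patched == false models: if (empty_tag) ...;      if (!xml_mode) { ... }
 * patched == true  models: if (empty_tag) ...; else if (!xml_mode) { ... }
 */
bool toy_enters_literal_mode(const char *name, bool empty_tag,
                             bool xml_mode, bool patched)
{
    if (patched && empty_tag)
        return false;   /* end event already reported; skip the literal check */
    return !xml_mode && toy_is_literal_tag(name);
}

/* Prints the before/after behaviour for the <script/> case from this ticket. */
void toy_demo(void)
{
    bool before = toy_enters_literal_mode("script", true, false, false);
    bool after  = toy_enters_literal_mode("script", true, false, true);

    assert(before && !after);
    printf("<script/> before patch: literal mode %s\n", before ? "on" : "off");
    printf("<script/> after patch:  literal mode %s\n", after ? "on" : "off");
}
```

Calling `toy_demo()` from a `main` function prints the two states. A non-empty `<script>` still enters literal mode in the patched model, which matches the intent that only the self-closing form is affected.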