
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 86773
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: melmothx [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 3.45



Subject: End of CDATA always escaped?
Date: Mon, 08 Jul 2013 14:11:12 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>
It looks like the end of CDATA is unconditionally escaped. See the following test script:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;
use Test::More;
plan tests => 3;

my $html = <<'EOF';
<div id="body">body</div>
<script>
//<![CDATA[
if ( this.value && ( !request.term || matcher.test(text) ) && 1 > 0 && 0 < 1 )
//]]>
</script>
EOF

my $parser = XML::Twig->new();

my $xml = $parser->safe_parse_html($html);
print $@ if $@;

my @cdata = $xml->get_xpath('#CDATA');
ok(@cdata > 0);


my @elts = $xml->get_xpath('//script');

foreach my $el (@elts) {
  $el->set_asis;
  diag $el->text;
  ok(((index $el->text, "//]]>") >= 0), "end of cdata ok");
}

ok(((index $xml->sprint, "//]]>") >= 0), "end of cdata ok");
diag $xml->sprint;


__END__

Besides the fact that the CDATA is not parsed as such (probably because of the HTML->XML conversion, which I can live with), it seems that the text marked as "AS IS" is escaped during output.

The culprit seems to be line 8543 in the latest version:

if( ! $elt->{extra_data_in_pcdata})
  {
    $string=~ s/([$replaced_ents])/$XML::Twig::base_ent{$1}/g unless( !$replaced_ents || $keep_encoding || $elt->{asis});
    $string=~ s{\Q]]>}{]]&gt;}g; ### why is always replaced?
  }

but I could be wrong.

Thanks in advance.

Best wishes

--
Marco
Hi,

Again, sorry for the late response, I did not get an email alert for the bug.

There was a bug in the conversion from HTML to XML using HTML::TreeBuilder (the default): a reversed test caused <![CDATA[ sections to be escaped (including the CDATA markers) and not the rest. It's fixed in the next release.

In the test you sent, get_xpath( '#CDATA') should be get_xpath( '//#CDATA'); as it is, no CDATA element was found.

Doing set_asis on a CDATA section does not work the way you think it does. It does not add the CDATA tags. The text in the CDATA section will not be escaped, but the opening and closing marks ('<![CDATA[' and ']]>') are not added. I think that's a feature. What would the reason be to do set_asis on the element?

A modified version of your test is part of t/test_3_45.t

Thanks

--
mirod
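
The escaping behavior discussed in this ticket can be illustrated in isolation with a minimal sketch. This is not XML::Twig's code: escape_text, %base_ent and the $asis argument are hypothetical stand-ins modeled on the snippet quoted in the report, assuming that the entity replacement is skipped for as-is text while the ]]> replacement runs unconditionally.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for the entity table referenced in the quoted snippet.
my %base_ent = ( '&' => '&amp;', '<' => '&lt;' );

# Sketch of the two substitutions from the report (not the module's own code).
sub escape_text {
    my ( $string, $asis ) = @_;

    # Entity replacement is skipped for text flagged "as is".
    $string =~ s/([&<])/$base_ent{$1}/g unless $asis;

    # This substitution runs regardless of the flag.
    $string =~ s{\Q]]>\E}{]]&gt;}g;

    return $string;
}

print escape_text( 'a < b //]]>', 0 ), "\n";    # a &lt; b //]]&gt;
print escape_text( 'a < b //]]>', 1 ), "\n";    # a < b //]]&gt;
```

With the flag set, the < survives but the closing marker still comes out as ]]&gt;, which matches the symptom described in the report.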