Skip Menu |

This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 123479
Status: open
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: jkahrman [...] mathworks.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.0016
Fixed in: (no value)



Subject: XML::LibXML whitespace settings can leak between parsers
Parsing with one parser that ignores whitespace can cause other parsers to ignore whitespace even when they have not requested it. It appears to be a result of the code reintroduced in 1.97 for https://rt.cpan.org/Public/Bug/Display.html?id=76696 See attached test that has been developed trying to reliably reproduce the issue. The "no_blanks" setting appears to be getting latched in the parse_string calls in a weird way that we don't entirely understand. We have a potential patch that I'll be posting shortly. It involves resetting the xmlKeepBlanksDefaultValue after the parsing has completed. We're still working on testing it as we don't fully understand the underlying issue.
Subject: 30blanks.t
#!/usr/bin/perl # whitespace (no_blanks) bleeds from parser to parser # Bottom line on top: no_blanks is latched not in the instance # but somewhere else (likely the c code) # in a weird way # that I've not been able to figure out how to fix or work-around. use warnings; use strict; use Test::More tests => 23; use XML::LibXML; my $chunk = "<foo>\n" . " <bar/>\n" . "</foo>\n"; my $sampleXML = qq{<?xml version="1.0" encoding="utf-8"?>\n$chunk}; my $exp_wsp = "\n \n"; # WEIRD SEQUENCE I DONT UNDERSTAND assert_text_content([], $sampleXML, $exp_wsp); # 1. shows whitespace as expected assert_text_content([no_blanks=>1], $sampleXML, ''); # 2. no whitespace as expected assert_text_content([], $sampleXML, $exp_wsp); # 3. The known bug: should but doesn't show whitespace assert_text_content([], $sampleXML, $exp_wsp); # 4. Surprise: shows whitespace assert_text_content([no_blanks=>1], $sampleXML, ''); # 5. as expected: no whitespace assert_text_content([no_blanks=>0], $sampleXML, $exp_wsp); # 6. frustrated: doesn't show whitespace # (as sequencd below hits maybe it might) assert_text_content([], $sampleXML, $exp_wsp); # 7. now (!!?) it shows whitespace # The following sequence had me believing that explicitly new'ing up # with no_blanks=>0 would work around the fault. my $exp_no_wsp_chunk = qq{<foo><bar/></foo>\n}; my $exp_no_wsp = qq{<?xml version="1.0" encoding="utf-8"?>\n$exp_no_wsp_chunk}; my $no_blanks_parser = XML::LibXML->new(no_blanks => 1); my $d1 = $no_blanks_parser->parse_string($sampleXML); is($d1, $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string'); is($no_blanks_parser->parse_balanced_chunk($chunk), $exp_no_wsp_chunk, 'XML::LibXML->new(no_blanks => 1) parse_balanced_chunk'); my $no_args_parser = XML::LibXML->new; my $d2 = $no_args_parser->parse_string($sampleXML); is($d2, $sampleXML, 'XML::LibXML->new parse_string'); my $s2 = $d2->textContent; is($s2, $exp_wsp, 'XML::LibXML->new textContent'); # OKay, here's some REALLY weird stuff (same instance, new parse): my $d2b = $no_args_parser->parse_string($sampleXML); my $s2b = $d2b->textContent; is($s2b, $exp_wsp, 'XML::LibXML->new textContent'); # Note, new'ing up another parser with no_blanks=>0, # (w/out running parse_string), # is NOT enough to break D1 or fix D2: my $blanks_parser = XML::LibXML->new(no_blanks => 0); my $d1b = $no_blanks_parser->parse_string($sampleXML); is($d1b, $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string 2'); my $d2c = $no_args_parser->parse_string($sampleXML); is($d2c, $sampleXML, 'XML::LibXML->new parse_string 2'); # But WAIT: if we re-parse (same instance) we get whitespace now my $d2d = $no_args_parser->parse_string($sampleXML); is($d2d, $sampleXML, 'XML::LibXML->new parse_string 3'); is($d2d->textContent, $exp_wsp, 'XML::LibXML->new textContent 2'); # Note, newing up with no_blanks=>0, AND running parse_string, # turned off the "no_blanks" for this parser (has textContent) # BUT did not break D1 or fix D2: # D3 (FIXED: has whitespace): $d3 my $d3 = $blanks_parser->parse_string($sampleXML); is($d3, $sampleXML, 'XML::LibXML->new(no_blanks => 0) parse_string'); my $s3 = $d3->textContent; is($s3, $exp_wsp, 'XML::LibXML->new(no_blanks => 0) textContent'); # Redo D1 (still no_blanks=GOOD) is( $no_blanks_parser->parse_string($sampleXML), $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string 3' ); # Redo D2 (re-broke no_blanks=?) is( $no_args_parser->parse_string($sampleXML), $sampleXML, 'XML::LibXML->new parse_string 4' ); # Earlier experiments had suggested a workaround might be to edit # XML::LibXML to always set no_blanks to something, even zero. # Now we believe the latching is in the parse_string # We wanted to confirm setting the 3rd instance to no_blanks=>0 # does not mess up the original's =>1. # We now believe the D's are not changed (??) # Redo S1 (still no_blanks) is($d1->textContent, '', 'XML::LibXML->new(no_blanks => 1) textContent'); my $d4 = $no_blanks_parser->parse_string($sampleXML); # This latches the no_blanks from P1 my $s4 = $d4->textContent; # There should be NO newline in S4 is($s4, '', 'XML::LibXML->new(no_blanks => 1) textContent 2'); # Redo S1 (still no_blanks) is($d1->textContent, '', 'XML::LibXML->new(no_blanks => 1) textContent 3'); exit; sub assert_text_content { my ( $args, $xml, $exp ) = @_; $args ||= []; my $ans = XML::LibXML->new(@$args)->parse_string($xml)->textContent; is($ans, $exp, "XML::LibXML->new(@$args)->parse_string->textContent"); return; }
From: jkahrman [...] mathworks.com
After adding a few more tests, the patch appears to work.
Subject: no_blanks_latch.patch
--- a/LibXML.xs 2016-11-03 00:16:14.000000000 -0400 +++ b/LibXML.xs 2016-11-03 00:16:14.000000000 -0400 @@ -151,6 +151,9 @@ /* this should keep the default */ static xmlExternalEntityLoader LibXML_old_ext_ent_loader = NULL; +/* Way to save original value of xmlKeepBlanksDefaultValue */ +static int orig_xmlKeepBlanksDefault = 1; + /* global external entity loader */ SV *EXTERNAL_ENTITY_LOADER_FUNC = (SV *)NULL; @@ -928,22 +931,26 @@ if ((parserOptions & XML_PARSE_DTDLOAD) == 0) { parserOptions &= ~(XML_PARSE_DTDVALID | XML_PARSE_DTDATTR | XML_PARSE_NOENT ); } - if (ctxt) xmlCtxtUseOptions(ctxt, parserOptions ); /* Note: sets ctxt->linenumbers = 1 */ - /* - * Without this if/else conditional, NOBLANKS has no effect. - * - * For more information, see: - * - * https://rt.cpan.org/Ticket/Display.html?id=76696 - * - * */ - if (parserOptions & XML_PARSE_NOBLANKS) { - xmlKeepBlanksDefault(0); + if (ctxt) { + xmlCtxtUseOptions(ctxt, parserOptions ); /* Note: sets ctxt->linenumbers = 1 */ + } else { + /* + * Without this if/else conditional, NOBLANKS has no effect when + * there is no context. + * + * For more information, see: + * + * https://rt.cpan.org/Ticket/Display.html?id=76696 + * + * */ + if (parserOptions & XML_PARSE_NOBLANKS) { + orig_xmlKeepBlanksDefault = xmlKeepBlanksDefault(0); + } + else { + orig_xmlKeepBlanksDefault = xmlKeepBlanksDefault(1); + } } - else { - xmlKeepBlanksDefault(1); - } item = hv_fetch( real_obj, "XML_LIBXML_LINENUMBERS", 22, 0 ); if ( item != NULL && SvTRUE(*item) ) { @@ -980,6 +987,7 @@ #ifndef WITH_SERRORS xmlGetWarningsDefaultValue = 0; #endif + xmlKeepBlanksDefault(orig_xmlKeepBlanksDefault); if (EXTERNAL_ENTITY_LOADER_FUNC == NULL && LibXML_old_ext_ent_loader != NULL) { xmlSetExternalEntityLoader( (xmlExternalEntityLoader)LibXML_old_ext_ent_loader ); --- a/t/30blanks.t 1969-12-31 19:00:00.000000000 -0500 +++ b/t/30blanks.t 2017-11-01 13:17:14.863557661 -0400 @@ -0,0 +1,206 @@ +#!/usr/bin/perl + +# whitespace (no_blanks) bleeds from parser to parser + +# Bottom line on top: no_blanks is latched not in the instance +# but somewhere else (likely the c code) +# in a weird way +# that I've not been able to figure out how to fix or work-around. + +use warnings; +use strict; + +use Test::More tests => 48; + +use XML::LibXML; + +my $chunk + = "<foo>\n" + . " <bar/>\n" + . "</foo>"; + +my $sampleXML + = qq{<?xml version="1.0" encoding="utf-8"?>\n$chunk\n}; + +my $exp_wsp = "\n \n"; + +# WEIRD SEQUENCE I DONT UNDERSTAND + +assert_text_content([], $sampleXML, $exp_wsp); # 1. shows whitespace as expected +assert_text_content([no_blanks=>1], $sampleXML, ''); # 2. no whitespace as expected +assert_text_content([], $sampleXML, $exp_wsp); # 3. The known bug: should but doesn't show whitespace +assert_text_content([], $sampleXML, $exp_wsp); # 4. Surprise: shows whitespace +assert_text_content([no_blanks=>1], $sampleXML, ''); # 5. as expected: no whitespace +assert_text_content([no_blanks=>0], $sampleXML, $exp_wsp); # 6. frustrated: doesn't show whitespace + # (as sequencd below hits maybe it might) +assert_text_content([], $sampleXML, $exp_wsp); # 7. now (!!?) it shows whitespace + +# The following sequence had me believing that explicitly new'ing up +# with no_blanks=>0 would work around the fault. + +my $exp_no_wsp_chunk = qq{<foo><bar/></foo>}; +my $exp_no_wsp = qq{<?xml version="1.0" encoding="utf-8"?>\n$exp_no_wsp_chunk\n}; + +my $no_blanks_parser = XML::LibXML->new(no_blanks => 1); +my $d1 = $no_blanks_parser->parse_string($sampleXML); +is($d1, $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string'); +is($no_blanks_parser->parse_balanced_chunk($chunk), $exp_no_wsp_chunk, 'XML::LibXML->new(no_blanks => 1) parse_balanced_chunk'); +is($no_blanks_parser->parse_string($sampleXML), $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string 2'); + +my $no_args_parser = XML::LibXML->new; +my $d2 = $no_args_parser->parse_string($sampleXML); +is($d2, $sampleXML, 'XML::LibXML->new parse_string'); + +my $s2 = $d2->textContent; +is($s2, $exp_wsp, 'XML::LibXML->new textContent'); + +# OKay, here's some REALLY weird stuff (same instance, new parse): +my $d2b = $no_args_parser->parse_string($sampleXML); +my $s2b = $d2b->textContent; +is($s2b, $exp_wsp, 'XML::LibXML->new textContent'); + +# Note, new'ing up another parser with no_blanks=>0, +# (w/out running parse_string), +# is NOT enough to break D1 or fix D2: + +my $blanks_parser = XML::LibXML->new(no_blanks => 0); + +my $d1b = $no_blanks_parser->parse_string($sampleXML); +is($d1b, $exp_no_wsp, 'XML::LibXML->new(no_blanks => 1) parse_string 3'); + +my $d2c = $no_args_parser->parse_string($sampleXML); +is($d2c, $sampleXML, 'XML::LibXML->new parse_string 2'); + +# But WAIT: if we re-parse (same instance) we get whitespace now +my $d2d = $no_args_parser->parse_string($sampleXML); +is($d2d, $sampleXML, 'XML::LibXML->new parse_string 3'); +is($d2d->textContent, $exp_wsp, 'XML::LibXML->new textContent 2'); + +# Note, newing up with no_blanks=>0, AND running parse_string, +# turned off the "no_blanks" for this parser (has textContent) +# BUT did not break D1 or fix D2: + +# D3 (FIXED: has whitespace): $d3 +my $d3 = $blanks_parser->parse_string($sampleXML); +is($d3, $sampleXML, 'XML::LibXML->new(no_blanks => 0) parse_string'); + +my $s3 = $d3->textContent; +is($s3, $exp_wsp, 'XML::LibXML->new(no_blanks => 0) textContent'); + +# Redo D1 (still no_blanks=GOOD) +is( + $no_blanks_parser->parse_string($sampleXML), $exp_no_wsp, + 'XML::LibXML->new(no_blanks => 1) parse_string 4' +); + +# Redo D2 (re-broke no_blanks=?) +is( + $no_args_parser->parse_string($sampleXML), $sampleXML, + 'XML::LibXML->new parse_string 4' +); + +# Earlier experiments had suggested a workaround might be to edit +# XML::LibXML to always set no_blanks to something, even zero. + +# Now we believe the latching is in the parse_string + +# We wanted to confirm setting the 3rd instance to no_blanks=>0 +# does not mess up the original's =>1. +# We now believe the D's are not changed (??) + +# Redo S1 (still no_blanks) +is($d1->textContent, '', 'XML::LibXML->new(no_blanks => 1) textContent'); + +my $d4 = $no_blanks_parser->parse_string($sampleXML); # This latches the no_blanks from P1 +my $s4 = $d4->textContent; +# There should be NO newline in S4 +is($s4, '', 'XML::LibXML->new(no_blanks => 1) textContent 2'); + +# Redo S1 (still no_blanks) +is($d1->textContent, '', 'XML::LibXML->new(no_blanks => 1) textContent 3'); + +###################################### +# | Parser context +# |------------------------- +# | parse_string | parse_balanced_chunk +# | (has context) | (no context) +#------------|---------------|--------- +# Blanks | Case B1 | Case B2 +# No Blanks | Case A1 | Case A2 + +# Cases: +# A1, B1 +assert_parse_string($no_blanks_parser, $sampleXML, ''); +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); + +# B1, A1 +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); +assert_parse_string($no_blanks_parser, $sampleXML, ''); + +# A1, A2 +assert_parse_string($no_blanks_parser, $sampleXML, ''); +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); + +# A2, A1 +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); +assert_parse_string($no_blanks_parser, $sampleXML, ''); + +# B1, A2 +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); + +# A2, B1 +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); + +# A1, B2 +assert_parse_string($no_blanks_parser, $sampleXML, ''); +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); + +# B2, A1 +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); +assert_parse_string($no_blanks_parser, $sampleXML, ''); + +# B1, B2 +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); + +# B2, B1 +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); +assert_parse_string($blanks_parser, $sampleXML, $exp_wsp); + +# A2, B2 +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); + +# B2, A2 +assert_parse_balanced_chunk($blanks_parser, $chunk, $exp_wsp); +assert_parse_balanced_chunk($no_blanks_parser, $chunk, ''); + +exit; + +sub assert_parse_string { + my ( $parser, $xml, $exp ) = @_; + my $ans = $parser->parse_string($xml)->textContent; + my $blanks = $parser->keep_blanks ? 1 : 0; + my $case = $blanks ? 'B1' : 'A1'; + is($ans, $exp, "Case: $case; blanks => $blanks: parse_string (has context)"); + return; +} + +sub assert_parse_balanced_chunk { + my ( $parser, $xml, $exp ) = @_; + my $ans = $parser->parse_balanced_chunk($xml)->textContent; + my $blanks = $parser->keep_blanks ? 1 : 0; + my $case = $blanks ? 'B2' : 'A2'; + is($ans, $exp, "Case: $case; blanks => $blanks: parse_balanced_chunk (has no context)"); + return; +} + +sub assert_text_content { + my ( $args, $xml, $exp ) = @_; + $args ||= []; + my $ans = XML::LibXML->new(@$args)->parse_string($xml)->textContent; + is($ans, $exp, "XML::LibXML->new(@$args)->parse_string->textContent"); + return; +}
Subject: Re: [rt.cpan.org #123479] XML::LibXML whitespace settings can leak between parsers
Date: Wed, 1 Nov 2017 23:52:45 +0200
To: bug-XML-LibXML [...] rt.cpan.org
From: Shlomi Fish <shlomif [...] shlomifish.org>
On Wed, 1 Nov 2017 12:33:38 -0400 "jkahrman@mathworks.com via RT" <bug-XML-LibXML@rt.cpan.org> wrote: Show quoted text
> Wed Nov 01 12:33:36 2017: Request 123479 was acted upon. > Transaction: Ticket created by jkahrman@mathworks.com > Queue: XML-LibXML > Subject: XML::LibXML whitespace settings can leak between parsers > Broken in: 2.0016 > Severity: Normal > Owner: Nobody > Requestors: jkahrman@mathworks.com > Status: new > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=123479 > > > > Parsing with one parser that ignores whitespace can cause other parsers to > ignore whitespace even when they have not requested it. > > It appears to be a result of the code reintroduced in 1.97 for > https://rt.cpan.org/Public/Bug/Display.html?id=76696 > > See attached test that has been developed trying to reliably reproduce the > issue. The "no_blanks" setting appears to be getting latched in the > parse_string calls in a weird way that we don't entirely understand. > > We have a potential patch that I'll be posting shortly. It involves resetting > the xmlKeepBlanksDefaultValue after the parsing has completed. We're still > working on testing it as we don't fully understand the underlying issue.
Thanks! Please see http://www.shlomifish.org/open-source/resources/how-to-contribute-to-my-projects/ and http://perl-begin.org/tutorials/bad-elements/ and also https://metacpan.org/pod/Test::Builder#Test-style regarding the correct style.