This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 25156
Status: rejected
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: P.J.B.King [...] hw.ac.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug in XML-Twig handling Control Characters?
Date: Mon, 26 Feb 2007 15:57:20 +0000
To: bug-XML-Twig [...] rt.cpan.org
From: Peter King <P.J.B.King [...] hw.ac.uk>
I think there may be a bug in the handling of control characters by XML::Twig.

If the input file contains any character less than <space> except for <tab>, <new line>, <return>, then the input of the file is truncated there. It is not solved by using the 'safe' output filter, because this does not translate the characters in question.

I have attached a perl script that demonstrates this, with its input, the output I get (the script will write the file output.xml) and the version information from perl and uname.

I think this bug has existed for a long time - it certainly manifested itself in Twig3.23.

All in all an excellent piece of code for fiddling with XML though - thanks for your hard work in developing it.

Peter King
<?xml version="1.0"?>
<file>
  <text>
    <t1>a </t1>
    <t2> bbb </t2>
  </text>
  <text>
    <t1>a with CTRL-A :: </t1>
    <t2> bbb </t2>
  </text>
</file>
<?xml version="1.0"?>
<file>
  <text>
    <t1>ab Control-C :: Now octal 222 :&#146;:</t1>
    <t2>Second text OK</t2>
  </text>
  <text>
    <t1>a </t1>
    <t2> bbb </t2>
  </text>
  <text>
    <t1>a with CTRL-A :</t1>
  </text>
</file>
Linux lxpjbk 2.6.18-1.2200.fc5smp #1 SMP Sat Oct 14 17:15:35 EDT 2006 i686 i686 i386 GNU/Linux
#!/usr/bin/perl -w
#
# project selection and updating script
# called by CGI
# takes parameters directory
#                  type (UG|PG|ASE)
#                  staff (true)
#                  debug (true)

use lib "/u1/staff/pjbk/perl_libs/XML-Twig-3.29/blib/lib"; # XML processing library
#use strict;
use Fcntl; # for file locking definitions
use XML::Twig;

$input_file = "input.xml";
$output_file = "output.xml";

# add a new record
$t1_text = "ab Control-C :\cC: Now octal 222 :\222:";
$t2_text = "Second text OK";
my $t1 = new XML::Twig::Elt( 't1',$t1_text);
my $t2 = new XML::Twig::Elt( 't2',$t2_text);
my $text= new XML::Twig::Elt( 'text' , ( $t1, $t2 ));

$twig = new XML::Twig( pretty_print => 'indented', output_filter => 'safe');
$twig -> safe_parsefile( $input_file);
&mydie ( "Problems with input file: $input_file") if $twig == 0;
$root = $twig -> root;
$text -> paste( first_child => $root);

open(OUTPUTFILE, ">" . $output_file) or &mydie("Can't open OUTPUTFILE: $output_file");
$twig->print(\*OUTPUTFILE);
close OUTPUTFILE;
exit;
This is perl, v5.8.8 built for i386-linux-thread-multi

Copyright 1987-2006, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using "man perl" or "perldoc perl". If you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page.
On Mon Feb 26 10:57:51 2007, P.J.B.King@hw.ac.uk wrote:
> I think there may be a bug in the handling of control characters by
> XML::Twig
>
> If the input file contains any character less than <space> except for
> <tab>, <new line>, <return>, then the input of the file is truncated
> there.

I am not sure I quite understand. Characters like CTRL-A are not allowed in XML (see the spec), so the parser (expat, 2 layers below XML::Twig) barfs when it finds them. xmlwf correctly reports that your file is not well-formed XML, and XML::Twig stops processing as soon as it finds the illegal character. All XML processors are required to do this.

Am I missing something?

(and sorry for the late answer, but for a while RT stopped sending notifications when new tickets were created).

__
mirod
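
To make the point about illegal characters concrete, here is a minimal sketch of a check against the "Char" production of the XML 1.0 spec (tab, line feed, carriage return, and the ranges from U+0020 upward, minus surrogates and U+FFFE/U+FFFF). The helper name `first_invalid_xml_char` is made up for illustration and is not part of XML::Twig or expat.

```perl
use strict;
use warnings;

# Anything outside the XML 1.0 "Char" production (section 2.2 of the spec).
my $NOT_XML_CHAR =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: return (offset, code point) of the first character
# that XML 1.0 forbids, or nothing if the string is clean.
sub first_invalid_xml_char {
    my ($text) = @_;
    return unless $text =~ $NOT_XML_CHAR;
    return ( $-[0], ord substr( $text, $-[0], 1 ) );
}

my ( $pos, $code ) = first_invalid_xml_char("a with CTRL-A :\cA:");
printf "illegal character U+%04X at offset %d\n", $code, $pos
    if defined $pos;
# prints: illegal character U+0001 at offset 15
```

Note that octal 222 (U+0092) falls inside the allowed range, which is consistent with the script's output above: the 'safe' filter writes it as &#146;, while Control-C and CTRL-A have no legal representation in XML 1.0.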
Subject: Re: [rt.cpan.org #25156] Bug in XML-Twig handling Control Characters?
Date: Mon, 12 Mar 2007 12:52:58 +0000
To: bug-XML-Twig [...] rt.cpan.org
From: Peter King <P.J.B.King [...] hw.ac.uk>
No, I think you are right, but I'm surprised that the parse of the input file doesn't complain, but produces a well formed tree, although it is not DTD compliant. I suppose I found it surprising that XML-Twig would write files that it couldn't subsequently read back, so it's possible to write files that are not well formed.

Feel free to remove the "bug".

Peter
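
A possible reason the parse "doesn't complain": the attached script tests `$twig == 0`, which compares the twig object itself and so appears never to be true, whereas the XML::Twig documentation describes safe_parsefile as returning the twig on success and 0 on failure, with the error left in `$@`. A minimal sketch of checking that return value follows; `parse_or_die` is a hypothetical helper name, and the sketch assumes XML::Twig is installed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse a file and return the twig, or die with the
# parser's own error message.
sub parse_or_die {
    my ($file) = @_;
    my $twig = XML::Twig->new( pretty_print => 'indented' );

    # safe_parsefile wraps the parse in an eval; a false return value
    # signals failure and $@ holds the error message.
    $twig->safe_parsefile($file)
        or die "Problems with input file $file: $@";

    return $twig;
}

# Usage (assumes an input.xml next to the script):
#   my $twig = parse_or_die("input.xml");
```

With a check like this, an input file containing CTRL-A should stop the script with expat's "not well-formed" message instead of silently continuing with a partial tree.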
Subject: Re: [rt.cpan.org #25156] Bug in XML-Twig handling Control Characters?
Date: Mon, 12 Mar 2007 16:28:41 +0100
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <mirod [...] xmltwig.com>
Expat is not a validating parser, so the whole stack, including XML::Twig, doesn't know anything about validity, only well-formedness.

And there are indeed many ways to get XML::Twig to output non well-formed XML. It is usually quite easy to get it to output well-formed XML, but indeed Perl happily lets you use the whole Unicode character set, as it should, while XML excludes some of it. You could use an output filter to make sure that you filter out improper characters before they're written.

--
mirod
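
As a rough illustration of the output-filter suggestion, the sketch below drops every character that the XML 1.0 "Char" production excludes. The function name `strip_invalid_xml_chars` is hypothetical, and silently deleting characters is only one possible policy; replacing them with a visible marker would be another.

```perl
use strict;
use warnings;

# Hypothetical filter: remove characters that XML 1.0 does not allow,
# leaving everything else (including markup) untouched.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

print strip_invalid_xml_chars("a with CTRL-A :\cA:"), "\n";
# prints: a with CTRL-A ::
```

Assuming the output_filter option accepts a subroutine reference, as the XML::Twig documentation describes, the function could be wired in with something like `XML::Twig->new( output_filter => \&strip_invalid_xml_chars )`; that wiring is untested here.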