This queue is for tickets about the XML-Generator CPAN distribution.

Report information
The Basics
Id: 69368
Status: resolved
Priority: 0/
Queue: XML-Generator

People
Owner: Nobody in particular
Requestors: g111 [...] netcologne.de
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 1.01
Fixed in: (no value)



Subject: Should filter invalid characters
Hello,

if you pass a string to XML::Generator that contains invalid characters, the generated XML is invalid, too. You can test this via the following command:

$ perl -MXML::Generator -e '$xg=XML::Generator->new(escape=>'always,even-entities',conformance=>'strict'); print $xg->foo("abc\0xyz"), "\n"' | xmllint -
-:1: parser error : Char 0x0 out of allowed range
<foo>abc
        ^
-:1: parser error : Premature end of data in tag foo line 1
<foo>abc
        ^

It seems that some characters are not allowed in UTF-8 XML, even if you try to escape them:

$ echo "<foo>abc&#x00;xyz</foo>" | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 0
<foo>abc&#x00;xyz</foo>

So I see two possible solutions:
1) The developer who is using XML::Generator must filter all invalid characters before passing them to XML::Generator.
2) XML::Generator offers a comfortable way to let these invalid characters be filtered automatically.

I'd prefer way 2) e.g. by turning on an option in the constructor of XML::Generator like "filter_invalid_chars => 1".

The characters that should be filtered seem to be (but they are missing the \0 null character)
http://stackoverflow.com/questions/1016910/how-can-i-strip-invalid-xml-characters-from-strings-in-perl

"The complete regex for removal of invalid xml-1.0 characters is:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;

for xml-1.1 it is:
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
"

The following chars are allowed in XML 1.0, all others should be filtered:
http://www.w3.org/TR/xml/#charsets

And here for XML 1.1:
http://www.w3.org/TR/xml11/#charsets

Distribution: self compiled version 1.01
perl v5.8.5

and also tested with debian/sid libxml-generator-perl 1.01-3 perl v5.12.4

What do you think about this topic?

Greetings,
Gert
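
For illustration, option 1) from the report (the caller filters before handing text to XML::Generator) could look roughly like the sketch below. It reuses the XML 1.0 regex quoted above; the helper name strip_invalid_xml10_chars is hypothetical and not part of XML::Generator.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Generator): remove every character
# that falls outside the XML 1.0 Char production.
# Allowed: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
sub strip_invalid_xml10_chars {
    my ($str) = @_;
    # Negated character class: anything NOT in the allowed ranges is deleted.
    $str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# "\0" is a NUL character followed by the literal text "xyz".
my $clean = strip_invalid_xml10_chars("abc\0xyz");
print "<foo>$clean</foo>\n";
```

Because the character class is negated, the NUL character is removed as well: it is simply not in the allowed list. The /o modifier from the quoted regex is dropped here since the pattern contains no interpolated variables.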
Subject: Re: [rt.cpan.org #69368] Should filter invalid characters
Date: Mon, 11 Jul 2011 08:27:33 -0400 (GMT-04:00)
To: bug-XML-Generator [...] rt.cpan.org, undisclosed-recipients [...] null, null [...] null
From: bholzman [...] earthlink.net
Gert,

Thanks for the very detailed bug report. I agree with your analysis and also your suggested fix (and I think I will probably make this filtering the default behavior with :strict enabled).

When this issue first came up a long time ago, I decided against doing anything because I didn't want to prevent people from generating invalid xml if they really wanted to (since there are undoubtedly many applications out there that use xml with invalid characters), but providing a way to turn it off addresses that need.

-----Original Message-----
>From: Gert Brinkmann via RT <bug-XML-Generator@rt.cpan.org>
>Sent: Jul 8, 2011 1:07 PM
>To: undisclosed-recipients@null, null@null
>Subject: [rt.cpan.org #69368] Should filter invalid characters
>
>Fri Jul 08 13:07:43 2011: Request 69368 was acted upon.
>Transaction: Ticket created by gbrinkmann
>       Queue: XML-Generator
>     Subject: Should filter invalid characters
>   Broken in: 1.01
>    Severity: Wishlist
>       Owner: Nobody
>  Requestors: g111@netcologne.de
>      Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=69368 >
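
The behavior discussed in the reply (filter by default under strict conformance, with a way to turn it off) might be sketched as follows. This is only an illustration under stated assumptions, not XML::Generator's actual code: should_filter and prepare_text are hypothetical helpers, and filter_invalid_chars is the option name suggested in the report, not a confirmed API.

```perl
use strict;
use warnings;

# Hypothetical sketch: decide whether to filter from constructor-style options.
# An explicit filter_invalid_chars value always wins; otherwise filtering is
# on only when conformance is 'strict'.
sub should_filter {
    my (%opt) = @_;
    return $opt{filter_invalid_chars} ? 1 : 0
        if defined $opt{filter_invalid_chars};
    return (($opt{conformance} || '') eq 'strict') ? 1 : 0;
}

# Hypothetical sketch: strip characters outside the XML 1.0 Char production
# when the options ask for it, and pass the text through unchanged otherwise.
sub prepare_text {
    my ($text, %opt) = @_;
    $text =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g
        if should_filter(%opt);
    return $text;
}

# Strict conformance: the NUL is removed, leaving 6 characters.
print length(prepare_text("abc\0xyz", conformance => 'strict')), "\n";
# Explicit opt-out: the NUL is kept, leaving 7 characters.
print length(prepare_text("abc\0xyz", conformance => 'strict', filter_invalid_chars => 0)), "\n";
```

Letting an explicit option override the conformance default keeps the escape hatch the reply mentions for applications that deliberately emit such characters.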
Fixed in 1.04