Subject: Should filter invalid characters
Hello,
if you pass a string to XML::Generator that contains invalid characters, the
generated XML is invalid, too. You can test this via the following command:
$ perl -MXML::Generator -e '$xg=XML::Generator->new(escape=>"always,even-entities",conformance=>"strict"); print $xg->foo("abc\0xyz"), "\n"' | xmllint -
-:1: parser error : Char 0x0 out of allowed range
<foo>abc
^
-:1: parser error : Premature end of data in tag foo line 1
<foo>abc
^
It seems that some characters are not allowed in UTF-8 XML, even if you try to
escape them:
$ echo "<foo>abc&#0;xyz</foo>" | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 0
<foo>abc&#0;xyz</foo>
So I see two possible solutions:
1) The developer who is using XML::Generator must filter all invalid characters
before passing them to XML::Generator.
2) XML::Generator offers a convenient way to have these invalid characters
filtered automatically.
I'd prefer option 2), e.g. by turning on an option in the constructor of
XML::Generator such as "filter_invalid_chars => 1".
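Until such an option exists, option 1) can be approximated with a small helper that is applied before the string reaches XML::Generator. The following is only a sketch: the name strip_invalid_xml10_chars is made up for illustration and is not part of XML::Generator, and it assumes the input is a decoded Perl character string and that XML 1.0 is the target.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::Generator: removes every character
# outside the XML 1.0 Char production
# (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
sub strip_invalid_xml10_chars {
    my ($str) = @_;
    return $str unless defined $str;
    $str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

my $clean = strip_invalid_xml10_chars("abc\0xyz");
print "$clean\n";    # prints "abcxyz": the NUL character has been removed
```

The cleaned string could then be handed over as usual, e.g. $xg->foo(strip_invalid_xml10_chars($input)).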
The characters that should be filtered seem to be the ones described here (but
the \0 null character is missing there):
http://stackoverflow.com/questions/1016910/how-can-i-strip-invalid-xml-characters-from-strings-in-perl
"The complete regex for removal of invalid xml-1.0 characters is:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
for xml-1.1 it is:
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
"
The following characters are allowed in XML 1.0; all others should be filtered:
http://www.w3.org/TR/xml/#charsets
And here for XML 1.1:
http://www.w3.org/TR/xml11/#charsets
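To see which characters of a given string fall outside these two sets, something like the following could be used. It is a rough sketch only: find_invalid_xml_chars is a hypothetical name, the character classes are taken from the allowed ranges quoted above, and the XML 1.1 "restricted" characters are not treated specially.

```perl
use strict;
use warnings;

# Character classes for the allowed ranges of XML 1.0 and XML 1.1,
# as quoted above. Restricted characters of XML 1.1 are not checked here.
my %allowed = (
    '1.0' => qr/[\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/,
    '1.1' => qr/[\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/,
);

# Hypothetical helper: returns the code points (as "U+XXXX" strings) of all
# characters in $str that are not allowed by the given XML version.
sub find_invalid_xml_chars {
    my ($str, $version) = @_;
    my $ok = $allowed{$version} or die "unknown XML version: $version";
    return map { sprintf('U+%04X', ord($_)) } grep { !/$ok/ } split //, $str;
}

# NUL is invalid in both versions; vertical tab (U+000B) only in XML 1.0.
print join(', ', find_invalid_xml_chars("a\x{0}b\x{B}c", '1.0')), "\n";
print join(', ', find_invalid_xml_chars("a\x{0}b\x{B}c", '1.1')), "\n";
```

A report like this might be useful for deciding whether to filter silently or to refuse the input.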
Distribution: self-compiled version 1.01 with perl v5.8.5;
also tested with Debian/sid libxml-generator-perl 1.01-3 and perl v5.12.4
What do you think about this topic?
Greetings,
Gert