Subject: output_encoding option should print unencodable characters as entities
Date: Fri, 21 Oct 2011 19:09:26 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
I'd like to request that if you set the output_encoding option of
Twig, it outputs characters not representable in that encoding as
entities.
Currently, if you set the output_encoding, any characters that can't
be encoded are replaced by a substitution character. Look at the
following output for example, where the character \x{2203} does not
have a representation in iso-8859-2 (the other characters do have a
representation), and a question mark is output in its place.
$ perl -we 'use XML::Twig; my $tw = XML::Twig->new(output_encoding =>
"iso-8859-2"); $tw->set_root(XML::Twig::Elt->new("d",
"\x{e9}l\x{151} \x{2203}t")); $tw->flush; print$/;' | od -tx1c -w8
0000000 3c 3f 78 6d 6c 20 76 65
< ? x m l v e
0000010 72 73 69 6f 6e 3d 22 31
r s i o n = " 1
0000020 2e 30 22 20 65 6e 63 6f
. 0 " e n c o
0000030 64 69 6e 67 3d 22 69 73
d i n g = " i s
0000040 6f 2d 38 38 35 39 2d 32
o - 8 8 5 9 - 2
0000050 22 3f 3e 3c 64 3e e9 6c
" ? > < d > é l
0000060 f5 20 3f 74 3c 2f 64 3e
ő ? t < / d >
0000070 0a
\n
0000071
I believe it would be better if the above command gave output
similar to the following command, which outputs the numeric XML entity
&#x2203; in place of that character. (This is analogous to how a less
than sign is always output as an entity.)
$ perl -we 'use XML::Twig; use Encode; my $tw =
XML::Twig->new(output_encoding => "iso-8859-2", output_text_filter =>
sub { encode("iso-8859-2", $_[0], Encode::FB_XMLCREF()) });
$tw->set_root(XML::Twig::Elt->new("d", "\x{e9}l\x{151} \x{2203}t"));
$tw->flush; print$/;' | od -tx1c -w8
0000000 3c 3f 78 6d 6c 20 76 65
< ? x m l v e
0000010 72 73 69 6f 6e 3d 22 31
r s i o n = " 1
0000020 2e 30 22 20 65 6e 63 6f
. 0 " e n c o
0000030 64 69 6e 67 3d 22 69 73
d i n g = " i s
0000040 6f 2d 38 38 35 39 2d 32
o - 8 8 5 9 - 2
0000050 22 3f 3e 3c 64 3e e9 6c
" ? > < d > é l
0000060 3f 20 26 23 78 32 32 30
? & # x 2 2 0
0000070 33 3b 74 3c 2f 64 3e 0a
3 ; t < / d > \n
0000100
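The difference between the two outputs can also be reproduced with
Encode alone, without XML::Twig. The following is only a minimal
sketch: the names hex_bytes, $substituted and $as_entity are made up
for illustration, and it assumes that the question mark seen in the
first output corresponds to Encode's default behaviour when no CHECK
argument is given.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same test string as in the commands above: e-acute, l, o-double-acute,
# space, U+2203 (which has no representation in iso-8859-2), t.
my $text = "\x{e9}l\x{151} \x{2203}t";

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02x", ord($_) } split //, $bytes;
}

# No CHECK argument: unencodable characters become a substitution
# character ("?" for this encoding).
my $substituted = encode("iso-8859-2", $text);

# FB_XMLCREF: unencodable characters become numeric character
# references. LEAVE_SRC keeps encode() from modifying $text.
my $as_entity = encode("iso-8859-2", $text,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC());

print hex_bytes($substituted), "\n";   # e9 6c f5 20 3f 74
print hex_bytes($as_entity), "\n";     # e9 6c f5 20 26 23 78 32 32 30 33 3b 74
```

The second line of output is what I would like output_encoding to
produce by default for the text content.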
I am using XML::Twig version 3.39; its configuration information is
attached at the bottom.
Ambrus
-----------
Configuration:
perl: 5.014002
OS: linux - x86_64-linux
required
XML::Parser : 2.41
expat : <no version information found>
Strongly Recommended
Scalar::Util : 1.23 (for improved memory management)
Encode : 2.42_01 (for encoding conversions)
Modules providing additional features
XML::XPathEngine : <not available> (to use XML::Twig::XPath)
XML::XPath : 1.13 (to use XML::Twig::XPath
if Tree::XPathEngine not available)
LWP : 6.02 (for the parseurl method)
HTML::TreeBuilder : 4.2 (to use parse_html and
parsefile_html)
HTML::Entities::Numbered : <not available> (to allow parsing of
HTML containing named entities)
HTML::Tidy : <not available> (to use parse_html and
parsefile_html with the use_tidy option)
HTML::Entities : 3.69 (for the html_encode filter)
Tie::IxHash : <not available> (for the keep_atts_order option)
Text::Wrap : 2009.0305 (to use the "wrapped"
option for pretty_print)
Modules used only by the auto tests
Test : 1.25_02
Test::Pod : 1.45
XML::Simple : <not available>
XML::Handler::YAWriter : <not available>
XML::SAX::Writer : <not available>
XML::Filter::BufferText : <not available>
IO::Scalar : <not available>
Please add this information to bug reports (you can run
t/zz_dump_config.t to get it)