This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 71844
Status: open
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ambrus [...] math.bme.hu
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: output_encoding option should print unencodable characters as entities
Date: Fri, 21 Oct 2011 19:09:26 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
I'd like to request that if you set the output_encoding option of Twig, it outputs characters not representable in that encoding as entities.

Currently, if you set the output_encoding, any characters that can't be encoded are replaced by a substitution character. Look at the following output for example, where the character \x{2203} does not have a representation in iso-8859-2 (the other characters do have a representation), and a question mark is output in its place.

$ perl -we 'use XML::Twig; my $tw = XML::Twig->new(output_encoding => "iso-8859-2"); $tw->set_root(XML::Twig::Elt->new("d", "\x{e9}l\x{151} \x{2203}t")); $tw->flush; print$/;' | od -tx1c -w8
0000000  3c  3f  78  6d  6c  20  76  65
          <   ?   x   m   l       v   e
0000010  72  73  69  6f  6e  3d  22  31
          r   s   i   o   n   =   "   1
0000020  2e  30  22  20  65  6e  63  6f
          .   0   "       e   n   c   o
0000030  64  69  6e  67  3d  22  69  73
          d   i   n   g   =   "   i   s
0000040  6f  2d  38  38  35  39  2d  32
          o   -   8   8   5   9   -   2
0000050  22  3f  3e  3c  64  3e  e9  6c
          "   ?   >   <   d   >   é   l
0000060  f5  20  3f  74  3c  2f  64  3e
          ő       ?   t   <   /   d   >
0000070  0a
         \n
0000071

I believe it would be better if the above command gave output similar to the following command, which outputs the numeric XML entity &#x2203; in place of that character. (This is analogous to how a less-than sign is always output as an entity.)

$ perl -we 'use XML::Twig; use Encode; my $tw = XML::Twig->new(output_encoding => "iso-8859-2", output_text_filter => sub { encode("iso-8859-2", $_[0], Encode::FB_XMLCREF()) }); $tw->set_root(XML::Twig::Elt->new("d", "\x{e9}l\x{151} \x{2203}t")); $tw->flush; print$/;' | od -tx1c -w8
0000000  3c  3f  78  6d  6c  20  76  65
          <   ?   x   m   l       v   e
0000010  72  73  69  6f  6e  3d  22  31
          r   s   i   o   n   =   "   1
0000020  2e  30  22  20  65  6e  63  6f
          .   0   "       e   n   c   o
0000030  64  69  6e  67  3d  22  69  73
          d   i   n   g   =   "   i   s
0000040  6f  2d  38  38  35  39  2d  32
          o   -   8   8   5   9   -   2
0000050  22  3f  3e  3c  64  3e  e9  6c
          "   ?   >   <   d   >   é   l
0000060  3f  20  26  23  78  32  32  30
          ?       &   #   x   2   2   0
0000070  33  3b  74  3c  2f  64  3e  0a
          3   ;   t   <   /   d   >  \n
0000100

I am using XML::Twig version 3.39, whose configuration information I attach at the bottom.

Ambrus

-----------

Configuration:

perl: 5.014002
OS: linux - x86_64-linux

required
XML::Parser               : 2.41
expat                     : <no version information found>

Strongly Recommended
Scalar::Util              : 1.23 (for improved memory management)
Encode                    : 2.42_01 (for encoding conversions)

Modules providing additional features
XML::XPathEngine          : <not available> (to use XML::Twig::XPath)
XML::XPath                : 1.13 (to use XML::Twig::XPath if Tree::XPathEngine not available)
LWP                       : 6.02 (for the parseurl method)
HTML::TreeBuilder         : 4.2 (to use parse_html and parsefile_html)
HTML::Entities::Numbered  : <not available> (to allow parsing of HTML containing named entities)
HTML::Tidy                : <not available> (to use parse_html and parsefile_html with the use_tidy option)
HTML::Entities            : 3.69 (for the html_encode filter)
Tie::IxHash               : <not available> (for the keep_atts_order option)
Text::Wrap                : 2009.0305 (to use the "wrapped" option for pretty_print)

Modules used only by the auto tests
Test                      : 1.25_02
Test::Pod                 : 1.45
XML::Simple               : <not available>
XML::Handler::YAWriter    : <not available>
XML::SAX::Writer          : <not available>
XML::Filter::BufferText   : <not available>
IO::Scalar                : <not available>

Please add this information to bug reports (you can run t/zz_dump_config.t to get it)
Subject: Re: [rt.cpan.org #71844] output_encoding option should print unencodable characters as entities
Date: Fri, 21 Oct 2011 19:35:56 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
That makes sense. I am away this week-end, I'll have a look at this next week.

--
mirod
Subject: Re: [rt.cpan.org #71844] output_encoding option should print unencodable characters as entities
Date: Sat, 22 Oct 2011 19:57:29 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
On Fri, Oct 21, 2011 at 7:09 PM, Zsbán Ambrus <ambrus@math.bme.hu> wrote:
> I believe it would be better if the above command gave output
> similar to the following command,
>
> $ perl -we 'use XML::Twig; use Encode; my $tw =
> XML::Twig->new(output_encoding => "iso-8859-2", output_text_filter =>
> sub { encode("iso-8859-2", $_[0], Encode::FB_XMLCREF()) });
> $tw->set_root(XML::Twig::Elt->new("d", "\x{e9}l\x{151} \x{2203}t"));
> $tw->flush; print$/;' | od -tx1c -w8
[...]
>           "   ?   >   <   d   >   é   l
> 0000060  3f  20  26  23  78  32  32  30
>           ?       &   #   x   2   2   0
> 0000070  33  3b  74  3c  2f  64  3e  0a
>           3   ;   t   <   /   d   >  \n
> 0000100
Wait a minute, I should have noticed yesterday that this second command is actually wrong. It apparently tries to double-encode the string because both output_encoding and output_text_filter are set, so the \x{151} comes out as a question mark. The following command (which does not use output_encoding) appears to work instead.

$ perl -we 'use XML::Twig; use Encode; my $tw = XML::Twig->new(output_text_filter => sub { encode("iso-8859-2", $_[0], Encode::FB_XMLCREF()) }); $tw->set_encoding("iso-8859-2"); $tw->set_root(XML::Twig::Elt->new("d", "\x{e9}l\x{151} \x{2203}t")); $tw->flush; print$/;' | od -tx1c -w8
0000000  3c  3f  78  6d  6c  20  76  65
          <   ?   x   m   l       v   e
0000010  72  73  69  6f  6e  3d  22  31
          r   s   i   o   n   =   "   1
0000020  2e  30  22  20  65  6e  63  6f
          .   0   "       e   n   c   o
0000030  64  69  6e  67  3d  22  69  73
          d   i   n   g   =   "   i   s
0000040  6f  2d  38  38  35  39  2d  32
          o   -   8   8   5   9   -   2
0000050  22  3f  3e  0a  3c  64  3e  e9
          "   ?   >  \n   <   d   >   é
0000060  6c  f5  20  26  23  78  32  32
          l   ő       &   #   x   2   2
0000070  30  33  3b  74  3c  2f  64  3e
          0   3   ;   t   <   /   d   >
0000100  0a
         \n
0000101

Ambrus