This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 80503
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: ambrus [...] math.bme.hu
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 3.44
Fixed in: (no value)



Subject: Newlines in attribute values
Date: Tue, 30 Oct 2012 22:37:29 +0100
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
Hello,

According to the specs, a newline character in an attribute value must be escaped with an entity, otherwise an XML reader will normalize it to a space, but XML::Twig's writer does not seem to know about this.

Let me tell the story of the details. I was trying to edit an XML file, actually a project configuration file of MS Visual Studio, with Twig. This XML had an attribute with an escaped CRLF inside an attribute value, something like "foo&#13;&#10;bar". This attribute was in an element I didn't change in my editing. When I tried to use the modified XML, I got an error. It turns out that XML::Twig wrote out the attribute with the CRLF unescaped, and the XML reader in MS Visual Studio read it as a single space.

After some inquiry, perlmonks told me that the behavior of the XML reader is correct. It turns out that the XML 1.0 standard says that if a reader finds an unescaped CR, LF, CRLF, or HT in an attribute value, it must normalize it to a space. You can find a reference for this behavior at "http://stackoverflow.com/questions/260436/preserving-attribute-whitespace-in-xslt".

It turns out that the reader part of XML::Twig behaves correctly: it too reads an unescaped newline in an attribute as a space, but the writer part fails to escape newlines. This means that when you read an escaped newline from an attribute and then write it out, the value changes, so I believe this is a bug in XML::Twig.

Here's a simple example showing the bug.

$ perl -we 'use XML::Twig; my $ct= qq(<m><n p="q&#x0a;r"/><s t="u\nv"/></m>); my $tw = XML::Twig->new; $tw->parse($ct); $tw->flush; print $/;'
<m><n p="q
r"/><s t="u v"/></m>
$

For this simple example, I'm using perl v5.16.1 on amd64-linux, XML::Twig v3.41, XML::Parser v2.41, Encode v2.44, all vanilla; with libexpat 2.0.1-7+squeeze1 from the debian package.

Ambrus

----
Configuration:

perl: 5.016001
OS: linux - x86_64-linux

required
  XML::Parser : 2.41
Can't exec "xmlwf": No such file or directory at t/zz_dump_config.t line 34.
Use of uninitialized value $xmlwf_v in pattern match (m//) at t/zz_dump_config.t line 35.
Missing argument in sprintf at t/zz_dump_config.t line 114.
  expat : <no version information found>

Strongly Recommended
  Scalar::Util : 1.25 (for improved memory management)
  Encode : 2.44 (for encoding conversions)

Modules providing additional features
  XML::XPathEngine : 0.13 (to use XML::Twig::XPath)
  XML::XPath : <not available> (to use XML::Twig::XPath if Tree::XPathEngine not available)
  LWP : 6.04 (for the parseurl method)
  HTML::TreeBuilder : 5.02 (to use parse_html and parsefile_html)
  HTML::Entities::Numbered : <not available> (to allow parsing of HTML containing named entities)
  HTML::Tidy : 1.54 (to use parse_html and parsefile_html with the use_tidy option)
  HTML::Entities : 3.69 (for the html_encode filter)
  Tie::IxHash : <not available> (for the keep_atts_order option)
  Text::Wrap : 2009.0305 (to use the "wrapped" option for pretty_print)

Modules used only by the auto tests
t/zz_dump_config.t .................. 1/1
  Test : 1.25_02
  Test::Pod : <not available>
  XML::Simple : <not available>
  XML::Handler::YAWriter : <not available>
  XML::SAX::Writer : <not available>
  XML::Filter::BufferText : <not available>
  IO::Scalar : <not available>
  IO::CaptureOutput : <not available>
On Tue Oct 30 17:37:48 2012, ambrus@math.bme.hu wrote:
> According to the specs, a newline character in an attribute value must
> be escaped with an entity otherwise an xml reader will normalize it to
> a space, but XML::Twig's writer does not seem to know about this.
Hi,

Sorry, I saw the bug report, thought about it, and... forgot to answer it.

First the workaround: if you create the twig using the keep_encoding option, then you get what you want. Be aware of the (potential) problems with keep_encoding though: all the character data you get, whether in attributes or elements, becomes unescaped, and is output as is, so if you add data, you have to escape it yourself.

A better fix is not possible, because XML::Parser normally reports the data after resolving the entities, so by the time it gets to XML::Twig the numerical entity is lost. For example:

perl -MXML::Parser -E'XML::Parser->new( Handlers => { Start => sub { my( $t, $tag, %att)= @_; say $att{p}; } })->parse( q{<d p="q&#x0a;r"/>})'

outputs this:

q
r

XML::Twig with the keep_encoding option has to resort to getting the original string from XML::Parser and re-parsing it.

Does this help?

__
mirod
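The keep_encoding workaround described in this reply might look like the minimal sketch below. It assumes XML::Twig is installed and reuses the attribute from the original report; the exact serialization may differ between XML::Twig versions, so no output is shown.

```perl
use strict;
use warnings;
use XML::Twig;

# keep_encoding => 1 makes XML::Twig work from the original string of the
# input instead of the entity-resolved data reported by XML::Parser.
# Caveat from the reply above: data added to the twig in this mode is
# written out as is, so it has to be escaped by the caller.
my $twig = XML::Twig->new( keep_encoding => 1 );
$twig->parse( q{<m><n p="q&#x0a;r"/></m>} );

# sprint returns the serialized document as a string.
my $out = $twig->sprint;
print $out, "\n";
```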
Subject: Re: [rt.cpan.org #80503] Newlines in attribute values
Date: Tue, 13 Nov 2012 20:18:05 +0100
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
On 11/13/12, MIROD via RT <bug-XML-Twig@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=80503 >
>
> On Tue Oct 30 17:37:48 2012, ambrus@math.bme.hu wrote:
>
>> According to the specs, a newline character in an attribute value must
>> be escaped with an entity otherwise an xml reader will normalize it to
>> a space, but XML::Twig's writer does not seem to know about this.
>
> A better fix is not possible, because XML::Parser normally reports the
> data after resolving the entities, so by the time it gets to XML::Twig
> the numerical entity is lost.
Hello mirod.

Thanks for your reply, but I think you may have misunderstood my report. It's true that keep_encoding could be used as a workaround, but I think a better fix _is_ possible.

Currently, when you have a newline in an attribute value, XML::Twig will output it as a literal newline.

$ perl -we 'use XML::Twig; $tw = XML::Twig->new; $tw->set_root(XML::Twig::Elt->new("m", {"n" => "p\nq"})); $tw->flush; print $/;'
<m n="p
q"/>
$

This is simply wrong, because the literal newline in the attribute value does not represent a newline, it represents a space instead. This is what the XML standard says, and this is how libexpat and other xml readers read the above output. Even XML::Twig reads this output that way, with a space in the attribute value.

In an attribute value, XML::Twig should always escape not only quotation marks and ampersands, but also newlines, because the XML syntax says they must be escaped. So if I run the above code with a hypothetical future version of XML::Twig, the output should be

<m n="p&#10;q"/>

because that's the only way to correctly represent the given attribute value in the output.

This would complicate the XML::Twig code because it means attribute values must be escaped in a different way from pcdata, but I still think such a fix is necessary.

Ambrus
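The attribute escaping proposed in this message can be sketched in plain Perl. This is an illustration only, not XML::Twig's actual code: the names escape_att_value and normalize_literal_ws are hypothetical, and the second function merely imitates how a reader treats literal whitespace in an attribute value, so that the round trip can be compared without a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use inside a double-quoted
# XML attribute value. The ampersand is handled first so that the
# references added afterwards are not escaped a second time.
sub escape_att_value {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/\t/&#9;/g;
    $value =~ s/\n/&#10;/g;
    $value =~ s/\r/&#13;/g;
    return $value;
}

# Hypothetical helper imitating a reader's handling of literal whitespace
# in an attribute value: CRLF and lone CR become LF, then every literal
# tab or LF becomes a space. Character references such as &#10; are left
# untouched here; a real parser would expand them to the referenced
# character without turning it into a space.
sub normalize_literal_ws {
    my ($raw) = @_;
    $raw =~ s/\r\n?/\n/g;
    $raw =~ s/[\t\n]/ /g;
    return $raw;
}

my $value     = "p\nq";
my $unescaped = normalize_literal_ws($value);
my $escaped   = normalize_literal_ws( escape_att_value($value) );

print qq(<m n="$unescaped"/>\n);    # the newline has been lost to a space
print qq(<m n="$escaped"/>\n);      # the reference &#10; survives
```

Escaping tab and carriage return alongside the newline covers the other whitespace characters that the first message lists as being normalized.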