
This queue is for tickets about the XML-RSS CPAN distribution.

Report information
The Basics
Id: 2472
Status: resolved
Priority: 0/
Queue: XML-RSS

People
Owner: KELLAN [...] cpan.org
Requestors: hroberts [...] cyber.law.harvard.edu
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.02
Fixed in: (no value)



Subject: encoding support is broken
I'm running version 1.02 of the module and am having trouble with the encoding support. The encode_text function seems to convert ampersands only when they do not begin an entity (e.g. it won't convert &amp;). The correct (and much simpler) thing to do is always to encode the ampersand. The current behavior will, for instance, break HTML character data by not encoding entities, because the XML parser on the other end will decode those entities.

For example, suppose I have an RSS item with a description of:

The ampersand entity is '&amp;amp;'.

When I encode that description using the current code, I get:

<description>The ampersand entity is '&amp;amp;'.</description>

When I parse that XML with a correct parser, I get the following as the description cdata:

The ampersand entity is '&amp;'.

By not encoding the entity, you've broken the string.

More importantly, you have a large list of entities that you will not replace, but the only standard XML entities are:

amp, lt, gt, apos, quot

Every other entity must be declared before it can be used. So when I pass a character data value containing, for example, an &nbsp;, and you leave it as is instead of encoding it as &amp;nbsp;, the parser that reads your output sees the undeclared (and therefore illegal) entity &nbsp; and throws a fatal error. This is in fact the behavior that caused me to look at the encoding function to see what it was doing, since expat correctly refused to parse a '&nbsp;' in the output from the module.

There is a clear statement of the proper way to encode character data at:

http://www.w3.org/TR/REC-xml#dt-chardata

The short of it is that you must always encode '&' and '<', and you must always encode '>' when it appears in the string ']]>' but does not mark the end of a CDATA section.
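To make the rule concrete, here is a minimal sketch of "always encode" escaping together with a round trip through an expat-based parser. It is written in Python purely for illustration and is not the module's Perl code; the names escape_chardata, round_trip and parses are hypothetical helpers, not part of XML::RSS.

```python
import xml.etree.ElementTree as ET


def escape_chardata(text: str) -> str:
    """Escape a string for use as XML character data.

    '&' is replaced first so that the '&' introduced by the later
    replacements is not escaped a second time.
    """
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")  # escaping every '>' also covers ']]>'
    )


def round_trip(text: str) -> str:
    """Wrap text in a <description> element, parse it, return its text."""
    xml = "<description>" + escape_chardata(text) + "</description>"
    return ET.fromstring(xml).text or ""


def parses(xml: str) -> bool:
    """Report whether the parser accepts the given document."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


# Every input comes back unchanged once '&' is always escaped.
for sample in ["The ampersand entity is '&amp;'.", "a&nbsp;b", "x ]]> y < z"]:
    assert round_trip(sample) == sample

# Leaving an undeclared entity unescaped makes the document unparseable.
assert not parses("<description>a&nbsp;b</description>")
```

Escaping every '>' is slightly more than the specification requires, but it keeps the function free of special cases.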
On Thu May 01 17:21:16 2003, guest wrote:
> I'm running version 1.02 of the module and am having trouble with the
> encoding support. [...]
Hi,

This should be fixed as of v1.12.

- ask