Subject: encoding support is broken
I'm running version 1.02 of the module and am having trouble with the encoding support. The encode_text function seems to convert ampersands only if they are not at the start of an entity (e.g. it won't convert &amp;). The correct (and much simpler) thing to do is always to encode the ampersand. The current behavior will, for instance, break HTML character data that contains entities, because the XML parser on the other end will decode any entity that was left unencoded. For example, suppose I have an RSS item with a description of:
The ampersand entity is '&amp;'.
When I encode that description using the current code, I get:
<description>The ampersand entity is '&amp;'.</description>
When I parse that XML with a correct parser, I get the following as the description's character data:
The ampersand entity is '&'.
By not encoding the entity, you've broken the string.
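The round trip can be reproduced with a short Python sketch. This is not the module's code: `encode_skipping_entities` is a hypothetical stand-in for the behavior described above, and the regular expression in it is an assumption about what "looks like an entity". The always-encode case uses `escape` from Python's standard library.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def encode_skipping_entities(text):
    # Hypothetical stand-in for the reported behavior: escape '&' only
    # when it does not already look like the start of an entity.
    text = re.sub(r"&(?!#?\w+;)", "&amp;", text)
    return text.replace("<", "&lt;")


def round_trip(encoder, text):
    # Wrap the encoded text in an element, parse it, and return the
    # character data that a conforming parser hands back.
    document = "<description>" + encoder(text) + "</description>"
    return ET.fromstring(document).text


original = "The ampersand entity is '&amp;'."

# Skipping entities: the parser decodes '&amp;', so the text changes.
print(round_trip(encode_skipping_entities, original))

# Always encoding: the parser returns exactly what was put in.
print(round_trip(escape, original))
```

Only the second call returns the original string unchanged.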
More importantly, you have a large list of entities that you will not replace, but the only standard XML entities are:
amp, lt, gt, apos, quot
Every other entity must be declared before it can be used. So when I pass a character data value with, for example, an &nbsp; in it, and you do not encode the &nbsp; into &amp;nbsp; but instead leave it as is, the parser that tries to read your output sees the undeclared (and therefore illegal) entity &nbsp; and throws a fatal error. This is in fact the behavior that caused me to look at the encoding function to see what it was doing, since expat correctly refused to parse an '&nbsp;' in the output from the module.
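The parser failure is easy to reproduce through Python's expat binding. The sketch below is only an illustration: the helper name `parses_cleanly` and the sample documents are made up for this example, and neither document has a DTD, so nbsp is undeclared in both.

```python
import xml.parsers.expat


def parses_cleanly(document):
    # Return True if expat accepts the document, False on a fatal error.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Undeclared entity left as is: expat reports a fatal error.
print(parses_cleanly("<description>a&nbsp;b</description>"))

# Ampersand encoded: the document is well-formed and the reader
# gets the literal text 'a&nbsp;b' back.
print(parses_cleanly("<description>a&amp;nbsp;b</description>"))
```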
There is a clear statement of the proper way to encode character data at:
http://www.w3.org/TR/REC-xml#dt-chardata
The short of it is that you must always encode '&' and '<', and you must always encode '>' when it appears in the string ']]>' but does not mark the end of a CDATA section.
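As a minimal sketch of that rule, written in Python rather than in the module's own language, the whole encoder can be three replacements. The function name `encode_chardata` is hypothetical; the one thing that matters is that the ampersand is replaced first, so the ampersands introduced by the later replacements are not encoded a second time.

```python
def encode_chardata(text):
    # Always encode '&', and do it first.
    text = text.replace("&", "&amp;")
    # Always encode '<'.
    text = text.replace("<", "&lt;")
    # '>' only has to be encoded when it would complete ']]>'.
    return text.replace("]]>", "]]&gt;")


print(encode_chardata("The ampersand entity is '&amp;'."))
# The ampersand entity is '&amp;amp;'.

print(encode_chardata("AT&T <b> ]]> x > y"))
# AT&amp;T &lt;b> ]]&gt; x > y
```

Encoding every '>' as &gt; is also legal and would be simpler still; this version just follows the minimum the spec requires.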