Subject: Escape ampersands and add fault tolerance
I found that some RSS feeds use HTML special characters in their XML fields;
in particular, they may contain an unescaped '&' (ampersand) in the title
field. Such items cannot be parsed; worse, a small error in one item causes
the whole document to be treated as corrupted. I tried the same document in
Google RSS reader and it worked fine. Since we cannot expect every RSS feed
to be well-formed, I would suggest:
1. Treat unrecognized tokens that begin with '&' (amp), '<' (lt), or
'>' (gt) as normal text.
2. If an item cannot be parsed, ignore it and continue from the
next one (the corrupted item may still be returned).
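
To illustrate what I mean, here is a rough Python sketch of both suggestions. It is not based on this library's code: the names escape_stray_amps and parse_items are hypothetical, it uses Python's standard xml.etree.ElementTree, and it only covers the stray '&' case, not stray '<' or '>'. It assumes items are not nested and appear as plain <item>...</item> blocks.

```python
import re
import xml.etree.ElementTree as ET

# An '&' that does not start one of the five predefined XML entities
# or a numeric character reference is treated as literal text.
_STRAY_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);|#\d+;|#x[0-9A-Fa-f]+;)")

# Assumption: items are not nested and use the plain <item> tag.
_ITEM = re.compile(r"<item\b.*?</item>", re.DOTALL)


def escape_stray_amps(text):
    """Suggestion 1 (ampersands only): escape any unrecognized '&'."""
    return _STRAY_AMP.sub("&amp;", text)


def parse_items(feed_text):
    """Suggestion 2: parse each item on its own and skip the broken ones.

    Returns (parsed, skipped): parsed holds Element objects, skipped holds
    the raw text of items that still failed to parse after escaping.
    """
    parsed, skipped = [], []
    for match in _ITEM.finditer(feed_text):
        raw = match.group(0)
        try:
            parsed.append(ET.fromstring(escape_stray_amps(raw)))
        except ET.ParseError:
            # One bad item no longer invalidates the whole document.
            skipped.append(raw)
    return parsed, skipped
```

One known limitation of this sketch: because each item is parsed in isolation, an item using a namespace prefix declared on the root element (for example dc:creator) would raise a ParseError and end up in the skipped list. A real fix inside the parser would not have that problem.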
Thanks a lot!