Subject: Error Processing Broken Context
Date: Fri, 25 Jun 2010 15:22:31 -0700
To: bug-xml-libxml [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
Parsing the attached feed (Perl 5.12.1, XML::LibXML 1.70, libxml2 2.7.7) with this line:
perl -MXML::LibXML -e 'XML::LibXML->new->parse_file(shift)' ~/Desktop/thedieline.rss
I get this error:
Malformed UTF-8 character (fatal) at /usr/local/lib/perl5/site_perl/5.12.1/darwin-thread-multi-2level/XML/LibXML/Error.pm line 217.
Line 217 of Error.pm is:
$context=~s/[^\t]/ /g;
If I comment it out, I get the full error, albeit poorly formatted:
thedieline.rss:26: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x26 0x6C 0x64
<snip />
So clearly something in the string trips up the regex, and the downside is that I lose the actual parser error. I'm not exactly sure what to do about that. In further testing, it appears that the utf8 flag is set on $context, yet it contains invalid UTF-8. Is XML::LibXML improperly turning on this flag before it is certain that the text is UTF-8?
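For reference, here is a minimal sketch of how that state can be checked. It does not touch XML::LibXML at all: as an assumption, it uses the four bytes from the parser error as a stand-in for the real $context, and forces the flag on with Encode::_utf8_on to mimic what I suspect is happening.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: the four bytes reported by the parser (0xC3 0x26 0x6C 0x64)
# stand in for the real $context string.
my $context = "\xC3\x26\x6C\x64";

# Turn the UTF-8 flag on without any validation, to mimic the suspected state.
Encode::_utf8_on($context);

# utf8::is_utf8 reports only the flag; utf8::valid checks that the
# underlying bytes are well-formed for a flagged string.
my $flagged = utf8::is_utf8($context) ? 1 : 0;
my $valid   = utf8::valid($context)   ? 1 : 0;

print "utf8 flag set: $flagged\n";
print "well-formed:   $valid\n";
```

A string that is flagged but not well-formed is the combination that would explain the fatal error in the regex.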
If I turn off the UTF-8 flag, I get a much better error message. So maybe the code should be updated to catch that exception, turn off the utf8 flag, and try again?
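Something along these lines is what I have in mind. This is only a sketch, not a patch against Error.pm: blank_context is a hypothetical helper name, and I am assuming the substitution from line 217 is the only operation that needs protecting.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: replace every non-tab character in the context
# string with a space, falling back to byte semantics if the regex
# engine dies on malformed UTF-8.
sub blank_context {
    my ($context) = @_;

    # First attempt: the same substitution as line 217, wrapped in eval
    # so a fatal "Malformed UTF-8 character" does not escape.
    my $ok = eval { $context =~ s/[^\t]/ /g; 1 };

    unless ($ok) {
        # Fallback: drop the UTF-8 flag so the string is treated as raw
        # bytes, then retry. Each non-tab byte becomes one space.
        Encode::_utf8_off($context);
        $context =~ s/[^\t]/ /g;
    }

    return $context;
}

# Usage: a flagged string holding the invalid bytes from the error above.
my $bad = "\xC3\x26\x6C\x64";
Encode::_utf8_on($bad);
my $blanked = blank_context($bad);
print "[$blanked]\n";
```

With the flag off, the padding is counted in bytes rather than characters, so any marker built from it could be offset on lines that contain multibyte characters. That still seems better than losing the parser error entirely.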
Thanks,
David