Bug #43004 for XML-Feed: XML::Feed: Atom feeds come out as bytes, but RSS as Unicode

Tue Feb 03 14:32:07 2009 smcv [...] debian.org - Ticket created

Subject:	XML::Feed: Atom feeds come out as bytes, but RSS as Unicode
Date:	Tue, 3 Feb 2009 19:31:00 +0000
To:	bug-XML-Feed [...] rt.cpan.org
From:	Simon McVittie <smcv [...] debian.org>

XML::Atom has a bizarre API where by default, text is returned as a string of UTF-8 bytes without the Unicode flag set. XML::RSS::Feed doesn't do this.

To make the output of XML::Feed the same in both cases, XML::Feed should probably use "{ local $XML::Atom::ForceUnicode = 1; ... }" around each read access to the XML::Atom object's accessor functions, resulting in a switch to Unicode output that matches XML::RSS::Feed.

This bug breaks IkiWiki <http://ikiwiki.info/> when aggregating Atom feeds; it ends up "double-escaping" the entries as they're written into the cache. For instance, U+2019 (decimal 8217) closing single quote goes into the cache file as the 6-byte sequence "\xC3\xA2\xC2\x80\xC2\x99", rather than the correct 3-byte sequence "\xE2\x80\x99"; the effect is as if the string was encoded as UTF-8, decoded as Latin-1, then encoded as UTF-8 again.

Simon
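
The byte sequences quoted above can be reproduced with a short sketch. It is written in Python purely to show the byte arithmetic (the affected code is Perl), and the helper name `double_encode` is hypothetical. It takes the closing single quote (code point U+2019, decimal 8217), encodes it as UTF-8, misreads those bytes as Latin-1, and encodes the result as UTF-8 again.

```python
def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread the bytes as Latin-1, then encode as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")       # correct encoding of the text
    misread = utf8_bytes.decode("latin-1")  # each byte wrongly becomes one character
    return misread.encode("utf-8")          # second, unwanted encoding pass


quote = "\u2019"                 # closing single quote
correct = quote.encode("utf-8")  # the 3-byte sequence that should be cached
broken = double_encode(quote)    # the 6-byte sequence seen in the cache

print(correct.hex(" "))
print(broken.hex(" "))
```

Each of the three original bytes is 0x80 or above, so each one expands to two bytes in the second pass, which is why the result is exactly twice as long.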

Tue Jul 07 08:52:13 2009 elmex [...] ta-sa.org - Correspondence added

I too have the same problem. And setting $XML::Atom::ForceUnicode = 1; fixes this for me. But I'm afraid that it's a global variable and I can't set it in my module AnyEvent::Feed which uses XML::Feed.

Greetings, Robin

Tue Jul 07 08:52:14 2009 The RT System itself - Status changed from 'new' to 'open'

Mon Nov 16 21:02:41 2009 simonw [...] cpan.org - Correspondence added

Hmm, I'm not entirely sure what the best way to handle this is - setting ForceUnicode is kind of a nuclear option which could screw up other modules in, say, a mod_perl environment. I'm talking to Tatsuhiko Miyagawa about it and I'll get back to you.

Thu May 20 15:18:50 2010 dwheeler [...] cpan.org - Correspondence added

On Mon Nov 16 21:02:41 2009, SIMONW wrote:

> Hmm, I'm not entirely sure what the best way to handle this is - setting
> ForceUnicode is kind of a nuclear option which could screw up other
> modules in, say, a mod_perl environment.
>
> I'm talking to Tatsuhiko Miyagawa about it and I'll get back to you.

I discovered this solution myself. I'd love to see XML::Atom have an object attribute to force decoding to utf8. Frankly, it should be enabled by default. Best, David

Thu Nov 24 06:28:26 2011 SHLOMIF [...] cpan.org - Correspondence added

Hi all,

I've been bitten by this bug myself now when trying to combine my blogs.perl.org blog feed, which is only provided in Atom (why??), with the rest of my feeds. The ForceUnicode workaround described in this thread works nicely, but there should be a more permanent solution.

Regards,
-- Shlomi Fish

Thu Nov 24 06:37:42 2011 DAVECROSS [...] cpan.org - Correspondence added

On Tue Feb 03 14:32:07 2009, smcv@debian.org wrote:

> XML::Atom has a bizarre API where by default, text is returned as a
> string of UTF-8 bytes without the Unicode flag set. XML::RSS::Feed
> doesn't do this.
>
> To make the output of XML::Feed the same in both cases, XML::Feed
> should probably use "{ local $XML::Atom::ForceUnicode = 1; ... }"
> around each read access to the XML::Atom object's accessor functions,
> resulting in a switch to Unicode output that matches XML::RSS::Feed.
>
> This bug breaks IkiWiki <http://ikiwiki.info/> when aggregating Atom
> feeds; it ends up "double-escaping" the entries as they're written
> into the cache. For instance, U+2019 (decimal 8217) closing single
> quote goes into the cache file as the 6-byte sequence
> "\xC3\xA2\xC2\x80\xC2\x99", rather than the correct 3-byte sequence
> "\xE2\x80\x99"; the effect is as if the string was encoded as UTF-8,
> decoded as Latin-1, then encoded as UTF-8 again.
>
> Simon

Does it make sense to discuss this here? Isn't it a bug in XML::Atom? Or am I misunderstanding? Dave...

Thu Nov 24 06:38:28 2011 DAVECROSS [...] cpan.org - Taken

Thu Nov 24 07:01:22 2011 smcv [...] debian.org - Correspondence added

Subject:	Re: [rt.cpan.org #43004] XML::Feed: Atom feeds come out as bytes, but RSS as Unicode
Date:	Thu, 24 Nov 2011 12:01:09 +0000
To:	Dave Cross via RT <bug-XML-Feed [...] rt.cpan.org>
From:	Simon McVittie <smcv [...] debian.org>

On Thu, 24 Nov 2011 at 06:37:43 -0500, Dave Cross via RT wrote:

> Does it make sense to discuss this here? Isn't it a bug in XML::Atom?
>
> Or am I misunderstanding?

I agree that this needs discussion with the author of XML::Atom. I don't know how you Cc people "correctly" in RT, it's not a bug tracker I'm particularly familiar with.

As far as I'm concerned, the bug in X::F is that it doesn't produce the same data type for RSS and Atom feeds (breaking encapsulation), and the underlying bugs in X::A that make it hard for X::F to do the right thing are:

1) produces a byte-string of UTF-8, rather than a Unicode string, by default (might not be considered to be a bug, since it's documented in XML::Atom::Feed; or might be considered to be a bug but unfixable, since that would be an API break)

2) can only be directed to produce Unicode by setting a global variable (this is an API design problem, rather than not behaving as documented)

Three possible solutions:

* If (1) is considered to be a bug, make XML::Atom::ForceUnicode the default, and XML::Feed doesn't need any changes; requires changes to X::A only.

* If (1) is as designed or is unfixable, fix (2) instead (e.g. add $feed->unicode(1) setter) and then change XML::Feed to use it; requires changes to both X::A and X::F. I'd be inclined to say this one is the most correct.

* If (1) is as designed, postprocess the XML::Atom output through Encode::decode('utf-8', $bytes) in XML::Feed; requires changes to X::F only, but will break if (1) is changed in a later version of X::A.

Which one is correct is up to you and the author of XML::Atom.

For now, IkiWiki sets "local $XML::Atom::ForceUnicode = 1" around each invocation of XML::Feed, because we know that it's single-threaded, so the usual problems with global variables are less of a concern. I realise this would be unacceptable in a library, though.

S

Fri Jan 11 14:00:28 2019 me [...] eboxr.com - Correspondence added

Ticket migrated to github as https://github.com/davorg/xml-feed/issues/44

Sat Jan 12 01:54:55 2019 DAVECROSS [...] cpan.org - Correspondence added

See Github issue instead. https://github.com/davorg/xml-feed/issues/43

Sat Jan 12 01:54:56 2019 DAVECROSS [...] cpan.org - Status changed from 'open' to 'resolved'