Bug #19722 for XML-Atom-SimpleFeed: Can't make it to work with international charsets

Tue Jun 06 00:56:21 2006 Guest - Ticket created

Subject:

Can't make it to work with international charsets

Encoding is "us-ascii", and any international content I add appears garbled in my RSS reader. I think two things need to be changed to fix the situation: set encoding="UTF-8", and do not create &#231;-type entities out of non-ASCII bytes.

Tue Jun 06 20:56:44 2006 pagaltzis [...] gmx.de - Correspondence added

Subject:	Re: [rt.cpan.org #19722] Can't make it to work with international charsets
Date:	Wed, 7 Jun 2006 02:56:30 +0200
To:	bug-XML-Atom-SimpleFeed [...] rt.cpan.org
From:	"A. Pagaltzis" <pagaltzis [...] gmx.de>

> Subject: XML-Atom-SimpleFeed
> Date: Tue, 06 Jun 2006 04:14:32 +0300
>
> Is there a way to change the encoding of a feed to UTF-8? I'm
> asking because I have a Greek feed, which I fill with Unicode
> data from a database, and your module creates a feed with
> "us-ascii" and turns all the Greek letters into &#207; etc.
> entities, which makes the RSS file unreadable when opening with
> a text editor.
>
> If there is no way to change the encoding, could you please
> change the default encoding of your module to UTF-8, as UTF-8
> is standard nowadays for XML?
>
> Thank you.
>
> - Alex
>
> P.S. Another nice addition might be to produce tidy XML code,
> with newlines and tabs.

Tue Jun 06 20:56:45 2006 The RT System itself - Status changed from 'new' to 'open'

Tue Jun 06 20:59:17 2006 ARISTOTLE [...] cpan.org - Taken

Tue Jun 06 21:12:50 2006 ARISTOTLE [...] cpan.org - Correspondence added

On Tue Jun 06 00:56:21 2006, guest wrote:

> Encoding is "us-ascii", and any international content I add appears
> garbled in my RSS reader.

Garbled? You mean the content displays incorrectly in an aggregator? If so, then that would be a serious problem and I'd like to ask for some sample code that reproduces the problem. Or are you referring to the same issue as in the other mail, ie. the source is merely unreadable, though it gets decoded as it should by aggregators?

> I think two things need to be changed to fix the situation:
> encoding="UTF-8", and not create &#231;-type entities out of non-ascii
> bytes.

I didn't think about feeds with content in non-Latin-based scripts, that is true. They would be unreadable when viewed as source. My decision to output only "us-ascii" as encoding was based on the fact that many servers are misconfigured and will produce wrong headers; it is also harder to handle things perfectly correctly on the Perl side, making sure that non-ASCII / non-UTF8 strings are properly upgraded to Unicode. Restricting output to ASCII seemed like the easiest way to ensure minimum possible breakage. I guess I'll have to think of a way to make it easy to get encodings right... it's not as simple a question as it seems, because I want the module to be somewhat robust about encodings even in the face of people who don't know exactly what they are doing.

Tue Jun 06 22:01:59 2006 karjala [...] karjala.org - Correspondence added

Subject:	Re: [rt.cpan.org #19722] Can't make it to work with international charsets
Date:	Wed, 07 Jun 2006 05:00:26 +0300
To:	bug-XML-Atom-SimpleFeed [...] rt.cpan.org
From:	Alexander Karelas <karjala [...] karjala.org>

I don't know much about the problems that occur when the servers are misconfigured. But the way I found to solve this problem is to change the module to act as follows:

- say UTF-8 instead of us-ascii, and
- don't encode the characters from \x80 onwards

And that's all. I don't know if that helps at all...

- Alex

Aristotle Pagaltzis via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=19722 >
>
> On Tue Jun 06 00:56:21 2006, guest wrote:
>
>> Encoding is "us-ascii", and any international content I add appears
>> garbled in my RSS reader.
>>
>
> Garbled? You mean the content displays incorrectly in an aggregator? If
> so, then that would be a serious problem and I'd like to ask for some
> sample code that reproduces the problem. Or are you referring to the
> same issue as in the other mail, ie. the source is merely unreadable,
> though it gets decoded as it should by aggregators?
>
>> I think two things need to be changed to fix the situation:
>> encoding="UTF-8", and not create &#231;-type entities out of non-ascii
>> bytes.
>>
>
> I didn't think about feeds with content in non-Latin-based scripts, that
> is true. They would be unreadable when viewed as source.
>
> My decision to output only "us-ascii" as encoding was based on the fact
> that many servers are misconfigured and will produce wrong headers; it
> is also harder to handle things perfectly correctly on the Perl side,
> making sure that non-ASCII / non-UTF8 strings are properly upgraded to
> Unicode. Restricting output to ASCII seemed like the easiest way to
> ensure minimum possible breakage.
>
> I guess I'll have to think of a way to make it easy to get encodings
> right... it's not as simple a question as it seems, because I want the
> module to be somewhat robust about encodings even in the face of people
> who don't know exactly what they are doing.


Tue Jun 06 22:35:54 2006 ARISTOTLE [...] cpan.org - Correspondence added

On Tue Jun 06 22:01:59 2006, KARJALA wrote:

> I don't know if that helps at all...

I already know that much. And your suggestion will work in the simple case where all input is consistently encoded and the string is output correctly (which seems to be the case in your code). But I'm not sure it's enough to deal with more complex scenarios robustly: how do I react if the caller gives me several non-Unicode strings in different encodings? How do I *detect* it? Just encoding everything and capping output at "us-ascii" at least ensures that the feed will always be well-formed XML no matter what.

I will have to think about this. It's clearly going to be an issue for others too; people publishing in Asian scripts f.ex. will very likely need to use UTF-16. Hrmf.

Tue Jun 06 23:16:20 2006 karjala [...] karjala.org - Correspondence added

Subject:	Re: [rt.cpan.org #19722] Can't make it to work with international charsets
Date:	Wed, 07 Jun 2006 06:14:58 +0300
To:	bug-XML-Atom-SimpleFeed [...] rt.cpan.org
From:	Alexander Karelas <karjala [...] karjala.org>

I don't know if it's possible to detect the encoding. Maybe you could ask the user to provide during object creation:

(1) an optional global encoding for the feed (which defaults to utf-8), and
(2) an optional encoding for each feed item (which defaults to the global encoding),

and then have one of the ready-made modules on CPAN translate the items' texts from encoding #2 to encoding #1. That's the only solution I can come up with.
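The two-level scheme proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `transcode_items` and the `octets` / `encoding` hash keys are hypothetical and not part of XML::Atom::SimpleFeed. Only `decode`, `encode` and `FB_CROAK` come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode every item from its own declared encoding
# (falling back to the feed-wide one), then encode it once in the
# feed-wide encoding.
sub transcode_items {
    my ($feed_encoding, @items) = @_;
    $feed_encoding = 'UTF-8' unless defined $feed_encoding;

    my @out;
    for my $item (@items) {
        my $from = defined $item->{encoding} ? $item->{encoding} : $feed_encoding;

        # Work on a copy: with a CHECK value set, Encode may modify its input.
        my $octets = $item->{octets};

        # FB_CROAK makes Encode die on malformed input and on characters
        # that the target encoding cannot represent.
        my $text = decode($from, $octets, Encode::FB_CROAK);
        push @out, encode($feed_encoding, $text, Encode::FB_CROAK);
    }
    return @out;
}

my @out = transcode_items(
    'UTF-8',
    { octets => "\xE1", encoding => 'ISO-8859-7' },  # Greek small alpha in ISO-8859-7
    { octets => "\xCE\xB1" },                        # the same letter, already UTF-8
);

# Both items now hold the same two UTF-8 octets.
printf "%vX\n", $_ for @out;
```

Note that this only works if the caller declares each encoding correctly; nothing in the sketch attempts to detect one.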

Fri Sep 08 11:46:13 2006 jrockway [...] cpan.org - Correspondence added

From:	JROCKWAY [...] cpan.org

Unless you're working on this right now, I think I'll have a patch for this problem soon. What your module is doing is encoding the individual octets of utf-8 characters as entities. You can look at http://blog.jrock.us/feeds/article/%E9%9B%BB%E8%BB%8A%E7%94%B7/xml for an example of this. All the Japanese comes out garbled because the entities represent individual octets of the multi-byte character sequence.

My solution to this is to just set the encoding to utf-8 and dump the raw octets that perl uses internally (utf-8).

If you want, I could add a charset config option and try to have Encode do a charset conversion (and throw an exception if it's not possible to represent the content in memory in that charset).

Regards,
Jonathan Rockway

On Tue Jun 06 00:56:21 2006, guest wrote:

> Encoding is "us-ascii", and any international content I add appears
> garbled in my RSS reader.
>
> I think two things need to be changed to fix the situation:
> encoding="UTF-8", and not create &#231;-type entities out of non-ascii
> bytes.

-- Jonathan Rockway <jrockway@cpan.org>
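The symptom described above, entities that represent individual octets rather than characters, can be reproduced with a minimal sketch. The `ascii_escape` function is a hypothetical stand-in for an ASCII-only serialiser and is not the module's actual code; `decode` is from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for an ASCII-only serialiser: every character
# above 0x7F becomes a decimal numeric character reference.
sub ascii_escape {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $str;
}

# The two UTF-8 octets of GREEK SMALL LETTER ALPHA (U+03B1), as they
# might arrive from a data source that does not decode its output.
my $octets = "\xCE\xB1";

# Passed in as-is, each octet is treated as a separate character.
print ascii_escape($octets), "\n";                    # &#206;&#177;

# Decoded to a text string first, the single character is escaped once.
print ascii_escape(decode('UTF-8', $octets)), "\n";   # &#945;
```

The first form is still well-formed XML, but a feed reader will display two Latin-1 characters instead of one Greek letter.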

Fri Sep 08 18:55:17 2006 pagaltzis [...] gmx.de - Correspondence added

Subject:	Re: [rt.cpan.org #19722] Can't make it to work with international charsets
Date:	Sat, 9 Sep 2006 00:55:24 +0200
To:	Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed [...] rt.cpan.org>
From:	"A. Pagaltzis" <pagaltzis [...] gmx.de>

Hi Jonathan,

* Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed@rt.cpan.org> [2006-09-08 17:50]:

> All the Japanese comes out garbled because the entities
> represent individual octets of the multi-byte character
> sequence.
>
> My solution to this is to just set the encoding to utf-8 and
> dump the raw octets that perl uses internally (utf-8).

I suggest you mark the string as UTF-8 prior to passing it in. Then X::A::SF will do the right thing without any further adjustments.

> If you want, I could add a charset config option and try to
> have Encode do a charset conversion (and throw an exception if
> it's not possible to represent the content in memory in that
> charset).

I have something like that already planned, but I’m still thinking about it. I consciously chose to limit the output to US-ASCII because then it doesn’t matter what the caller does with the string: it will never be double-encoded in any form. But obviously, for people whose language is sufficiently far outside the US-ASCII charset, the result is an unreadable entity forest, so I will need to provide some way to specify an encoding.

An issue with that is that with the current internal design, such an option could only be set by the constructor. I am wondering whether to enshrine this limitation in the API or to change the design.

I know the encodings problem seems very simple to solve, but there’s more to it than you’d think. I want to make the API as transparent as possible WRT encodings, and that’s going to take some thinking.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Mon Sep 11 11:47:58 2006 jrockway [...] cpan.org - Correspondence added

From:	JROCKWAY [...] cpan.org

You are absolutely right -- your module works fine. I must not be setting the utf8 flag somewhere -- OpenBSD has no concept of locales in its C library, so I have to do everything myself. Grumble, grumble. :)

> * Jonathan Rockway via RT <bug-XML-Atom-SimpleFeed@rt.cpan.org> [2006-09-08 17:50]:
> > All the Japanese comes out garbled because the entities
> > represent individual octets of the multi-byte character
> > sequence.
> >
> > My solution to this is to just set the encoding to utf-8 and
> > dump the raw octets that perl uses internally (utf-8).

-- Jonathan Rockway <jrockway@cpan.org>

Tue Jun 23 12:05:02 2009 daxim [...] cpan.org - Correspondence added

Attached documentation patch clarifies the findings about encoding; I reused some wording from above. Therein I also advise to delegate output transformation to specialised tools. Hopefully this is enough to close this bug.

I wish to add my proverbial mustard to some other topics from this thread.

> how do I react if
> the caller gives me several non-Unicode strings in different encodings?

You cannot outsmart people who are so terminally confused. No one else tries to. The documentation should make explicit that this module accepts Text strings (in perlunitut jargon) and DTRT with them. If someone wants to ignore that, then just let him: garbage in, garbage out.

> How do I *detect* it?

»Oh, that way madness lies; let me shun that.« The best heuristic unsurprisingly comes from NSUniversalDetector <http://www.mozilla.org/projects/intl/detectorsrc.html>, <http://search.cpan.org/dist/Encode-Detect/>, but IMO this has no place in X::A::SF.

> people publishing in Asian scripts f.ex. will very likely
> need to use UTF-16.

No, my observation is that on the web, national encodings are the rule: GB18030 (often misdeclared as GB2312), Shift-JIS, Big5... Indic scripts standardised on UTF-8 due to their late-comer status.

From 0861c3e55de3a07829c793560554c655b8ea6b82 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?Lars=20D=C9=AA=E1=B4=87=E1=B4=84=E1=B4=8B=E1=B4=8F=E1=B4=A1=20=E8=BF=AA=E6=8B=89=E6=96=AF?= <daxim@cpan.org>
Date: Tue, 23 Jun 2009 16:46:45 +0200
Subject: [PATCH] RT #19722: docs about how encoding is handled

---
 lib/XML/Atom/SimpleFeed.pm |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/lib/XML/Atom/SimpleFeed.pm b/lib/XML/Atom/SimpleFeed.pm
index d401148..2c81def 100644
--- a/lib/XML/Atom/SimpleFeed.pm
+++ b/lib/XML/Atom/SimpleFeed.pm
@@ -700,6 +700,8 @@ The C<source> element is not and may never be supported.
 
 Nothing is done to ensure that text constructs with type C<xhtml> and entry contents using either that or an XML media type are well-formed. So far, this is by design. You should strongly consider using an XML writer if you want to include content with such types in your feed.
 
+The XML representation of the feed is encoded in C<us-ascii> only, characters outside this repertoire are encoded as decimal numeric character references, e.g. C<&#12345;>. This makes output files robust against misconfigured webservers that produce wrong headers. As this module does not depend on an external XML writer, but uses a minimal serialiser internally, it also helps reduce its complexity. Encoding should not matter; feed consuming software will just do the right thing. But sometimes it is convenient to be able to read the XML source without the confusing entities. In that case, filter it through an external tool for pretty-printing, e.g. C<xmllint --format --encode utf-8>, or programmatically through an XML library, e.g. L<XML::LibXML::Document/"setEncoding">.
+
 If you find bugs or you have feature requests, please report them to L<mailto:bug-xml-atom-simplefeed@rt.cpan.org>, or through the web interface at L<http://rt.cpan.org>.
 
-- 
1.6.3

Tue Jun 23 14:09:46 2009 ARISTOTLE [...] cpan.org - Correspondence added

> Attached documentation patch clarifies the findings about
> encoding, I reused some wording from above. Therein I also
> advise to delegate output transformation to specialised tools.
> Hopefully this is enough to close this bug.

Thanks. I’m not sure whether I want to apply this, though. Since the time when this ticket was filed, I have learned quite a few things and changed my position on others.

The plan is now to make the charset configurable. I have long wanted to refactor the internals, as they currently produce fragments of XML as you call methods, which are glued together in the end. This approach makes the internals very inflexible. As part of this, I have accepted the reality that there are no good XML emitter modules on CPAN, and taken it upon myself to write one, whose API closely follows HTML::Tiny. Once that is done, SimpleFeed is due for a complete (though incremental) overhaul. And at that time, this ticket will finally be fully addressed.

> > how do I react if the caller gives me several non-Unicode
> > strings in different encodings?
>
> You cannot outsmart people who are so terminally confused. No
> one else tries to. The documentation should make explicit that
> this module accepts Text strings (in perlunitut jargon) and
> DTRT with them. If someone wants to ignore that, then just let
> him: garbage in, garbage out.

Yes. I understand that now.

> > How do I *detect* it?
>
> »Oh, that way madness lies; let me shun that.« The best
> heuristic unsurprisingly comes from NSUniversalDetector
> <http://www.mozilla.org/projects/intl/detectorsrc.html>,
> <http://search.cpan.org/dist/Encode-Detect/>, but IMO this has
> no place in X::A::SF.

Oh no. I was not trying to actually do something useful with those strings; I was just wondering if it was possible to detect such an error and throw an exception or something. But I have since learned that strings in Perl are completely untyped – that there isn’t even a distinction between text strings and octet strings (like a naïve understanding of the UTF8 flag suggests). So it is in fact entirely impossible to determine the semantics of a string by examining the string. Hence all I can do is as you say: document that the module expects text strings for input and produces an octet sequence as output.

> > people publishing in Asian scripts f.ex. will very likely
> > need to use UTF-16.
>
> No, my observation is that on the web, national encodings are
> the rule: GB18030 (often misdeclared as GB2312), Shift-JIS,
> Big5... Indic scripts standardised on UTF-8 due to their
> late-comer status.

Aha. Well, I was not making an observation, really. The fact is that XML parsers are not required to support any of those national encodings (not even Latin-1, I think); but they are required to support the various UTF variants, so in that sense UTF-16 is the conservative option. Anyway, this doesn’t actually matter for any of the points at hand. What matters is that I finally have a plan for how I want to proceed with the module.
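The point about untyped strings can be illustrated with a minimal sketch that uses only the built-in `utf8::upgrade` and `utf8::is_utf8` functions. The variable names are arbitrary; the example only shows that the UTF8 flag describes Perl's internal storage, not whether a string is text or octets.

```perl
use strict;
use warnings;

# One character with code point 0xE9, stored in Perl's single-byte
# internal representation.
my $native = "\xE9";

# The same one-character string, forced into Perl's internal UTF-8
# representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The flag differs, yet Perl considers the two strings equal.
print utf8::is_utf8($native)   ? "flag on\n" : "flag off\n";    # flag off
print utf8::is_utf8($upgraded) ? "flag on\n" : "flag off\n";    # flag on
print $native eq $upgraded     ? "equal\n"   : "different\n";   # equal

# Nothing here reveals whether "\xE9" was meant as the character é or as
# a raw octet, e.g. one byte of some encoded sequence. Only the caller
# knows, which is why the expected input has to be documented.
```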

Tue Sep 22 08:41:33 2015 ARISTOTLE [...] cpan.org - Correspondence added

It’s y’all’s lucky day: this is now *finally* fixed. Please find release 0.9000 on your local CPAN mirror once it’s there. It only took 10 years!

Tue Sep 22 08:41:40 2015 ARISTOTLE [...] cpan.org - Status changed from 'open' to 'resolved'
