> Attached documentation patch clarifies the findings about
> encoding, I reused some wording from above. Therein I also
> advise to delegate output transformation to specialised tools.
> Hopefully this is enough to close this bug.
Thanks. I’m not sure whether I want to apply this, though. Since
the time when this ticket was filed, I have learned quite a few
things and changed my position on others.
The plan is now to make the charset configurable. I have long
wanted to refactor the internals, as they currently produce
fragments of XML as you call methods, which are glued together
in the end. This approach makes the internals very inflexible.
As part of this, I have accepted the reality that there are no
good XML emitter modules on CPAN, and taken it upon myself to
write one, whose API closely follows HTML::Tiny.
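
To illustrate the difference between gluing fragments together and emitting from a structure, here is a minimal sketch. It is not SimpleFeed's code, nor the API of the planned emitter or of HTML::Tiny; `escape`, `render` and `serialize` are hypothetical names made up for this example. The document is held as nested array references and turned into octets in a single step at the end, which is what makes the charset a plain parameter.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Escape the characters that are special in XML text and attribute values.
sub escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical tree format: [ name, { attributes }, children... ],
# where a child is either another node or a plain text string.
sub render {
    my ($node) = @_;
    return escape($node) unless ref $node;    # plain text child
    my ($name, $attr, @children) = @$node;
    my $attrs = join '',
        map { sprintf ' %s="%s"', $_, escape($attr->{$_}) } sort keys %$attr;
    return "<$name$attrs/>" unless @children;
    return "<$name$attrs>" . join('', map { render($_) } @children) . "</$name>";
}

# Serialise once, encode once: the charset is just an argument.
sub serialize {
    my ($tree, $charset) = @_;
    my $xml = qq{<?xml version="1.0" encoding="$charset"?>\n} . render($tree);
    return encode($charset, $xml, Encode::FB_CROAK());    # die on unmappable characters
}

my $feed = [ feed => { xmlns => 'http://www.w3.org/2005/Atom' },
    [ title => {}, "Caf\x{e9} & more" ],
];
my $bytes = serialize($feed, 'UTF-8');
```

Because nothing is stringified until `serialize` runs, switching the output to another charset means changing one argument rather than touching every method that produces markup.
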
Once that is done, SimpleFeed is due for a complete (though
incremental) overhaul. And at that time, this ticket will finally
be fully addressed.
> > how do I react if the caller gives me several non-Unicode
> > strings in different encodings?
>
> You cannot outsmart people who are such terminally confused. No
> one else tries to. The documentation should make explicit that
> this module accepts Text strings (in perlunitut jargon) and
> DTRT with them. If someone wants to ignore that, then just let
> him: garbage in, garbage out.
Yes. I understand that now.
Oh no. I was not trying to actually do something useful with
those strings; I was just wondering if it was possible to detect
such an error and throw an exception or something. But I have
since learned that strings in Perl are completely untyped – there
isn’t even a distinction between text strings and octet strings,
contrary to what a naïve understanding of the UTF8 flag suggests.
So it is in fact entirely impossible to determine the semantics
of a string by examining the string. Hence all I can do is as you
say: document that the module expects text strings for input and
produces an octet sequence as output.
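
A small sketch of why inspection cannot work, and of the "text in, octets out" contract. It uses only core Perl and the Encode module; the variable names are made up for the example and nothing here is SimpleFeed code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two strings with identical content but different internal storage.
my $latin    = "caf\x{e9}";    # 4 characters, stored without the UTF8 flag
my $upgraded = $latin;
utf8::upgrade($upgraded);      # same 4 characters, now stored with the flag

# The flag differs, yet the strings compare equal: the flag describes
# storage, not whether the string is meant as text or as octets.
print utf8::is_utf8($latin)    ? "flag on\n" : "flag off\n";
print utf8::is_utf8($upgraded) ? "flag on\n" : "flag off\n";
print $latin eq $upgraded      ? "equal\n"   : "different\n";

# Hence the only workable contract sits at the boundaries:
# the caller decodes octets into text, the module encodes text into octets.
my $octets_in  = "caf\xc3\xa9";                   # UTF-8 bytes from outside
my $text       = decode('UTF-8', $octets_in);     # text string, 4 characters
my $octets_out = encode('UTF-16BE', $text);       # 2 octets per character here
printf "%d characters in, %d octets out\n", length($text), length($octets_out);
```

Note that `$octets_in`, handed over without decoding, would look like a perfectly valid six-character text string; there is nothing in the value itself that an exception could be hung on.
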
> > people publishing in Asian scripts f.ex. will very likely
> > need to use UTF-16.
>
> No, my observation is that on the web, national encodings are
> the rule: GB18030 (often misdeclared as GB2312), Shift-JIS,
> Big5... Indic scripts standardised on UTF-8 due to their
> late-comer status.
Aha. Well, I was not making an observation, really. The fact is
that XML parsers are not required to support any of those
national encodings (not even Latin-1, I think); but they are
required to support the various UTF variants, so in that sense
UTF-16 is the conservative option.
Anyway, this doesn’t actually matter for any of the points at
hand.
What matters is that I finally have a plan for how I want to
proceed with the module.