Subject: Content body with newlines (and tabs, etc.) is treated as unprintable
Due to a change in the behaviour of \p{IsPrint} and \P{IsPrint} between
perl-5.10.x and perl-5.12.x, \p{IsPrint} no longer matches newlines,
tabs, etc. As a result, when the ::Content class is given a "Body =>"
parameter that contains them, the body is converted to Base64, which
makes it unusable. I don't have a ready self-contained test case, but I
will be able to write one tomorrow based on this XML-Feed bug report:
https://rt.cpan.org/Public/Bug/Display.html?id=44899
Here is the IRC conversation on #p5p about it:
{{{{{{{{{{{{
<rindolf> Hi all. ("\n" =~ m/\P{IsPrint}/) is false on perl-5.10.1
(Mandriva 2010.1) and true on perl-5.12.2 (Mandriva Cooker). Why was the
behaviour changed and what is the correct one? Thinking that newline is
unprintable breaks XML-Atom.
* Zefram recalls something about sticking strictly to the Unicode class
definitions for \p{}
<Zefram> see L<perl5120delta/Unicode overhaul>
<rafl> Zefram: someone doing it, and how "easy" it turned out to be :)
<Zefram> ah, right
<Zefram> I've got a list of others to do
<leont> Zefram++
<leont> I think this is going to be my favorite feature of 5.14
<Zefram> I found with this one that although it was very easy to do
parse_stmtseq it's rather more difficult to use it to parse a custom
type of block
<Zefram> block_start and block_end need to go into the API, but there's
some other lexer magic around braces too
<rafl> i'm not quite decided on what my favourite feature is going to
be, given that they're not even all written yet, but this is definitely
a big one :)
<vincent> that's pretty cool
* rafl will upgrade Digest-MD5 and then apply with the apitest move
<rindolf> Zefram: I don't see anything in perl5120delta there.
<Zefram> # "\p{Print}" no longer matches the line control characters:
Tab, LF, CR, FF, VT, and NEL. This brings it in line with standards and
the documentation.
<rindolf> Zefram: well, I do, but I don't know how it's called.
<rindolf> Zefram: ah.
<rindolf> Zefram: hmmm....
<rindolf> Zefram: so XML-Atom is buggy.
<Zefram> officially yes
<Zefram> if you want to include characters that Unicode doesn't regard
as printable, you'll need to do it explicitly
<Zefram> if you're including all of the characters that got removed from
\p{Print}, then the fixed code will work on both old and new Perl versions
<Zefram> but actually, in an XML context I suspect that you don't want
to defer to Unicode at all
<Zefram> more likely you actually want one of the character classes
defined in the XML spec
<rindolf> Zefram: thanks, I'll deal with it tomorrow.
<rindolf> It's getting late here.
}}}}}}}}}}}}
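The suggested solution is to explicitly include all of the characters
that were removed from \p{Print} (Tab, LF, CR, FF, VT, and NEL) in the
regex, so that it behaves the same on both old and new Perl versions.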
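
As a rough illustration of that suggestion, here is a minimal sketch of such a check. The helper name has_unprintable is hypothetical and this is not XML-Atom's actual code. It only assumes that the module decides whether to Base64-encode the body by looking for characters outside \p{IsPrint}:

```perl
use strict;
use warnings;

# Hypothetical helper (not XML-Atom's real code): returns 1 if $text
# contains a character that is neither printable according to
# \p{IsPrint} nor one of the line control characters that perl-5.12
# removed from \p{Print}: Tab, LF, VT, FF, CR and NEL (U+0085).
sub has_unprintable {
    my ($text) = @_;
    return $text =~ /[^\p{IsPrint}\t\n\x{0B}\f\r\x{85}]/ ? 1 : 0;
}

# A body with newlines and tabs should count as printable text,
# while a body with a NUL byte should not.
print has_unprintable("line one\n\tline two\r\n"), "\n";
print has_unprintable("binary\x00data"), "\n";
```

Listing the characters explicitly means the result should not depend on which definition of \p{Print} the running perl uses. As Zefram points out above, one of the character classes defined in the XML spec may be the better choice in the long run.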
Regards,
-- Shlomi Fish