Bug #101536 for Pod-Markdown: Pod::Markdown doesn't translate E<copy> correctly

Wed Jan 14 07:52:06 2015 david [...] weintraub.name - Ticket created

Subject:	Pod::Markdown doesn't translate E<copy> correctly
Date:	Wed, 14 Jan 2015 07:51:56 -0500
To:	bug-Pod-Markdown [...] rt.cpan.org
From:	David Weintraub <david [...] weintraub.name>

Version: Pod::Markdown 2.002
Perl Version: 5.18.1
Operating System: Mac OS X 10.10 (Yosemite)

Description: When parsing code, Pod::Markdown doesn’t correctly translate E<copy>. This becomes 0xA9 rather than © as done in Pod::Html. This may be an error in how Pod::Simple handles escapes between 0x80 and 0xFF.

Wed Jan 21 09:32:29 2015 RWSTAUNER [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #101536] Pod::Markdown doesn't translate E<copy> correctly
Date:	Wed, 21 Jan 2015 07:31:57 -0700
To:	bug-Pod-Markdown [...] rt.cpan.org
From:	Randy Stauner <rwstauner [...] cpan.org>

Pod::Simple transparently decodes E<> sequences into Unicode characters. Pod::Html subclasses Pod::Simple::XHTML, which passes text sequences through HTML::Entities. By default HTML::Entities encodes control chars, high-bit chars, and '<', '&', '>', ''' and '"'.

I'd be happy to make it an option to pass text sequences through HTML::Entities, but I'm not sure what the default should be. I, for one, am perfectly happy encoding the files in UTF-8 and embedding the Unicode characters. HTML-encoding any printable ASCII characters seems excessive in Markdown (it detracts from its simplicity).

So in my opinion the best default would be to skip printable ASCII and encode other characters: [^\n\r\t\x20-\x7e]

I'd still prefer to make this an opt-in, so I'm considering an option to encode any explicitly specified characters, plus a shortcut that would expand to the above list.

What do you think? ...
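The character class proposed above can be tried out with a minimal sketch. This is not code from Pod::Markdown; `encode_non_ascii` is a hypothetical helper written only for illustration, and it emits numeric entities rather than named ones such as `&copy;`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pod::Markdown): replace every character
# outside tab, newline, carriage return, and printable ASCII with a numeric
# HTML entity, leaving printable ASCII such as '<' and '&' untouched.
sub encode_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\n\r\t\x20-\x7e])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# U+00A9 is the character that E<copy> decodes to.
my $out = encode_non_ascii("Copyright \x{A9} 2015 <me>\n");
print $out;
```

HTML::Entities' `encode_entities` also accepts a second argument naming the characters to encode, so passing a negated set like the one above would be another way to get this behaviour with named entities.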

Wed Jan 21 09:32:29 2015 The RT System itself - Status changed from 'new' to 'open'

Wed Jan 21 11:44:23 2015 david [...] weintraub.name - Correspondence added

Subject:	Re: [rt.cpan.org #101536] Pod::Markdown doesn't translate E<copy> correctly
Date:	Wed, 21 Jan 2015 11:43:28 -0500
To:	bug-Pod-Markdown [...] rt.cpan.org
From:	David Weintraub <david [...] weintraub.name>

It gets pretty complex:

Pod::Man: Always encodes both © and E<copy> as “X”.
Pod::Text, with an encoding declared in my POD: Both © and E<copy> are translated correctly.
Pod::Text, without an encoding declared in my POD: Both © and E<copy> are translated as 0xA9.
Pod::Html: Always encodes either © or E<copy> correctly as ©, even if I don’t set the encoding.
Pod::Markdown: Always encodes either © or E<copy> as 0xA9, whether or not an encoding is declared in my POD.

I understand that the whole purpose of Markdown is to be readable even if it isn’t displayed as a formatted document. HTML entities certainly don’t help.

What about this:

Lower ASCII printable characters (0x20 through 0x7E) are always translated correctly.
If I declare an encoding scheme, Pod::Markdown should translate the characters like Pod::Text does.
If I don’t declare an encoding scheme, and I use E<xxx> in my POD, all characters not in the range 0x20 to 0x7E should be converted into HTML entities.

I’d be happy if this required a command line option.

--
David Weintraub
qazwart@gmail.com
perl -e 'print "Just another second rate Perl Hacker\n";'


Sat Aug 15 12:01:47 2015 RWSTAUNER [...] cpan.org - Correspondence added

Sorry for the delay, but I finally got the time to finish this.

I was confused originally because I didn't realize we were talking about the bin scripts (pod2man, pod2markdown, etc). I was thinking about the Pod::Simple API, which expects octets, returns a character string, and leaves it up to the caller to encode the result. Obviously that's not the case for the bin script, where we need to be outputting bytes.

So, pod2markdown never specified any sort of encoding; it was just dumping the characters. So if the only high-bit characters in the output stream are < 256, perl would print native bytes. From what I can tell this is actually historical functionality... The original version was based on Pod::Parser and trustingly converted `E<x>` to `&x;`, but literal high-bit chars passed through directly.

I looked at the modules you mentioned to see how they do what they do:

Pod::Man's pod2man accepts a -u option which will enable UTF-8 characters; otherwise it tries for maximum compatibility with unknown *roff implementations and just outputs an X.

Pod::Text has embedded logic to try to guess the best output encoding if it wasn't specified. It will match the input encoding of the pod (if one was specified) or use UTF-8 if that was specified as an argument.

Pod::Html uses Pod::Simple::XHTML, which passes all text through HTML::Entities by default (so `E<copy>` becomes U+00A9, which is then converted to `&copy;`).

Without an `=encoding` specified, Pod::Simple (on which all of the aforementioned modules are currently based) will guess the encoding when it sees a high-bit character (so \xa9 is guessed to be CP1252 and \xc2\xa9 is guessed to be UTF-8).

So, I've added some options to handle encoding for pod2markdown. You can specify that it should use the same encoding as the input pod (like Pod::Text). You can specify the desired output encoding explicitly. If you do neither, it will output UTF-8 by default.

I considered making ASCII (with non-ASCII converted to HTML entities) the default; however, it's not 100% safe. So I opted for simplicity and consistency and defaulted to UTF-8, while allowing options for alternate configurations.
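The difference between a character string and the bytes that finally get written can be seen in a small sketch using the core Encode module. It only illustrates the byte forms discussed above and is not taken from pod2markdown; the sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: U+00A9 (what E<copy> decodes to) followed by ASCII.
my $chars = "\x{A9} 2015";

# Explicitly encode the characters into bytes for output.
my $utf8   = encode('UTF-8', $chars);        # U+00A9 becomes the two bytes C2 A9
my $latin1 = encode('ISO-8859-1', $chars);   # U+00A9 becomes the single byte A9

# Show each byte string as hex so the difference is visible.
printf "UTF-8:   %s\n", unpack('H*', $utf8);
printf "Latin-1: %s\n", unpack('H*', $latin1);
```

The single-byte form is the 0xA9 from the original report, and the two-byte form is what a UTF-8 default produces.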

Sun Aug 16 18:19:02 2015 RWSTAUNER [...] cpan.org - Status changed from 'open' to 'resolved'