Bug #68741 for podlators: Replacement of some characters with X

Fri Jun 10 10:31:16 2011 matt.lawrence [...] virgin.net - Ticket created

Subject: Replacement of some characters with X

I noticed that things like E<copy> and E<pound> are rendered as X by Pod::Man, though these are rendered as expected in HTML etc. The correct behaviour seems to occur with "pod2man --utf8", but I couldn't find an easy way of making perldoc pass that option on. The attached patch worked for me, it adds mappings to roff escapes for latin-1 characters 0xa0 to 0xbf, 0xd7 and 0xf7. Based on the groff_chars man page.

Subject: pod_man_escapes.patch

--- lib/Pod/Man.pm
+++ lib/Pod/Man.pm
@@ -1315,22 +1315,56 @@
 # This only works in an ASCII world. What to do in a non-ASCII world is very
 # unclear -- hopefully we can assume UTF-8 and just leave well enough alone.
 @ESCAPES{0xA0 .. 0xFF} = (
-    "\\ ", undef, undef, undef, undef, undef, undef, undef,
-    undef, undef, undef, undef, undef, "\\%", undef, undef,
-
-    undef, undef, undef, undef, undef, undef, undef, undef,
-    undef, undef, undef, undef, undef, undef, undef, undef,
-
+    # 0xa0
+    "\\ ",     # non-breaking space
+    "\\[r!]",  # inverted exclamation mark
+    "\\[ct]",  # cent
+    "\\[Po]",  # pound sterling
+    "\\[Cs]",  # currency symbol
+    "\\[Ye]",  # yen
+    "\\[bb]",  # broken bar
+    "\\[sc]",  # section
+    "\\[ad]",  # diaresis
+    "\\[co]",  # copyright
+    "\\[Of]",  # feminine ordinal indicator
+    "\\[Fo]",  # left guillemot
+    "\\[no]",  # logical not
+    "\\%",     # roff special
+    "\\[rg]",  # registered
+    "\\[a-]",  # macron
+
+    # 0xb0
+    "\\[de]",  # degree
+    "\\[+-]",  # plusminus
+    "\\[S2]",  # superscript 2
+    "\\[S3]",  # superscript 3
+    "\\[aa]",  # acute accent
+    "\\[mc]",  # micro sign
+    "\\[ps]",  # paragraph
+    "\\[pc]",  # centered period
+    "\\[ac]",  # cedilla accent
+    "\\[S1]",  # superscript 1
+    "\\[Om]",  # masculine ordinal indicator
+    "\\[Fc]",  # right guillemot
+    "\\[14]",  # one quarter
+    "\\[12]",  # one half
+    "\\[34]",  # three quarters
+    "\\[r?]",  # inverted question mark
+
+    # 0xc0
     "A\\*`", "A\\*'", "A\\*^", "A\\*~", "A\\*:", "A\\*o", "\\*(AE", "C\\*,",
     "E\\*`", "E\\*'", "E\\*^", "E\\*:", "I\\*`", "I\\*'", "I\\*^", "I\\*:",
 
-    "\\*(D-", "N\\*~", "O\\*`", "O\\*'", "O\\*^", "O\\*~", "O\\*:", undef,
+    # 0xd0
+    "\\*(D-", "N\\*~", "O\\*`", "O\\*'", "O\\*^", "O\\*~", "O\\*:", "\\[mu]",
     "O\\*/", "U\\*`", "U\\*'", "U\\*^", "U\\*:", "Y\\*'", "\\*(Th", "\\*8",
 
+    # 0xe0
     "a\\*`", "a\\*'", "a\\*^", "a\\*~", "a\\*:", "a\\*o", "\\*(ae", "c\\*,",
     "e\\*`", "e\\*'", "e\\*^", "e\\*:", "i\\*`", "i\\*'", "i\\*^", "i\\*:",
 
-    "\\*(d-", "n\\*~", "o\\*`", "o\\*'", "o\\*^", "o\\*~", "o\\*:", undef,
+    # 0xf0
+    "\\*(d-", "n\\*~", "o\\*`", "o\\*'", "o\\*^", "o\\*~", "o\\*:", "\\[di]",
     "o\\*/" , "u\\*`", "u\\*'", "u\\*^", "u\\*:", "y\\*'", "\\*(th", "y\\*:",
 ) if ASCII;
--- t/man.t
+++ t/man.t
@@ -226,11 +226,11 @@
 ###
 =head1 YEN
 
-It cost me E<165>12345! That should be an X.
+It cost me E<165>12345! That should not be an X.
 ###
 .SH "YEN"
 .IX Header "YEN"
-It cost me X12345! That should be an X.
+It cost me \[Ye]12345! That should not be an X.
 ###
 
 ###

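For readers following the patch, here is a minimal sketch of the lookup-with-fallback behaviour it changes: each non-ASCII character is replaced by its roff escape if the table has one, and by a literal X otherwise. This is an illustration only, not the code in Pod::Man. The names `%escapes` and `to_roff` are hypothetical, and the table is cut down to three entries, two of which use the groff-specific escapes from the patch.

```perl
use strict;
use warnings;

# Hypothetical, trimmed-down table (code point => roff escape). The two
# bracketed escapes are taken from the patch and are groff-specific.
my %escapes = (
    0xA0 => "\\ ",       # non-breaking space
    0xA5 => "\\[Ye]",    # yen
    0xA9 => "\\[co]",    # copyright
);

# Replace every non-ASCII character with its escape, or "X" if unmapped.
sub to_roff {
    my ($text, $table) = @_;
    $text =~ s{([^\x00-\x7F])}{ $table->{ ord($1) } // 'X' }ge;
    return $text;
}

# With a mapping for 0xA5 the yen sign survives as an escape.
print to_roff("It cost me \x{A5}12345!\n", \%escapes);  # It cost me \[Ye]12345!

# With no mapping it degrades to X, which is the behaviour reported here.
print to_roff("It cost me \x{A5}12345!\n", {});         # It cost me X12345!
```

The two calls mirror the before and after expectations in the t/man.t hunk of the patch.
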
Fri Jun 10 13:47:26 2011 rra [...] stanford.edu - Correspondence added

Subject:	Re: [rt.cpan.org #68741] Replacement of some characters with X
Date:	Fri, 10 Jun 2011 10:47:16 -0700
To:	bug-podlators [...] rt.cpan.org
From:	Russ Allbery <rra [...] stanford.edu>

"Matthew Lawrence via RT" <bug-podlators@rt.cpan.org> writes:

> I noticed that things like E<copy> and E<pound> are rendered as X by
> Pod::Man, though these are rendered as expected in HTML etc. The correct
> behaviour seems to occur with "pod2man --utf8", but I couldn't find an
> easy way of making perldoc pass that option on.

There's some discussion about changing the default to assume UTF-8 output under at least some circumstances. The current behavior is not a bug -- it's intentional, because old versions of *roff on some platforms will segfault and core dump when given 8-bit characters. pod2man has always produced maximally conservative output by default because the generated output is intended for distribution.

However, it looks like those platforms have mostly died out, and it's probably time to start doing something else. The question is: what else to do?

The problem with character sets is that you don't know which one to choose. We can blindly output UTF-8, but that means that if someone views the page in a locale that isn't UTF-8, they're going to get mangled garbage. (Of course, the X's are already mangled garbage, so this is probably not that much of a drawback.)

I'm currently leaning towards outputting UTF-8 by default, but I'm kicking around the idea of trying to use the user's locale.

> The attached patch worked for me, it adds mappings to roff escapes for
> latin-1 characters 0xa0 to 0xbf, 0xd7 and 0xf7. Based on the groff_chars
> man page.

This we definitely cannot do, since those escapes are groff-specific and Perl supports platforms other than Linux.

-- 
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
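
The locale-based idea floated in the reply above could look roughly like the following. This is a sketch under stated assumptions, not what podlators does: `output_mode_for` is a hypothetical helper, and the mode names 'utf8' and 'ascii' are placeholders for "emit UTF-8" and "keep the conservative ASCII output". The codeset lookup uses the core I18N::Langinfo module, which is not available on every platform, so it is wrapped in eval and the sketch falls back to the conservative choice.

```perl
use strict;
use warnings;

# Hypothetical helper: pick an output style from a locale codeset name.
# Anything that is not clearly UTF-8 gets the conservative ASCII output.
sub output_mode_for {
    my ($codeset) = @_;
    return 'utf8' if defined $codeset && $codeset =~ /\A utf -? 8 \z/xi;
    return 'ascii';
}

# Ask the system for the current codeset. If I18N::Langinfo is missing or
# fails, $codeset stays undef and the conservative mode is chosen.
my $codeset = eval {
    require I18N::Langinfo;
    I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
};

print output_mode_for($codeset), "\n";
```

Treating every unknown or non-UTF-8 codeset as 'ascii' keeps the old default for exactly the cases the reply worries about, where a page viewed in a non-UTF-8 locale would otherwise show mangled output.
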

Fri Jun 10 13:47:27 2011 The RT System itself - Status changed from 'new' to 'open'

Wed Jan 02 13:42:20 2013 RRA [...] cpan.org - Severity Important added

Wed Jan 02 13:42:20 2013 RRA [...] cpan.org - Broken in 1.00 added

Wed Jan 02 13:42:20 2013 RRA [...] cpan.org - Broken in 2.4.0 deleted