Subject: unwarranted UTF-8 from ToMan
Date: Thu, 14 Dec 2017 05:28:09 +0000
To: bug-Pod-Perldoc [...] rt.cpan.org
From: Zefram <zefram [...] fysh.org>

Given a sufficiently recent groff, Pod::Perldoc::ToMan will pass it the
"-Tutf8" option, requesting UTF-8 output, which will then be passed to the
pager. This makes an unwarranted assumption that the user's environment
(particularly the terminal and the pager) will interpret bytes flung at
them as UTF-8. If that is not the type of environment in which perldoc
is being executed, then various kinds of visual failure will ensue.
For example, in my ASCII environment, my pager (less(1)) helpfully
renders non-ASCII-printable bytes in a highlighted manner that shows
their numeric value. The most common such bytes for ToMan to emit
are 0xc2 0xb7, the UTF-8 encoding of U+00B7 "middle dot", which groff
uses as a list item bullet if it's told the output device can accept it.
Minus the highlighting, which I can't show in this plain text message,
here is how a segment of perlretut(1) is rendered:
--------------------------------------------------------------------------------
<C2><B7> "\d" matches a digit, not just "[0-9]" but also digits from no
n-
roman scripts
<C2><B7> "\s" matches a whitespace character, the set "[\ \t\r\n\f]" an
d
others
<C2><B7> "\w" matches a word character (alphanumeric or '_'), not just
"[0-9a-zA-Z_]" but also digits and characters from non-roman
scripts
<C2><B7> "\D" is a negated "\d"; it represents any other character than
a
digit, or "[^\d]"
<C2><B7> "\S" is a negated "\s"; it represents any non-whitespace chara
cter
"[^\s]"
<C2><B7> "\W" is a negated "\w"; it represents any non-word character
"[^\w]"
<C2><B7> The period '.' matches any character but "\n" (unless the modi
fier
"/s" is in effect, as explained below).
<C2><B7> "\N", like the period, matches any character but "\n", but it
does
so regardless of whether the modifier "/s" is in effect.
--------------------------------------------------------------------------------
Observe that not only is the bullet ugly, but the eight-column rendering
of it means that many of these lines end up exceeding the 80 column width
of the terminal, so the pager wraps them onto a second screen line.
Thus words are broken up and the layout is very uneven, negating all
the work that groff put into justifying the text for the output width.
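The width arithmetic behind that wrapping can be sketched as follows. This is an illustration in Python, not anything taken from less(1) or groff: the helper names are hypothetical, and the rendering rule is simplified to "ASCII bytes pass through, every byte from 0x80 up is shown as <XX>".

```python
def less_style(data: bytes) -> str:
    """Approximate how a pager in an ASCII environment displays raw bytes.

    Simplifying assumption: bytes below 0x80 are shown unchanged and
    every byte from 0x80 up is shown as <XX> in uppercase hex.
    """
    return "".join(chr(b) if b < 0x80 else "<%02X>" % b for b in data)


def displayed_width(text: str) -> int:
    """Screen columns used once the UTF-8 bytes of text are rendered."""
    return len(less_style(text.encode("utf-8")))


# groff counts the middle dot as one column, but it reaches the pager as
# two bytes, each of which is displayed in four columns.
bullet = "\u00b7"
print(less_style(bullet.encode("utf-8")))   # <C2><B7>

# A hypothetical line that groff filled to 76 columns, bullet included.
line = bullet + " " + "x" * 74
print(len(line), displayed_width(line))     # 76 83
```

Each bullet therefore adds seven columns that groff never accounted for, so any bulleted line that groff filled to more than 73 columns overflows an 80 column terminal.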
For comparison, here's the same documentation excerpt rendered in the
same environment but using "-Tascii":
--------------------------------------------------------------------------------
o "\d" matches a digit, not just "[0-9]" but also digits from non-
roman scripts
o "\s" matches a whitespace character, the set "[\ \t\r\n\f]" and
others
o "\w" matches a word character (alphanumeric or '_'), not just
"[0-9a-zA-Z_]" but also digits and characters from non-roman
scripts
o "\D" is a negated "\d"; it represents any other character than a
digit, or "[^\d]"
o "\S" is a negated "\s"; it represents any non-whitespace character
"[^\s]"
o "\W" is a negated "\w"; it represents any non-word character
"[^\w]"
o The period '.' matches any character but "\n" (unless the modifier
"/s" is in effect, as explained below).
o "\N", like the period, matches any character but "\n", but it does
so regardless of whether the modifier "/s" is in effect.
--------------------------------------------------------------------------------
As you can see, when it's clued in that the output device can't accept
middle dot, groff uses an appropriate ASCII character as its list
item bullet.
The bad rendering that I've shown here is on the mild end of the spectrum.
Pagers that aren't so diligent about rendering unprintable data will
send these bytes to the terminal, where they can have much more exciting
effects.
Pod::Perldoc::ToMan should not tell groff to produce UTF-8 output
without being reasonably confident that the output environment actually
expects UTF-8. The locale settings in environment variables would be
a good place to look for evidence of what character encoding is expected.
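A minimal sketch of such a check, written in Python for illustration only: it is not Pod::Perldoc::ToMan's code, and the function names are hypothetical. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset) and the usual language[_territory][.codeset][@modifier] locale name shape, and it falls back to the "ascii" device whenever the codeset is not positively recognised.

```python
import re


def locale_codeset(env):
    """Return the codeset part of the effective LC_CTYPE locale name.

    Returns "" if the name carries no codeset (for example "C").
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:  # an empty value counts as unset
            match = re.match(r"^[^.@]*\.([^@]+)", value)
            return match.group(1) if match else ""
    return ""


def groff_device(env):
    """Pick a groff -T device name that the environment can plausibly display."""
    # Normalise spellings such as "UTF-8", "utf8" and "ISO-8859-1".
    codeset = re.sub(r"[^a-z0-9]", "", locale_codeset(env).lower())
    if codeset == "utf8":
        return "utf8"
    if codeset == "iso88591":
        return "latin1"
    return "ascii"  # conservative default when nothing says otherwise


print(groff_device({"LANG": "en_GB.UTF-8"}))                   # utf8
print(groff_device({"LANG": "C"}))                             # ascii
print(groff_device({"LC_ALL": "C", "LANG": "en_GB.UTF-8"}))    # ascii
```

Called with the process environment (os.environ in Python), this would select "-Tascii" in an environment like the one described above. A real implementation might instead ask the C library for the codeset, but the conservative default is the important part: emit UTF-8 only when the environment says so.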
-zefram