Bug #123878 for Pod-Perldoc: unwarranted UTF-8 from ToMan

Subject:	unwarranted UTF-8 from ToMan
Date:	Thu, 14 Dec 2017 05:28:09 +0000
To:	bug-Pod-Perldoc [...] rt.cpan.org
From:	Zefram <zefram [...] fysh.org>

Given a sufficiently recent groff, Pod::Perldoc::ToMan will pass it the "-Tutf8" option, requesting UTF-8 output, which will then be passed to the pager. This makes an unwarranted assumption that the user's environment (particularly the terminal and the pager) will interpret bytes flung at them as UTF-8. If that is not the type of environment in which perldoc is being executed, then various kinds of visual failure will ensue.

For example, in my ASCII environment, my pager (less(1)) helpfully renders non-ASCII-printable bytes in a highlighted manner that shows their numeric value. The most common such bytes for ToMan to emit are 0xc2 and 0xb7, representing U+b7 "middle dot", which groff uses as a list item bullet if it's told the output device can accept it. Minus the highlighting, which I can't show in this plain text message, here is how a segment of perlretut(1) is rendered:

--------------------------------------------------------------------------------
<C2><B7> "\d" matches a digit, not just "[0-9]" but also digits from no n- roman scripts
<C2><B7> "\s" matches a whitespace character, the set "[\ \t\r\n\f]" an d others
<C2><B7> "\w" matches a word character (alphanumeric or '_'), not just "[0-9a-zA-Z_]" but also digits and characters from non-roman scripts
<C2><B7> "\D" is a negated "\d"; it represents any other character than a digit, or "[^\d]"
<C2><B7> "\S" is a negated "\s"; it represents any non-whitespace chara cter "[^\s]"
<C2><B7> "\W" is a negated "\w"; it represents any non-word character "[^\w]"
<C2><B7> The period '.' matches any character but "\n" (unless the modi fier "/s" is in effect, as explained below).
<C2><B7> "\N", like the period, matches any character but "\n", but it does so regardless of whether the modifier "/s" is in effect.
--------------------------------------------------------------------------------

Observe that not only is the bullet ugly, but the eight-column rendering of it means that many of these lines end up exceeding the 80 column width of the terminal, so the pager wraps them onto a second screen line. Thus words are broken up and the layout is very uneven, negating all the work that groff put into justifying the text for the output width. For comparison, here's the same documentation excerpt rendered in the same environment but using "-Tascii":

--------------------------------------------------------------------------------
o "\d" matches a digit, not just "[0-9]" but also digits from non- roman scripts
o "\s" matches a whitespace character, the set "[\ \t\r\n\f]" and others
o "\w" matches a word character (alphanumeric or '_'), not just "[0-9a-zA-Z_]" but also digits and characters from non-roman scripts
o "\D" is a negated "\d"; it represents any other character than a digit, or "[^\d]"
o "\S" is a negated "\s"; it represents any non-whitespace character "[^\s]"
o "\W" is a negated "\w"; it represents any non-word character "[^\w]"
o The period '.' matches any character but "\n" (unless the modifier "/s" is in effect, as explained below).
o "\N", like the period, matches any character but "\n", but it does so regardless of whether the modifier "/s" is in effect.
--------------------------------------------------------------------------------

As you can see, when it's clued in that the output device can't accept middle dot, groff uses an appropriate ASCII character as its list item bullet.

The bad rendering that I've shown here is on the mild end of the spectrum. Pagers that aren't so diligent about rendering unprintable data will send these bytes to the terminal, where they can have much more exciting effects.

Pod::Perldoc::ToMan should not tell groff to produce UTF-8 output without being reasonably confident that the output environment actually expects UTF-8.
The locale settings in environment variables would be a good place to look for evidence of what character encoding is expected.

-zefram
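
The locale check suggested above could look roughly like the following. This is only a sketch of the decision logic, written in Python for illustration rather than as a patch to the Perl module. The function names locale_charset and groff_device are hypothetical, and falling back to "-Tascii" whenever no UTF-8 codeset is named is an assumption (a conservative one), not something Pod::Perldoc::ToMan is known to do.

```python
import os


def locale_charset(env=None):
    """Return the codeset named by the locale variables, or None.

    Hypothetical helper: follows the usual POSIX precedence, where
    LC_ALL overrides LC_CTYPE, which overrides LANG. Empty values are
    treated as unset.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Locale names look like language[_territory][.codeset][@modifier]
            if "." not in value:
                return None  # e.g. "C" or "en_GB": no explicit codeset
            codeset = value.split(".", 1)[1].split("@", 1)[0]
            return codeset or None
    return None


def groff_device(env=None):
    """Pick a groff -T option from the locale (assumed policy).

    Only request UTF-8 output when the locale explicitly names a UTF-8
    codeset; otherwise stay with plain ASCII output.
    """
    codeset = locale_charset(env)
    if codeset and codeset.lower().replace("-", "") == "utf8":
        return "-Tutf8"
    return "-Tascii"


if __name__ == "__main__":
    print(groff_device())
```

With LANG=en_GB.UTF-8 this selects "-Tutf8"; with LC_ALL=C, or with no locale variables set at all, it selects "-Tascii", which is the behaviour wanted in an ASCII environment like the one described above.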