Subject: unprintable diagnostics
Date: Sun, 23 Jul 2017 10:39:46 +0100
To: bug-Test-Simple [...] rt.cpan.org
From: Zefram <zefram [...] fysh.org>
There's a group of related problems around the printability of
Test::More's diagnostics, which have a common cause and a common solution.
The attached patch implements the solution.
The most obvious of these problems is that, where a data string relevant
to a diagnostic contains a control character, that control character is
copied straight into the diagnostic. For example, given the test case
$ perl -MTest::More -e 'is "", "\a"; done_testing'
, one gets output that on a terminal looks like
not ok 1
#   Failed test at -e line 1.
#          got: ''
#     expected: ''
1..1
# Looks like you failed 1 test of 1.
and the terminal beeps. The visible diagnostic gives the impression
that the two strings being compared are identical, making the failure
nonsensical. Unfavourable test runs can easily output a great variety of
control characters: consider comparisons of byte strings resulting from
cryptographic algorithms. Not only beeps but also messed-up terminal
settings result.
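The two strings do of course differ, as can be confirmed outside Test::More (shown here only to illustrate the point, not as the representation the patch uses):
$ perl -e 'printf "lengths %d and %d, ord of expected %d\n", length(""), length("\a"), ord("\a")'
lengths 0 and 1, ord of expected 7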
Related trouble occurs with characters that, while notionally printable,
aren't portably printable. Consider the test case
$ perl -MTest::More -e 'is "\x{e9}", "\x{e2}\x{98}\x{83}"; is "\x{e9}", "\x{2603}"; done_testing'
. Firstly, in any case the output contains some C1 control characters,
but let's ignore that. By default, i.e., if there's no environment
setting to tell Perl to encode its output in UTF-8, then the output
shows differing `got' strings and identical `expected' strings, which is
the opposite of the truth. The reason for this is revealed by a "Wide
character in print" warning: the second diagnostic contains a literal
snowman character, which of course can't be sent to a byte stream, and in
a terrible decision dating from Perl 5.6, the core handles this by implicitly encoding
just that diagnostic in UTF-8. The practical upshot is that different
diagnostics in a single test script run are encoded inconsistently.
This warning always means there's a bug: it is a bug that Test::More
attempts to output an arbitrary Unicode character to a stream that it
doesn't know can accept non-bytes.
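The core behaviour in question is easy to see without Test::More at all:
$ perl -e 'print "\x{e9}\n"; print "\x{2603}\n"'
The first print emits the single byte 0xE9 and stays silent; the second warns "Wide character in print at -e line 1." and emits the UTF-8 bytes 0xE2 0x98 0x83, so the two lines reach the stream in different encodings.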
Even in a fully Unicode-capable environment, in the testing context
there are problems with displaying Unicode characters literally.
Supposing that the terminal expects UTF-8 and can fully render Unicode,
$ perl -MTest::More -e 'is "A", "\x{391}"; is "\x{e9}", "e\x{301}"; done_testing'
(optionally with environment settings for output encoding) produces two
diagnostics that show remarkably similar `got' and `expected' strings.
In the first case, Latin capital A versus Greek capital Alpha, these are different
graphemes, but will (intentionally) have identical appearance in
some fonts. In the second case, precomposed e-acute versus combining
sequence, both character sequences represent the same grapheme, and should
therefore appear identical in any correct rendering. In both cases,
rendering these printable Unicode character sequences impedes the user
in comprehending the differences between them, and hence damages the
usefulness of the diagnostic for debugging purposes.
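For comparison, the ordinals of the four strings above are trivially distinguishable (this is only an illustration; it is not the representation the patch produces):
$ perl -e 'printf "%vX\n", $_ for "A", "\x{391}", "\x{e9}", "e\x{301}"'
41
391
E9
65.301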
Furthermore, even where Unicode characters don't cause these problems,
they impede communication of the diagnostics to anyone else, by email
or other means. Sometimes they would get through correctly, but it's
common for encoding problems to arise along the way, and so even when
they actually do get through correctly the receiver can't rely on them
having done so.
The only characters that can be safely used in diagnostics are the
printable ASCII characters. The solution to all the above problems
is that all other characters in data strings should be described in
diagnostics by non-literal means. The attached patch borrows some logic
from Carp's stack trace code to represent data strings in Perl syntax,
using only printable ASCII.
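To give a flavour of the transformation (this is a simplified sketch of the idea, not the exact code in the patch), it amounts to something like:
sub display_string {
    my ($s) = @_;
    return "undef" unless defined $s;
    # escape characters that are special in Perl double-quoted syntax
    $s =~ s/(["\\\$\@])/\\$1/g;
    # represent everything outside printable ASCII as \x{...}
    $s =~ s/([^\x20-\x7e])/sprintf("\\x{%x}", ord $1)/ge;
    return qq{"$s"};
}
With that kind of escaping, the expected string in the first test case above would be shown as "\x{7}" rather than as a raw BEL character, and the snowman as "\x{2603}", in both cases using only printable ASCII.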
-zefram