Subject: \w should not be used to decide what character to quote
Date: Tue, 25 Oct 2016 22:38:16 +0100
To: bug-String-ShellQuote [...] rt.cpan.org
From: Stephane Chazelas <stephane.chazelas [...] gmail.com>
$ echo 'vis-à-vis' | perl -MString::ShellQuote -lne 'print shell_quote $_'
'vis-à-vis'
$ echo 'vis-à-vis' | perl -C -MString::ShellQuote -lne 'print shell_quote $_'
vis-à-vis
Above, in the second case (with -C), that à (the U+00E0
character, encoded in UTF-8 as 0xc3 0xa0) is matched by \w,
because \w then matches all Unicode letters in UTF-8 locales.
Leaving that à unquoted is dangerous because some shells,
including bash and yash, have their parsing dependent on the
locale. In particular, they treat any character considered as
"blank" in the current locale as a token separator just like
space or tab (though this does not currently work for multi-byte
characters in bash).
The 0xa0 byte above is the non-breaking-space character in the
ISO-8859-1 character set and, on Solaris at least, in locales
that use that character set, that character happens to be
considered a blank.
So if the "echo vis-à-vis" code (where that à is written as 0xc3
0xa0) ends up in a file that is interpreted by bash on a Solaris
system in an ISO-8859-1 locale, it won't be interpreted properly:
the 0xa0 byte splits the word in two, so echo receives two
arguments instead of one.
It could be worse in some situations. For instance ε (epsilon)
is 0xa3 0x60 in the BIG5 encoding and 0x60 happens to be ` in
ASCII/ISO-8859-1/UTF-8.
$ printf 'a\243`foo\243`\n' | LC_ALL=zh_HK.big5hkscs perl -Mopen=:locale -MString::ShellQuote -lne 'print shell_quote $_' | sed -n l
a\243`foo\243`$
Above, with -Mopen=:locale, shell_quote is used in a context
where input and output are interpreted in the locale's character
encoding. So that \243` is interpreted as the epsilon character,
which happens to be a /word/ character and so is not quoted.
However, even in that same zh_HK.big5hkscs locale, many shells
(including dash, mksh, zsh) will fail to see that \243` as an
epsilon character and will instead treat that ` as a backtick.
IMO, instead of using \w, we should match only the ASCII
letters, digits and underscore. That could be done by using \w
with the /a flag (needs perl 5.14 or above) or by just using
[a-zA-Z0-9_] explicitly.
--
Stephane