Bug #118508 for String-ShellQuote: \w should not be used to decide what character to quote

Subject:	\w should not be used to decide what character to quote
Date:	Tue, 25 Oct 2016 22:38:16 +0100
To:	bug-String-ShellQuote [...] rt.cpan.org
From:	Stephane Chazelas <stephane.chazelas [...] gmail.com>

$ echo 'vis-à-vis' | perl -MString::ShellQuote -lne 'print shell_quote $_'
'vis-à-vis'
$ echo 'vis-à-vis' | perl -C -MString::ShellQuote -lne 'print shell_quote $_'
vis-à-vis

Above, in the second case, with -C, that à (the U+00E0 character, encoded in UTF-8 as 0xc3 0xa0) is matched by \w, because \w then matches all Unicode letters in UTF-8 locales.

Leaving that à unquoted is dangerous because some shells, including bash and yash, parse their input in a locale-dependent way. In particular, they treat any character considered "blank" in the current locale as a token separator, just like space or tab (though in bash this does not currently apply to multi-byte characters).

The 0xa0 byte above is the non-breaking space character in the ISO-8859-1 character set, and on Solaris at least, in locales that use that character set, that character is considered a blank.

So if the "echo vis-à-vis" code (where that à is written as 0xc3 0xa0) ends up in a file that is then interpreted by bash on a Solaris system in an ISO-8859-1 locale, it won't be interpreted properly: the 0xa0 byte would be treated as a token separator.

It could be worse in some situations. For instance, ε (epsilon) is 0xa3 0x60 in the BIG5 encoding, and 0x60 is ` in ASCII/ISO-8859-1/UTF-8.

$ printf 'a\243`foo\243`\n' | LC_ALL=zh_HK.big5hkscs perl -Mopen=:locale -MString::ShellQuote -lne 'print shell_quote $_' | sed -n l
a\243`foo\243`$

Above, with -Mopen=:locale, shell_quote is used in a context where input and output are interpreted in the locale's character encoding. So that \243` is interpreted as the epsilon character, which is a /word/ character and so is not quoted.

However, even in that same zh_HK.big5hkscs locale, many shells (including dash, mksh, zsh) will fail to see that \243` as an epsilon character and will instead treat that ` as a backtick.

IMO, instead of using \w, we should match only the ASCII letters, digits and underscore. That could be done by using \w with the /a flag (needs perl 5.14 or above) or by just using [a-zA-Z0-9_] explicitly.

-- 
Stephane