Bug #57466 for Template-Toolkit: Patch to support \x{263A} notation in double quoted strings

Thu May 13 13:15:12 2010 dmuey [...] cpan.org - Ticket created

Subject:

Patch to support \x{263A} notation in double quoted strings

Hello,

In perl I can do (w/ "Wide character" warnings typically, but that isn't really relevant at this point):

    print "Smile \x{263A}\n";

and will get

    Smile ☺

(hopefully rt doesn't corrupt the smiley face that should follow 'Smile')

It'd be great to be able to do the same with double quoted strings in TT (without having to override it and set PARSER):

    [% "Smile \x{263A}" %]

The attached patch makes that work, without "Wide character" warnings, all the way back to perl 5.6.

Thanks, TT rocks!

--
Dan Muey

[ -- Example -- ]

Same results on 5.8.9 and 5.6.2

[ -- Before patch -- ]

    hal9000$ perl -Mstrict -MTemplate -wle 'my $tt=Template->new;$tt->process(\qq([% "$ARGV[0]" %]));' 'Smile \x{263A}'
    Smile x{263A}
    hal9000$

[ -- After patch -- ]

    hal9000$ perl -Mstrict -MTemplate -wle 'my $tt=Template->new;$tt->process(\qq([% "$ARGV[0]" %]));' 'Smile \x{263A}'
    Smile ☺
    hal9000$

(hopefully rt doesn't corrupt the smiley face that should follow 'Smile')

Subject:

add_slash_x_hex_support_to_TT_interpolation_of_double_quote_context.patch

--- Parser.pm.orig 2010-05-13 09:10:48.000000000 -0500
+++ Parser.pm 2010-05-13 12:08:40.000000000 -0500
@@ -513,6 +513,7 @@
 # is set and undef is returned.
 #------------------------------------------------------------------------
 
+my $has_encode; # necessary for tokenise_directive() to handle \x{263A} notation
 sub tokenise_directive {
     my ($self, $text, $line) = @_;
     my ($token, $uctoken, $type, $lookup);
@@ -568,6 +569,8 @@
         # quoted string
         if (defined ($token = $3)) {
             # double-quoted string may include $variable references
+            $self->{'ENCODE_HEX_AS'} ||= 'utf-8'; # necessary for tokenise_directive() to handle \x{263A} notation
+
             if ($2 eq '"') {
                 if ($token =~ /[\$\\]/) {
                     $type = 'QUOTED';
@@ -576,6 +579,21 @@
                     # as a variable reference
                     # $token =~ s/\\([\\"])/$1/g;
                     for ($token) {
+                        # necessary for tokenise_directive() to handle \x{263A} notation
+                        # only do regex once, only do require once (i.e. if Encode is not available why try to require it for each token?)
+                        if (!defined $has_encode && m/\\x\{[0-9a-f]{1,4}\}/i) {
+                            $has_encode = 0;
+                            eval { require Encode; $has_encode = 1; };
+                        }
+                        if ( $has_encode ) { # perl 5.007003 and up
+                            s{\\x\{([0-9a-f]{1,4})\}}{Encode::encode($self->{'ENCODE_HEX_AS'}, chr(hex("$1")))}egi;
+                        }
+                        else { # i.e. perl 5.6
+                            # It could be argued that this only needs done when hex($_) < 0x100 but it works so leave it like this for consistency and in case it is needed under specific circumstances
+                            s{\\x\{([0-9a-f]{1,4})\}}{eval '"\x{'."$1".'}"'}egi;
+                        }
+                        # /necessary for tokenise_directive() to handle \x{263A} notation
+
                         s/\\([^\$nrt])/$1/g;
                         s/\\([nrt])/$QUOTED_ESCAPES->{ $1 }/ge;
                     }
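For anyone who wants to try the core substitution outside of Template::Parser, here is a minimal standalone sketch of what the Encode branch of the patch does. The helper name expand_hex_escapes is hypothetical and not part of TT; it assumes Encode is available (perl 5.7.3 and up) and uses the same 'utf-8' default the patch gives ENCODE_HEX_AS.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Template::Parser): replace each
# \x{H} .. \x{HHHH} escape with the encoded bytes of that code point,
# mirroring the Encode branch of the patch.
sub expand_hex_escapes {
    my ($text, $encoding) = @_;
    $encoding ||= 'utf-8';    # same default the patch uses for ENCODE_HEX_AS
    $text =~ s{\\x\{([0-9a-f]{1,4})\}}{Encode::encode($encoding, chr(hex($1)))}egi;
    return $text;
}

# Single quotes keep the backslash literal, as it would be in template source.
my $out = expand_hex_escapes('Smile \x{263A}');

# Print the result as hex bytes so the output survives any terminal encoding.
print join(' ', map { sprintf '%02x', ord($_) } split //, $out), "\n";
# prints: 53 6d 69 6c 65 20 e2 98 ba
```

The last three bytes (e2 98 ba) are the UTF-8 encoding of U+263A, which is why the patched output prints without "Wide character" warnings: the string holds encoded bytes rather than a wide character.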

Thu May 13 13:48:45 2010 dmuey [...] cpan.org - Correspondence added

Of note, the 'ENCODE_HEX_AS' key might be in the wrong object or wrong part of the object. It needs to be settable like other “Template Style and Parsing Options”. If it needs to be done elsewhere/differently, let me know and I'll address it, thanks!

Thu May 13 13:48:47 2010 dmuey [...] cpan.org - Status changed from 'new' to 'open'

Thu May 20 14:49:00 2010 dmuey [...] cpan.org - Correspondence added

2 more items I can address if you all decide to incorporate this patch:

#1 has to be done
#2 could be dropped as a "it's not meant to be 100% the same as perl" type thing.

1) handle characters represented by multiple \x{} sequences:
   \x{e3}\x{80}\x{b9} == 〹

2) Support '\x without brackets' notation (also multiple \xNN sequences):
   \xC2\xAE == ®
   \xE2\x80\x9Chowdy\xE2\x80\x9D == “howdy”

Fri May 21 12:58:43 2010 dmuey [...] cpan.org - Correspondence added

Actually the patch is good as-is I think. I believe documentation would suffice. The reasons I make that statement appear inline below:

On Thu May 20 14:49:00 2010, DMUEY wrote:

> 2 more items I can address if you all decide to incorporate this patch:
>
> #1 has to be done #2 could be dropped as a "it's not meant to be 100%
> the same as perl" type thing.

On second thought, the patch currently is very simple, and supporting all of that would make the code much more complex (i.e. slower). Keeping it simple also makes the template source less ambiguous and more readable. I'd say it's fine as long as the docs say something like this:

Hexadecimal interpolation support is intended to be simple, not comprehensive. To keep it as simple as possible and avoid ambiguity, it only supports the single-slash-x-bracketed-notation-that-stands-for-a-single-character format.

For example, if you want chr(12345) (i.e. 〹) you can do \x{3039} but you can't do \x{e3}\x{80}\x{b9}.

Also note it does not support non-bracketed \x notation. For example, you can't do "\xC2\xAE" for "®" (you'd use "\x{00AE}") or "\xE2\x80\x9Chowdy\xE2\x80\x9D" for "“howdy”" (you'd use "\x{201C}howdy\x{201D}").
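To make the relationship between the two notations concrete, here is a small sketch in plain perl (outside TT) showing that each supported code-point form encodes to exactly the byte sequence that the unsupported forms spell out. The hex_dump helper is hypothetical and exists only for display; the sketch assumes Encode is available.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical display helper: space-separated hex dump of a byte string.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Supported form: one bracketed escape per character (code point).
my $char = "\x{3039}";    # same as chr(12345)

# Its UTF-8 encoding is what the unsupported \x{e3}\x{80}\x{b9} form spells out.
print hex_dump(Encode::encode('utf-8', $char)), "\n";         # prints: e3 80 b9

# Likewise \x{00AE} encodes to the bytes written as \xC2\xAE.
print hex_dump(Encode::encode('utf-8', "\x{00AE}")), "\n";    # prints: c2 ae

# And \x{201C} / \x{201D} encode to e2 80 9c / e2 80 9d.
print hex_dump(Encode::encode('utf-8', "\x{201C}\x{201D}")), "\n";
# prints: e2 80 9c e2 80 9d
```

So the proposed wording loses no characters: anything reachable with multiple byte escapes can be written as a single code-point escape, provided the code point fits in the four hex digits the patch's regex accepts.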

Wed Oct 19 15:14:59 2011 dmuey [...] cpan.org - Correspondence added

I'll just stick with byte-string-character-itself instead of this, thanks

Wed Oct 19 15:15:00 2011 dmuey [...] cpan.org - Status changed from 'open' to 'rejected'