Bug #49607 for JE: Problems w/ string literal scanning

Thu Sep 10 18:23:35 2009 kevin [...] cameron.com - Ticket created

Subject:

Problems w/ string literal scanning

I'm using JE::Parser version "0.035"; Problem #1: I'm trying to use the WWW::Scripter module with WWW::Scripter::Plugin::JavaScript to automtically parse a website. The plugin uses JE. The website gives me a very long string literal (>50000 chars) to supply data for a graph in the javascript section. This causes JE::Parser::str() to fail with this internal perl bug: Complex regular subexpression recursion limit (32766) exceeded I traced the failure to this regex: /\G (?: '((?>(?:[^'\\] | \\.)*))' | "((?>(?:[^"\\] | \\.)*))" )/xcgs or return; I know nothing about the internal structure of perl's regex handler but I don't see why this regex should cause a deep recursion. It seems to me that a reasonable implementation whould only recurse to a level of maybe 5-10. Nonetheless, after trying many different versions of the regex, I couldn't find an alternative that produced the same result but didn't hit the recursion limit. As a workaround, I've created a version of the rexex that doesn't hit the recusion limit but requires more perl code to parse strings: # Use a simpler pattern (but more code) to break strings up into extents bounded by the quote or escape my $yarn=""; my $qt = substr($_,pos($_),1); $qt =~ /['"]/ or return; # not a string literal if first char not a quote pos($_)++; my $done = 0; while (defined(substr($_,pos($_),1))) { my ($part) = /\G([^\\$qt]*)/xcgs; defined($part) or $part = ""; $yarn .= $part; my $next = substr($_,pos($_)++,1); if ($next eq "\\") { #pass on any escaped char $next = substr($_,pos($_)++,1); $yarn .= "\\$next"; } else { # handle end quote $done = 1; last; } } Problem #2: The latest ECMA language reference (Edition 5) allows line breaks in the source only when escaped with '\'. I've confirmed that at least the Firefox javascript parser follows this rule. JE::Parser::str was allowing linefeeds anywhere between quotes. I fixed str() to disallow unescaped line feeds by default. Problem #3: Per the ECMA spec, the usage of "\" at the end of a line within a string literal is intended as a continuation of the string. For example, the following assertion should be true: "AB" === "A\ B"; JE::Parser::str was retaining the line feed after "\" which would render the above assertion false. I've changed the code to remove the escape and line feed. It now also removes escaped DOS-stype line breaks "\\\r\n". Problem #4: There is one test file "t/10.02-entering-execution-contexts.t" which assumes the illegal inclusion of unescaped line feeds mentioned in Problem #2. Rather than fix the test, I figured that there may be some other environments in the world that need it. Therefore, I added a variable "$JE::Parser::allow_unescaped_lf" which can be set if desired. I'm not sure this is the best solution. Perhaps it should be set in some kind of "set_options" call or maybe it shouldn't be allowed at all. "t/10.02-entering-execution-contexts.t" passes with this flag set. I've created a new test file "t/parse00.1-long_strings.t" to test the above issues. I've included the modified files in hopes they can be integrated into the next release. Thanks for your attention to this. Kevin Cameron San Jose, CA

Subject:

Parser.pm

Message body is not shown because it is too large.

Subject:

10.02-entering-execution-contexts.t

#!perl -T # WARNING: This is a brilliant example of how not to write code. (Though # I must admit that the sextuple backslash was fun.) BEGIN { require './t/test.pl' } use Test::More tests => 30; use Scalar::Util 'refaddr'; use strict; use utf8; # Test 1: See if the module loads BEGIN { use_ok('JE') }; my $j = new JE; # allow non-spec compliant line feeds in string literals below $JE::Parser::allow_unescaped_lf = 1; # Tests 2-3: Bind the ok and diag functions isa_ok( $j->new_function( ok => \&ok ), 'JE::Object::Function' ); isa_ok( $j->new_function( diag => \&diag ), 'JE::Object::Function' ); # Test 4: Bind Perl's eval to JS so we can do naughty stuff. isa_ok( $j->new_function( peval => sub { eval shift } ), 'JE::Object::Function' ); # Run JS tests defined $j->eval( <<'--end--' ) or die; // --------------------------------------------------- /* Tests 5-9: global code */ // These line breaks would cause a syntax error according to spec. // JE supports line breaks in string literals. Don't tell anyone. peval(' my @scope = @$JE::Code::scope; # Don\'t do this at # home--I mean in pro- # duction code! These # internal details are sub- # ject to change. ok @scope == 1, "scope chain in global code contains one object"; ok refaddr $scope[0] == refaddr $j, "that object is the global object"; '); var thing; ok(this.hasOwnProperty('thing'), 'global object is the variable object') ok(!delete thing, 'vars declared in global code are undeletable') peval('ok refaddr shift @_ == refaddr $j, "the \'this\' value is the global object"', this) // --------------------------------------------------- /* Tests 10-17: function code */ var obj = Object(); with(obj) { obj.f = function(){ peval(' my @scope = @$JE::Code::scope; my @fscope = @{$${$j->eval("obj.f")->get}{scope}}; ok @scope == 3, q/@scope == 3/; ok @fscope == 2, q/@fscope == 2/; ok refaddr $scope[0] == refaddr $fscope[0] && refaddr $scope[0] == refaddr $j, "first object in function\'s scope chain" . " is the global obj"; ok refaddr $scope[1] == refaddr $fscope[1] && refaddr $scope[1] == refaddr $j->prop("obj"), "\'with\' object in scope chain"; ok ref $scope[2] eq \'JE::Object::Function::Call\', "\\\$scope has a call object"; '); var thing; ok(peval('exists $JE::Code::scope->[-1]{thing} '), "activation object is the variable object") ok(!delete thing, 'vars declared in function code are undeletable') ok(this === obj, '"this" value in function code') } } obj.f() // --------------------------------------------------- /* Tests 18-22: 'global-eval code' (eval code called from global code) */ eval(" peval(' my @scope = @$JE::Code::scope; ok @scope == 1, \"scope chain in global-eval code contains one object\"; ok refaddr $scope[0] == refaddr $j, \"object in global-eval scope chain is the global object\"; '); var thing2; ok(this.hasOwnProperty('thing2'), 'global object is the variable object in global-eval code') ok(delete thing2, 'vars declared in global-eval code are deletable') peval('ok refaddr shift(@_) == refaddr $j, \"the \\'this\\' value in global-eval code is the global object\"', this) ") // --------------------------------------------------- /* Tests 23-30: function-eval code */ var obj2 = Object(); with(obj2) { obj2.f = function(){ eval(" peval(' my @scope = @$JE::Code::scope; my @fscope = @{$${$j->eval(\"obj2.f\")->get}{scope}}; ok @scope == 3, q/@scope == 3 (function-eval)/; ok @fscope == 2, q/@fscope == 2 (function-eval)/; ok refaddr $scope[0] == refaddr $fscope[0] && refaddr $scope[0] == refaddr $j, \"first object in function\\'s scope \" . \" chain is the global obj (in eval code)\"; ok refaddr $scope[1] == refaddr $fscope[1] && refaddr $scope[1] == refaddr $j->prop(\"obj2\"), \"\\'with\\' object in function-eval \" . \"scope chain\"; ok ref $scope[2] eq \\'JE::Object::Function::Call\\', \"\\\\\\$scope has a call object\"; '); var thing; ok(peval('exists $JE::Code::scope->[-1]{thing} '), \"activation object is the variable object \" + \"in function-eval code\") ok(delete thing, 'vars declared in function-eval code are deletable') ok(this === obj2, '\"this\" value in function-eval code') ") // end of eval } } obj2.f() --end--

Subject:

parse00.1-long_strings.t

#!perl use Test::More tests => 33; use strict; no warnings 'utf8'; # build long strings for tests # has imbedded \n and different kinds of quotes # parts from which to build long strings my $part_plain = "x" x 50; my $part_quotes = ("y" x 8 . "\\\"" . "z" x 8 . "\\\'") x 3; my $part_nl = "a" x 49 . "\n"; my $part_cont_nl = "b" x 49 . "\\\n"; my $part_dosnl = "a" x 49 . "\r\n"; my $part_cont_dosnl = "b" x 49 . "\\\r\n"; my $part_quote_nl = $part_quotes . "\n"; my $part_quote_cont_nl = $part_quotes . "\\\n"; my $part_quote_dosnl = $part_quotes . "\r\n"; my $part_quote_cont_dosnl = $part_quotes . "\\\r\n"; # legal test strings my $long_legal_nl = ($part_plain . $part_cont_nl . $part_quotes . $part_cont_dosnl . $part_plain . $part_quote_cont_nl . $part_quote_cont_dosnl) x 100; my $long_legal_nonl = ($part_plain . $part_quotes) x 500; my $long_legal_noqt_nonl = ($part_plain) x 10000; my $short_legal_cont_nl = ($part_cont_nl . $part_quote_cont_nl . $part_cont_dosnl . $part_quote_cont_dosnl) x 3; my $short_legal_nonl = ($part_plain . $part_quotes); # illegal test strings (line continuations w/o \) my $short_illegal = ($part_nl . $part_quote_nl . $part_dosnl . $part_quote_dosnl) x 3; my $long_illegal = ($part_plain . $part_cont_nl . $part_quotes . $part_nl . $part_plain . $part_quote_nl) x 200; my $single = "\'"; my $double = "\""; my %qname = ($single, 'single', $double, 'double'); my $j; # add arbitrary quotes sub add_quote { my ($qt, $str) = @_; return $qt . $str . $qt; } # compute expected values # only works for cases in this test sub get_exp { my ($qt, $in) = @_; $in =~ s/\\\r?\n//gx; # concatenate legal line continuations $in =~ s/\\(.)/$1/gx; $in; } # test legal string literal # try both single and double quoted versions sub test_legal { my ($str, $msg) = @_; test_legal_1 ($single, $str, $msg); test_legal_1 ($double, $str, $msg); } sub test_legal_1 { my ($qt, $str, $msg) = @_; my $code; my $got = ""; my $qmsg = "$qname{$qt}-quoted $msg"; # test for legal syntax ok ($code = $j->parse ("var a = " . add_quote($qt, $str) . ";\n"), "Parse $qmsg"); # test for correct value $code->execute if $code; $got = $j->prop("a") if $code; my $exp = get_exp ($qt, $str); if (length ($str) < 1000) { is ($got , $exp, "Value of $qmsg"); # more descriptive error message } else { ok ($got eq $exp, "Value of $qmsg"); # shorter error message for gigantic strings } } # test illegal string litteral sub test_illegal { my ($str, $msg) = @_; # test for catch of illegal syntax (parse should fail) ok (!$j->parse ("var a = " . add_quote($single, $str) . ";\n"), "single-quoted $msg"); ok (!$j->parse ("var a = " . add_quote($double, $str) . ";\n"), "double-quoted $msg"); } #--------------------------------------------------------------------# # Test 1: See if the module loads BEGIN { use_ok('JE') }; $j = new JE; #--------------------------------------------------------------------# # Tests 2-21: Parse legal strings test_legal ($long_legal_nl , 'long string with escaped quotes and legal line continuations'); test_legal ($long_legal_nonl , 'long string with escaped quotes and no line continuations' ); test_legal ($long_legal_noqt_nonl, 'long string with no escaped quotes or line continuations' ); test_legal ($short_legal_cont_nl , 'short string with legal line continuations' ); test_legal ($short_legal_nonl , 'short string with escaped quotes and no line continuations' ); #--------------------------------------------------------------------# # Tests 22-25: Parse illegal strings (must fail) test_illegal ($long_illegal , 'long string with illegal line continuations' ); test_illegal ($short_illegal , 'short string with illegal line continuations'); #--------------------------------------------------------------------# # Tests 26-33: Illegal line feeds can be allowed by setting a parser option $JE::Parser::allow_unescaped_lf = 1; test_legal ($long_illegal , 'long string with illegal line continuations (allow unescape line feeds)' ); test_legal ($short_illegal , 'short string with illegal line continuations (allow unescape line feeds)');

Thu Sep 10 21:39:54 2009 kevin [...] cameron.com - Correspondence added

BTW: mojo:~/try/JE/build/JE%perl --version This is perl, v5.10.0 built for x86_64-linux-thread-multi

Thu Sep 10 21:39:56 2009 kevin [...] cameron.com - Status changed from 'new' to 'open'

Sun Sep 13 15:26:19 2009 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

Thank you for the files! (This is the first time anyone else has done any serious hacking on JE.) I have found a more efficient way of making this work. I address each of the issues below. On Thu Sep 10 18:23:35 2009, kcameron wrote: Show quoted text

> I'm using JE::Parser version "0.035"; > > Problem #1: > I'm trying to use the WWW::Scripter module with > WWW::Scripter::Plugin::JavaScript to automtically parse a website. The > plugin uses JE. > > The website gives me a very long string literal (>50000 chars) to supply > data for a graph in the javascript section. This causes > JE::Parser::str() to fail with this internal perl bug: > Complex regular subexpression recursion limit (32766) exceeded > > I traced the failure to this regex: > /\G (?: '((?>(?:[^'\\] | \\.)*))' > | > "((?>(?:[^"\\] | \\.)*))" )/xcgs or return; > > I know nothing about the internal structure of perl's regex handler but > I don't see why this regex should cause a deep recursion.

I haven’t delved into perl’s regexp engine myself, but in the engine that ECMAScript describes every iteration of a repetition (*) calls another function, so a series of ‘returns’ can allow backtracking. perl is probably similar. Show quoted text

> It seems to > me that a reasonable implementation whould only recurse to a level of > maybe 5-10. Nonetheless, after trying many different versions of the > regex, I couldn't find an alternative that produced the same result but > didn't hit the recursion limit.

I’ve found that if I remove the alternation (|) by using Jeffrey Friedl’s ‘unrolling’ technique (see the Owl Book [Mastering Regular Expressions]), perl considers the pattern simpler, and it theoretically runs faster than before and faster than the method you propose. Here it is: /\G (?: '([^'\\]*(?:\\.[^'\\]*)*)' | "([^"\\]*(?:\\.[^"\\]*)*)" )/xcgs Show quoted text

> > As a workaround, I've created a version of the rexex that doesn't hit > the recusion limit but requires more perl code to parse strings: > > [...code...] > > Problem #2: > The latest ECMA language reference (Edition 5)

I didn’t know they were up to 5. I’d better get myself a copy! Show quoted text

> allows line breaks in the > source only when escaped with '\'. I've confirmed that at least the > Firefox javascript parser

aka SpiderMonkey Show quoted text

> follows this rule. JE::Parser::str was > allowing linefeeds anywhere between quotes. I fixed str() to disallow > unescaped line feeds by default.

I did this since it was more work to disallow it and ECMA allows extensions to the syntax. I see no reason to disallow it. Show quoted text

> > Problem #3: > Per the ECMA spec, the usage of "\" at the end of a line within a string > literal is intended as a continuation of the string. For example, the > following assertion should be true: > "AB" === "A\ > B";

Ah, yes. That’s new in ECMAScript 5. Show quoted text

> > JE::Parser::str was retaining the line feed after "\" which would render > the above assertion false. I've changed the code to remove the escape > and line feed. It now also removes escaped DOS-stype line breaks "\\\r\n".

[\x{2028}\x{2029}] and a plain \r are allowed as well. Show quoted text

> > Problem #4: > There is one test file "t/10.02-entering-execution-contexts.t" which > assumes the illegal inclusion of unescaped line feeds mentioned in > Problem #2. Rather than fix the test, I figured that there may be some > other environments in the world that need it. Therefore, I added a > variable "$JE::Parser::allow_unescaped_lf" which can be set if desired. > I'm not sure this is the best solution. Perhaps it should be set in > some kind of "set_options" call or maybe it shouldn't be allowed at all. > "t/10.02-entering-execution-contexts.t" passes with this flag set.

JE already allows arguments (max_ops and html_mode) to its constructor. There are accessors that correspond to this. It this case, however, I don’t think it is necessary. See above. Show quoted text

> > I've created a new test file "t/parse00.1-long_strings.t" to test the > above issues. > > I've included the modified files in hopes they can be integrated into > the next release.

Thank you. I have incorporated your changes along with my suggested tweaks and released a new version: cpan/authors/id/S/SP/SPROUT/JE-0.036.tar.gz

Sun Sep 13 15:26:22 2009 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Status changed from 'open' to 'resolved'

Sun Sep 13 23:55:56 2009 kevin [...] cameron.com - Correspondence added

Subject:	Re: [rt.cpan.org #49607] Problems w/ string literal scanning
Date:	Sun, 13 Sep 2009 20:55:31 -0700
To:	"Father Chrysostomos via RT" <bug-JE [...] rt.cpan.org>
From:	Kevin Cameron <kevin [...] cameron.com>

Thanks for the quick update Father. Of course, I don't agree with your decision to let JE be more permissive than the published spec. IMHO, it's not a good way to promote cross-platform code compatibility. Still, it's your module so I won't say anything more. I'm starting to look at the next two issues: Error: There is no default confirm function at /home/kcameron/myperl/lib/perl5/site_perl/5.10.0/WWW/Scripter/Plugin/JavaScript/JE.pm line 39 TypeError: undefined has no properties, not even one named length at http://www.pvwatch.com/scripts/calendarDateInput.js, [1] line 361. I'll let you know if I figure something out. Regards, Kev On Sun 13/09/09 1:26 PM , "Father Chrysostomos via RT" bug-JE@rt.cpan.org sent: https://rt.cpan.org/Ticket/Display.html?id=49607 [2] > Thank you for the files! (This is the first time anyone else has done any serious hacking on JE.) I have found a more efficient way of making this work. I address each of the issues below. On Thu Sep 10 18:23:35 2009, kcameron wrote: Show quoted text

> I'm using JE::Parser version "0.035"; > > Problem #1: > I'm trying to use the WWW::Scripter module with > WWW::Scripter::Plugin::JavaScript to automtically parse a website. The > plugin uses JE. > > The website gives me a very long string literal (>50000 chars) to

supply Show quoted text

> data for a graph in the javascript section. This causes > JE::Parser::str() to fail with this internal perl bug: > Complex regular subexpression recursion limit (32766) exceeded > > I traced the failure to this regex: > /G (?: '((?>(?:[^'\] | \.)*))' > | > "((?>(?:[^"\] | \.)*))" )/xcgs or return; > > I know nothing about the internal structure of perl's regex handler but > I don't see why this regex should cause a deep recursion.

I haven’t delved into perl’s regexp engine myself, but in the engine that ECMAScript describes every iteration of a repetition (*) calls another function, so a series of ‘returns’ can allow backtracking. perl is probably similar. Show quoted text

> It seems to > me that a reasonable implementation whould only recurse to a level of > maybe 5-10. Nonetheless, after trying many different versions of the > regex, I couldn't find an alternative that produced the same result but > didn't hit the recursion limit.

I’ve found that if I remove the alternation (|) by using Jeffrey Friedl’s ‘unrolling’ technique (see the Owl Book [Mastering Regular Expressions]), perl considers the pattern simpler, and it theoretically runs faster than before and faster than the method you propose. Here it is: /G (?: '([^'\]*(?:\.[^'\]*)*)' | "([^"\]*(?:\.[^"\]*)*)" )/xcgs Show quoted text

> > As a workaround, I've created a version of the rexex that doesn't hit > the recusion limit but requires more perl code to parse strings: > > [...code...] > > Problem #2: > The latest ECMA language reference (Edition 5)

I didn’t know they were up to 5. I’d better get myself a copy! Show quoted text

> allows line breaks in the > source only when escaped with ''. I've confirmed that at least the > Firefox javascript parser

aka SpiderMonkey Show quoted text

> follows this rule. JE::Parser::str was > allowing linefeeds anywhere between quotes. I fixed str() to disallow > unescaped line feeds by default.

I did this since it was more work to disallow it and ECMA allows extensions to the syntax. I see no reason to disallow it. Show quoted text

> > Problem #3: > Per the ECMA spec, the usage of "" at the end of a line within a string > literal is intended as a continuation of the string. For example, the > following assertion should be true: > "AB" === "A > B";

Ah, yes. That’s new in ECMAScript 5. Show quoted text

> > JE::Parser::str was retaining the line feed after "" which would render > the above assertion false. I've changed the code to remove the escape > and line feed. It now also removes escaped DOS-stype line breaks "\rn".

[x{2028}x{2029}] and a plain r are allowed as well. Show quoted text

> > Problem #4: > There is one test file "t/10.02-entering-execution-contexts.t" which > assumes the illegal inclusion of unescaped line feeds mentioned in > Problem #2. Rather than fix the test, I figured that there may be some > other environments in the world that need it. Therefore, I added a > variable "$JE::Parser::allow_unescaped_lf" which can be set if desired. > I'm not sure this is the best solution. Perhaps it should be set in > some kind of "set_options" call or maybe it shouldn't be allowed at

all. Show quoted text

> "t/10.02-entering-execution-contexts.t" passes with this flag set.

JE already allows arguments (max_ops and html_mode) to its constructor. There are accessors that correspond to this. It this case, however, I don’t think it is necessary. See above. Show quoted text

> > I've created a new test file "t/parse00.1-long_strings.t" to test the > above issues. > > I've included the modified files in hopes they can be integrated into > the next release.

Thank you. I have incorporated your changes along with my suggested tweaks and released a new version: cpan/authors/id/S/SP/SPROUT/JE-0.036.tar.gz Links: ------ [1] http://www.pvwatch.com/scripts/calendarDateInput.js, [2] http://webmail.sonic.net/parse.php?redirect=https://rt.cpan.org/Ticket/Display.html%3Fid%3D49607

Sun Sep 13 23:55:58 2009 The RT System itself - Status changed from 'resolved' to 'open'

Mon Oct 12 11:43:27 2009 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Sun Sep 13 23:55:56 2009, kcameron wrote: Show quoted text

> I'll let you know if I figure something out.

If you do find anything, please open a new ticket. I’m marking this one as resolved (again).

Mon Oct 12 11:43:28 2009 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Status changed from 'open' to 'resolved'