Bug #129481 for Template-Generate: code to handle [% ... %] in Template::Generate

Subject:	Re: [rt.cpan.org #129481] AutoReply: code to handle [% ... %] in Template::Generate
Date:	Tue, 14 May 2019 15:10:21 -0400
To:	bug-Template-Generate [...] rt.cpan.org
From:	Jack Langsdorf <jacklangsdorf [...] gmail.com>

Hello -

I rewrote the entire module, which I submit for your review. This version uses index/substr instead of regex. It handles [% ... %] correctly, and it also correctly handles cases where one of the data items appears more than once.

Compared to my previous submission on this bug, which added the [% ... %] handling, this version is more than 400x faster on my large testcase (from 2.5 minutes to 0.33 seconds).

In the testcase, I supply data for two items in a list found on a webpage (xkcd_blag.htm), use Template::Generate to generate the template for those items, and then use Template::Extract to recover the full list. (During testing I renamed the module to Generate2.pm and Generate3.pm for comparison.)

- Jack Langsdorf

On Tue, May 7, 2019 at 12:08 PM Bugs in Template-Generate via RT <bug-Template-Generate@rt.cpan.org> wrote:

> Greetings,
>
> This message has been automatically generated in response to the
> creation of a trouble ticket regarding:
>     "code to handle [% ... %] in Template::Generate",
> a summary of which appears below.
>
> There is no need to reply to this message right now. Your ticket has been
> assigned an ID of [rt.cpan.org #129481]. Your ticket is accessible
> on the web at:
>
>     https://rt.cpan.org/Ticket/Display.html?id=129481
>
> Please include the string:
>
>     [rt.cpan.org #129481]
>
> in the subject line of all future correspondence about this issue. To do so,
> you may reply to this message.
>
> Thank you,
> bug-Template-Generate@rt.cpan.org
>
> -------------------------------------------------------------------------
> Hi!
>
> I wrote some code that gives simple but nontrivial generation of [% ... %]
> in Template::Generate.
>
> My concept is that every fixed string of length > 1 in the template is
> potentially replaced with the combination of a prefix, a [% ... %], and a
> suffix. The prefix and suffix are the parts that match in all cases.
>
> The diff is attached.
>
> Handling [% ... %] makes Template::Generate much more powerful when it is
> being used to build a web scraping template - you no longer need to work
> out all of the pieces of data that were used to generate the original
> page. Given a web page with a list of items with Template style formatting,
> if you identify the data you want to grab from two of them, the script can
> find the common template (ignoring other junk in each listing) and then you
> can push that template back into Template::Extract to extract the data from
> the entire list. See the attached example file. (You do have to contribute
> the strings that are associated with FOREACH and END manually.)
>
> ALSO, I noticed that Generate seems to sometimes miss cases if one of the
> data items appears multiple times in the text, but the desired template
> needs to ignore one occurrence of the data item. (In the attached
> generate_and_extract.pl, if you search for google rather than slashdot, it
> fails to find the template needed. All of the suggested templates have
> [% rate %] before [% url %], because it picks up the unrelated A+ given to
> slashdot.) The case that we need is like the 0th case, but deleting
> everything before the first ".
>
> 0 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %].[% ... %] [% rate %]'
>
> 1 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %]'
>
> 2 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %]'
>
> 3 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %].[% ... %] [% rate %]'
>
> I will see if I can find a fix for that bug.
>
> Also, I notice that it is pretty slow when I run on large documents, like
> 1000 lines of html code. I will poke around and see if maybe there is a
> faster way to implement it, perhaps using the index function rather than
> regex. So I may send you another note at some point.
>
> - Jack Langsdorf
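
The prefix / [% ... %] / suffix idea in the quoted message can be illustrated with a small sketch. It is written in Python for illustration only (the module itself is Perl) and is not the attached diff. The function name `collapse_to_wildcard` and the sample strings are hypothetical. The sketch assumes the same fixed-text slot has already been isolated in each sample.

```python
import os


def collapse_to_wildcard(variants):
    """Merge the same fixed-text slot from several samples into one template piece.

    If the slot is identical in every sample it stays literal; otherwise it
    becomes common prefix + '[% ... %]' + common suffix.
    """
    if len(set(variants)) == 1:
        return variants[0]  # same text everywhere: keep it as a literal

    # Longest leading run of characters shared by every variant.
    prefix = os.path.commonprefix(variants)

    # Strip the prefix first so prefix and suffix can never overlap.
    rests = [v[len(prefix):] for v in variants]

    # Longest shared trailing run: common prefix of the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]

    return prefix + "[% ... %]" + suffix


# Hypothetical text found between two data items in two list entries.
slot = collapse_to_wildcard(['</a> (A+) <a href=', '</a> (B-, flaky) <a href='])
print(slot)
```

Here the two samples differ only in the middle of the slot, so the shared text on either side is kept as literal template text and the varying part is replaced by a single [% ... %].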

Message body is not shown because sender requested not to inline it.
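
The reply above mentions using index/substr instead of regex and handling a data item that appears more than once. A minimal sketch of that kind of scan follows. It is written in Python for illustration only (the module itself is Perl) and is not the submitted rewrite. `find_all` is a hypothetical helper that does what repeated calls to Perl's `index` with a start position would do, and the sample document string is invented.

```python
def find_all(text, needle):
    """Return every start offset of needle in text using plain substring search."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found.
        start = text.find(needle, start + 1)
    return positions


# Invented sample: the first "A+" belongs to the previous entry (slashdot),
# the second one is the rating we actually want for google.
doc = 'slashdot A+ <a href="http://google.com">Google</a> A+ fast.'
data = {"rate": "A+", "url": "http://google.com"}

# Every data item maps to all of its candidate positions in the document.
candidates = {name: find_all(doc, value) for name, value in data.items()}
print(candidates)
```

Keeping all candidate offsets for each item, rather than only the first match, is what lets a generator consider a template in which an earlier, unrelated occurrence of a value is skipped.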

Subject:	code to handle [% ... %] in Template::Generate
Date:	Tue, 7 May 2019 12:08:11 -0400
To:	autrijus [...] autrijus.org, bug-Template-Generate [...] rt.cpan.org
From:	Jack Langsdorf <jacklangsdorf [...] gmail.com>