
This queue is for tickets about the Template-Generate CPAN distribution.

Report information
The Basics
Id: 129481
Status: new
Priority: 0/
Queue: Template-Generate

People
Owner: Nobody in particular
Requestors: jacklangsdorf [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: code to handle [% ... %] in Template::Generate
Date: Tue, 7 May 2019 12:08:11 -0400
To: autrijus [...] autrijus.org, bug-Template-Generate [...] rt.cpan.org
From: Jack Langsdorf <jacklangsdorf [...] gmail.com>
Hi!

I wrote some code that gives simple but nontrivial generation of [% ... %] in Template::Generate.

My concept is that every fixed string of length > 1 in the template is potentially replaced with the combination of a prefix, a [% ... %], and a suffix. The prefix and suffix match for all cases.

The diff is attached.

Handling [% ... %] makes Template::Generate much more powerful when it is used to build a web scraping template: you no longer need to work out all of the pieces of data that were used to generate the original page. Given a web page with a list of items with Template-style formatting, if you identify the data you want to grab from two of them, the script can find the common template (ignoring other junk in each listing), and then you can push that template back into Template::Extract to extract the data from the entire list. See the attached example file. (You do have to supply the strings associated with FOREACH and END manually.)

ALSO, I noticed that Generate sometimes misses cases if one of the data items appears multiple times in the text but the desired template needs to ignore one occurrence of that item. In the attached generate_and_extract.pl, if you search for google rather than slashdot, it fails to find the template needed. (All of the suggested templates have [% rate %] before [% url %], because it picks up the unrelated A+ given to slashdot.) The case that we need is like the 0th case, but deleting everything before the first ".

0 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %].[% ... %] [% rate %]'

1 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %]'

2 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %]'

3 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %].[% ... %] [% rate %]'

I will see if I can find a fix for that bug.

Also, I noticed that it is pretty slow when I run it on large documents, like 1000 lines of HTML code. I will poke around and see if maybe there is a faster way to implement it, perhaps using the index function rather than regex. So I may send you another note at some point.

- Jack Langsdorf
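
The prefix / [% ... %] / suffix idea described in this message can be illustrated with a small sketch. This is not the attached diff and not Template::Generate's code; it is a Python illustration under the assumption that the fixed text found in the same slot of several samples is collapsed by keeping the shared prefix and shared suffix and putting a wildcard between them. The names `generalize`, `common_suffix`, and `WILDCARD` are hypothetical.

```python
import os

WILDCARD = "[% ... %]"  # the "ignore this text" tag discussed in the ticket


def common_suffix(strings):
    """Longest suffix shared by all strings (character-wise)."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def generalize(fragments):
    """Collapse the fixed text seen in one slot across several samples.

    Assumption for this sketch: if the fragments differ, keep the prefix
    and suffix common to all of them and replace the middle with WILDCARD.
    """
    if len(set(fragments)) == 1:
        return fragments[0]  # identical everywhere: keep as literal text
    prefix = os.path.commonprefix(fragments)
    # Compute the suffix on what is left after the prefix, so the two
    # never overlap inside the shortest fragment.
    remainders = [f[len(prefix):] for f in fragments]
    suffix = common_suffix(remainders)
    return prefix + WILDCARD + suffix


print(generalize(["</a> <b>hot</b> - ", "</a> <i>new</i> - "]))
# prints: </a> <[% ... %]> - 
```

Only the parts that agree in every sample survive as literal text, which is what lets the resulting template skip unrelated junk in each listing.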


Subject: Re: [rt.cpan.org #129481] AutoReply: code to handle [% ... %] in Template::Generate
Date: Tue, 14 May 2019 15:10:21 -0400
To: bug-Template-Generate [...] rt.cpan.org
From: Jack Langsdorf <jacklangsdorf [...] gmail.com>
Hello -

I rewrote the entire module, which I submit for your review. This version uses index/substr instead of regex. It handles [% ... %] correctly, and also correctly handles cases where one of the data items appears more than once.

Compared to my previous submission on this bug, which added the [% ... %] handling, this version is about 400x faster on my large testcase (from 2.5 minutes to 0.33 seconds).

In the testcase, I supply data for two items in a list found on a webpage (xkcd_blag.htm), use Template::Generate to generate the template for those items, and then use Template::Extract to recover the full list. (During testing I renamed the module to Generate2.pm and Generate3.pm for comparison.)

- Jack Langsdorf
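
To make the index/substr approach concrete, here is a minimal Python sketch of matching a generated template against a document using plain substring search (`str.find`, the counterpart of Perl's `index`) rather than one large regex. It is not the submitted module and not how Template::Extract is implemented. The function name `extract` is hypothetical, and the sketch assumes leftmost matching with no backtracking and no FOREACH handling.

```python
import re

# Used only to split the template itself into literals and tag names.
TAG = re.compile(r"\[%\s*(.*?)\s*%\]")


def extract(template, document):
    """Return {name: value} for one match of template, or None.

    Each literal between tags is located with str.find starting at the
    current position; the text skipped over is the value of the tag that
    precedes the literal. The '...' tag is matched but not recorded.
    """
    parts = TAG.split(template)  # [literal, name, literal, name, ...]
    if not document.startswith(parts[0]):
        return None
    pos = len(parts[0])
    fields = {}
    for i in range(1, len(parts), 2):
        name, literal = parts[i], parts[i + 1]
        if literal:
            end = document.find(literal, pos)
            if end == -1:
                return None  # a fixed piece of the template is missing
        else:
            # Simplification: a tag with no literal after it takes
            # everything up to the end of the document.
            end = len(document)
        if name != "...":
            fields[name] = document[pos:end]
        pos = end + len(literal)
    return fields


template = '[% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %]'
page = '<li><a href="http://example.org/" class="x">Example</a> ok</li>'
print(extract(template, page))
# prints: {'url': 'http://example.org/', 'title': 'Example'}
```

Because every step is a forward substring search, the work grows roughly with the document length per literal, which is one plausible reason an index-based rewrite would be much faster on large pages than a regex with many wildcards.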
