Subject: code to handle [% ... %] in Template::Generate
Date: Tue, 7 May 2019 12:08:11 -0400
To: autrijus [...] autrijus.org, bug-Template-Generate [...] rt.cpan.org
From: Jack Langsdorf <jacklangsdorf [...] gmail.com>
Hi!
I wrote some code that gives simple but nontrivial generation of [% ... %]
in Template::Generate.
My concept is that every fixed string of length > 1 in the template is
potentially replaced with the combination of a prefix, a [% ... %], and a
suffix. The prefix and suffix match for all cases.
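To make the idea concrete, here is a rough sketch in Python (not the Perl code in the attached diff). It assumes the variants of one fixed string have already been collected from the examples; the names generalize and WILDCARD are made up for illustration.

```python
import os

WILDCARD = "[% ... %]"  # Template Toolkit-style "ignore this" placeholder

def generalize(fixed_strings):
    """Collapse variants of one fixed string into prefix + wildcard + suffix.

    The prefix and suffix are the parts shared by every variant; whatever
    differs in the middle is replaced by the wildcard.
    """
    if len(set(fixed_strings)) == 1:
        # All variants are identical, so no wildcard is needed.
        return fixed_strings[0]
    prefix = os.path.commonprefix(fixed_strings)
    # Strip the prefix first so prefix and suffix can never overlap, then
    # find the common suffix by reversing each remaining tail.
    reversed_tails = [s[len(prefix):][::-1] for s in fixed_strings]
    suffix = os.path.commonprefix(reversed_tails)[::-1]
    return prefix + WILDCARD + suffix

print(generalize(['<li class="a"><a href="', '<li class="b"><a href="']))
```

Here the two variants differ only in the class value, so the differing part collapses into a single [% ... %] between the shared prefix and suffix.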
The diff is attached.
Handling [% ... %] makes Template::Generate much more powerful when it is
being used to build a web scraping template - you no longer need to work
out all of the pieces of data that were used to generate the original
page. Given a web page with a list of items with Template style formatting,
if you identify the data you want to grab from two of them, the script can find
the common template (ignoring other junk in each listing) and then you can
push that template back into Template::Extract to extract the data from the
entire list. See the attached example file. (You do have to add the
strings that are associated with FOREACH and END by hand.)
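As a rough picture of the extraction step, here is a Python sketch that turns a template containing [% name %] and [% ... %] into a regular expression and applies it to every item in a list. This is only an illustration of the idea, not how Template::Extract is implemented; template_to_regex and extract_all are hypothetical names, and variable names are assumed to be plain identifiers.

```python
import re

TAG = re.compile(r"\[%\s*(.+?)\s*%\]")

def template_to_regex(template):
    """Translate a template into a compiled regex (illustrative only)."""
    parts, pos, seen = [], 0, set()
    for m in TAG.finditer(template):
        # Literal text between tags must match exactly.
        parts.append(re.escape(template[pos:m.start()]))
        name = m.group(1)
        if name == "...":
            parts.append(r".*?")              # wildcard: match and discard
        elif name in seen:
            parts.append(f"(?P={name})")      # repeated variable: same value
        else:
            parts.append(f"(?P<{name}>.*?)")  # first use: capture the value
            seen.add(name)
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)

def extract_all(template, document):
    """Return one dict of captured variables per match in the document."""
    return [m.groupdict() for m in template_to_regex(template).finditer(document)]

doc = ('<li><a href="http://slashdot.org" class="x">Slashdot</a> - News for nerds.</li>\n'
       '<li><a href="http://google.com" class="y">Google</a> - Search engine.</li>')
tmpl = 'href="[% url %]"[% ... %]>[% title %]</a> - [% comment %].</li>'
for row in extract_all(tmpl, doc):
    print(row)
```

The wildcard absorbs the class attribute, which differs between items and is not part of the data being extracted.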
Also, I noticed that Generate sometimes misses cases when one of the
data items appears multiple times in the text but the desired template
needs to ignore one occurrence of it. In the attached
generate_and_extract.pl, if you search for google rather than slashdot, it
fails to find the template needed. All of the suggested templates have
[% rate %] before [% url %], because it picks up the unrelated A+ given to
slashdot. The case that we need is like the 0th case, but with
everything before the first " deleted.
0 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %].[% ... %] [% rate %]'
1 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %]'
2 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %]'
3 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %].[% ... %] [% rate %]'
I will see if I can find a fix for that bug.
Also, I notice that it is pretty slow when I run it on large documents, like
1000 lines of HTML. I will poke around and see if maybe there is a
faster way to implement it, perhaps using the index function rather than
regexes. So I may send you another note at some point.
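For reference, the kind of plain substring scan I have in mind looks like this, sketched in Python with str.find standing in for Perl's index. I have not measured whether it is actually faster inside Generate; find_all is just an illustrative name.

```python
def find_all(haystack, needle):
    """Return the start offset of every occurrence of needle, overlaps included."""
    if not needle:
        return []  # an empty needle would match at every offset; skip it
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping occurrences are found too.
        start = haystack.find(needle, start + 1)
    return positions

print(find_all("A+ slashdot A+", "A+"))
```

Listing every occurrence this way also matters for the repeated-data problem described above, since each offset is a separate candidate for where a variable could go in the template.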
- Jack Langsdorf
Message body is not shown because sender requested not to inline it.