Subject: | Patch for _entry_to_hash to strip HTML from content when calculating the digest |
Hi Robin!
I tried out AE-Feed and found it helpful and nice, but one thing should be improved.
I am talking about the _entry_to_hash algorithm: it is good, but too "naive" for the real world.
For example, there is a Russian geek magazine, overclockers_ru, with an RSS feed at
http://www.overclockers.ru/rss/all.rss . This bad guy embeds HTML in the item
content, and that HTML contains a counter or something like it, which generates a different link
on every fetch. In the end I get duplicates of a single item, because the same item yields different digests.
I have attached a dump of one item so you can see it for yourself. These results were obtained by
reloading the feed three times _in a row_.
The patch uses HTML::Strip, a fast & furious HTML stripper written in XS. It is not as smart as
HTML::Parser and has some bugs, but I suppose it is acceptable for computing a digest.
Thanks in advance,
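To show the idea outside of the patch, here is a minimal standalone sketch. It is only an illustration under some assumptions: strip_tags and entry_digest are hypothetical helpers, the naive regex is a stand-in for HTML::Strip, the feed bodies are made-up sample data, and the core Digest::SHA module is used in place of Digest::SHA1 so that it runs without extra installs.

```perl
use strict;
use warnings;
use Digest::SHA qw/sha1_base64/;   # core module, stand-in for Digest::SHA1
use Encode qw/encode/;

# Hypothetical stand-in for HTML::Strip: drop anything that looks like a tag.
# Good enough for a demo, not for real-world HTML.
sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# Digest of title and body, optionally with tags removed from the body first.
sub entry_digest {
    my ($title, $body, $strip) = @_;
    $body = strip_tags($body) if $strip;
    return sha1_base64(encode('utf-8', join '/', $title, $body));
}

# Two fetches of the same item; only the counter inside the link differs.
my $fetch1 = 'Some news <a href="http://example.com/?c=1">more</a>';
my $fetch2 = 'Some news <a href="http://example.com/?c=2">more</a>';

my $raw_same      = entry_digest('Title', $fetch1, 0) eq entry_digest('Title', $fetch2, 0);
my $stripped_same = entry_digest('Title', $fetch1, 1) eq entry_digest('Title', $fetch2, 1);

print "raw digests equal: ",      ($raw_same      ? "yes" : "no"), "\n";
print "stripped digests equal: ", ($stripped_same ? "yes" : "no"), "\n";
```

With the raw bodies the two digests differ, so the item would be reported twice. Once the tags are removed, both fetches reduce to the same text and the digest stays stable.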
Dmitry.
Subject: | proof |
Message body not shown because it is not plain text.
Subject: | html_ignore.patch |
--- Feed.pm 2011-07-31 23:38:14.000000000 +0400
+++ Feed.pm 2011-07-31 23:55:45.000000000 +0400
@@ -9,6 +9,7 @@
use AnyEvent::HTTP;
use Digest::SHA1 qw/sha1_base64/;
use Scalar::Util qw/weaken/;
+use HTML::Strip;
our $VERSION = '0.3';
@@ -124,6 +125,12 @@
which means that an entry hash is removed from the C<entry_ages> hash after it
has not been seen in the feed for 2 fetches.
+=item no_html_in_hash => $boolean
+
+This setting enables a "smart" mode for calculating the hashes kept in
+C<entry_ages>: all HTML is stripped from the summary and content first. This may
+be useful if items contain dynamic HTML that would otherwise produce duplicates.
+
=back
=cut
@@ -135,6 +142,11 @@
bless $self, $class;
$self->{entry_ages} ||= {};
+
+ # create the HTML stripper only when the (boolean) option is enabled
+ if ($self->{no_html_in_hash}) {
+ $self->{__hs} = HTML::Strip->new();
+ }
if (defined $self->{interval}) {
unless (defined $self->{on_fetch}) {
@@ -163,17 +175,30 @@
}
-sub _entry_to_hash {
- my ($entry) = @_;
- my $x = sha1_base64
+sub _entry_to_hash{
+ my ( $entry, $hs ) = ( @_ );
+ # strip HTML from each body if a stripper was passed, otherwise use the body as-is
+ my @data = map {
+ $_ && $_->body ?
+ ( $hs ?
+ do {
+ my $text = $hs->parse( $_->body );
+ $hs->eof;
+ $text;}
+ :
+ $_->body )
+ : ''
+ } ($entry->summary, $entry->content);
+
+ my $x = sha1_base64
encode 'utf-8',
- (my $a = join '/',
+ ( join '/',
$entry->title,
- ($entry->summary ? $entry->summary->body : ''),
- ($entry->content ? $entry->content->body : ''),
+ @data,
$entry->id,
$entry->link);
- $x
+ $x;
+
}
sub _new_entries {
@@ -189,7 +214,7 @@
$self->{entry_ages}->{$_}++ for keys %{$self->{entry_ages}};
for my $ent (@ents) {
- my $hash = _entry_to_hash ($ent);
+ my $hash = _entry_to_hash ( $ent, $self->{__hs} );
unless (exists $self->{entry_ages}->{$hash}) {
push @new, [$hash, $ent];
@@ -252,7 +277,7 @@
sub _get_headers {
my ($self, %hdrs) = @_;
- my %hdrs = %{$self->{headers} || {}};
+ %hdrs = %{$self->{headers} || {}};
if (defined $self->{last_mod}) {
$hdrs{'If-Modified-Since'} = $self->{last_mod};