
This queue is for tickets about the Text-TFIDF CPAN distribution.

Report information
The Basics
Id: 124091
Status: resolved
Priority: 0/
Queue: Text-TFIDF

People
Owner: LMETCALF [...] cpan.org
Requestors: gene [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.03
Fixed in: (no value)



Subject: IDF computation improvement
According to https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2 the IDF can be computed without the addition of 1 to the count. So, https://metacpan.org/source/LMETCALF/Text-TFIDF-0.03/lib/Text/TFIDF.pm#L70 could be changed to this:

    return - log( $count / scalar( keys %{ $t->{file} } ) ) / log(10);

This makes the wikipedia examples work out nicely. :-)

-Gene
Whoops. That $t should of course be $self.

On Mon Jan 15 11:57:26 2018, GENE wrote:
> return - log( $count / scalar( keys %{ $t->{file} } ) ) / log(10);
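To see why dropping the 1 matters, here is a minimal standalone sketch, not the module's actual code. The helper names `idf_plain` and `idf_plus_one` are hypothetical, and the "+1" variant is only an assumption about what "the addition of 1 to the count" looks like in 0.03. The logarithm is written as log(N / count), which equals the -log(count / N) form proposed above. The inputs follow the Wikipedia example: two documents, with "this" in both and "example" in only one.

```perl
use strict;
use warnings;

# Hypothetical helper: IDF as proposed in this ticket (base-10 log, no +1).
# log( N / count ) is the same quantity as -log( count / N ).
sub idf_plain {
    my ( $count, $n_docs ) = @_;
    return log( $n_docs / $count ) / log(10);
}

# Assumed shape of the variant that adds 1 to the document count.
sub idf_plus_one {
    my ( $count, $n_docs ) = @_;
    return log( $n_docs / ( 1 + $count ) ) / log(10);
}

# Wikipedia example: 2 documents; "this" occurs in 2, "example" in 1.
my $n_docs = 2;
my %doc_count = ( this => 2, example => 1 );

for my $term ( sort keys %doc_count ) {
    printf "%-8s plain=%.3f plus_one=%.3f\n",
        $term,
        idf_plain( $doc_count{$term}, $n_docs ),
        idf_plus_one( $doc_count{$term}, $n_docs );
}
```

Without the 1, "this" gets an IDF of 0 and "example" about 0.301, which matches the Wikipedia figures. With the 1 added, a term that occurs in every document gets a negative IDF.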
On Mon Jan 15 14:57:26 2018, GENE wrote:
> According to
> https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2
> the IDF can be computed without the addition of 1 to the count. So,
> https://metacpan.org/source/LMETCALF/Text-TFIDF-0.03/lib/Text/TFIDF.pm#L70
> could be changed to this:
>
> return - log( $count / scalar( keys %{ $t->{file} } ) ) / log(10);
>
> This makes the wikipedia examples work out nicely. :-)
>
> -Gene
The new version of Text-TFIDF should be correct. The Changes file cites the URLs of the sources used for the computation.