
This queue is for tickets about the Bencher-Scenario-LevenshteinModules CPAN distribution.

Report information
The Basics
Id: 111029
Status: resolved
Priority: 0/
Queue: Bencher-Scenario-LevenshteinModules

People
Owner: Nobody in particular
Requestors: SREZIC [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 0.03
Fixed in: (no value)



Subject: Wishlist: add dataset with unicode strings
It would be nice if there was a dataset with characters >= \x{0100}. Especially it would also be interesting if unicode is supported at all by the participants.
On Fri, 8 Jan 2016 22:15:03 GMT, SREZIC wrote:
> It would be nice if there was a dataset with characters >= \x{0100}.
> Especially it would also be interesting if unicode is supported at all
> by the participants.
Yes it would. BTW, could you think of a pair of Unicode texts that might give different distance answer when fed to a Unicode-supporting vs non-Unicode-supporting module?
On 2016-01-08 22:33:31, PERLANCAR wrote:
> Yes it would. BTW, could you think of a pair of Unicode texts that
> might give different distance answer when fed to a Unicode-supporting
> vs non-Unicode-supporting module?
It seems that Text::LevenshteinXS does not support Unicode correctly. The correct answer would be 1 here:

$ perl5.22.1 -MText::LevenshteinXS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
3 at -e line 1.
$ perl5.22.1 -MText::Levenshtein::XS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
1 at -e line 1.
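One plausible reading of the two results above is that a module without Unicode support compares the UTF-8 bytes of the strings rather than their characters. The ticket does not confirm how Text::LevenshteinXS works internally, so this is an assumption. The Python sketch below is not either Perl module; `levenshtein` is an illustrative helper that runs the textbook dynamic-programming distance once over characters and once over UTF-8 bytes.

```python
def levenshtein(a, b):
    """Textbook edit distance over two sequences (str or bytes)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


euro = "\u20acuro"  # same string as "\x{20ac}uro" in the Perl one-liners

# Character level: "E" is replaced by the euro sign, one substitution.
char_distance = levenshtein("Euro", euro)

# Byte level (assumed behaviour of a non-Unicode-aware module):
# the euro sign is three bytes in UTF-8, so one substitution plus two insertions.
byte_distance = levenshtein("Euro".encode("utf-8"), euro.encode("utf-8"))

print(char_distance, byte_distance)
```

Under that assumption the byte-level distance matches the 3 reported for Text::LevenshteinXS, and the character-level distance matches the 1 reported for Text::Levenshtein::XS.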
On Sat, 9 Jan 2016 07:46:30 GMT, SREZIC wrote:
> It seems that Text::LevenshteinXS does not support Unicode correctly.
> The correct answer would be 1 here:
>
> $ perl5.22.1 -MText::LevenshteinXS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
> 3 at -e line 1.
> $ perl5.22.1 -MText::Levenshtein::XS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
> 1 at -e line 1.
Thanks, added.
On 2016-01-09 22:08:38, PERLANCAR wrote:
> Thanks, added.
Well, I have to re-open this ticket. I would have expected that the euro+Text::LevenshteinXS line wouldn't appear in the table, because the result is wrong with this module and thus the benchmark results misleading. I see that the expected result is included in the scenario description --- how about checking if the got result matches the expected result and mark the row specially? E.g. it could look like this (the wrong result moved to the top, and no benchmark numbers shown):

+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+
| seq | name                                                                          | rate     | time     | errors  | samples |
+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+
| 3   | {dataset=>"euro",participant=>"Text::LevenshteinXS::distance"}                | -- wrong result --                      |
| 4   | {dataset=>"euro",participant=>"Text::Levenshtein::Damerau::PP::pp_edistance"} | 1.71e+04 | 58.4μs   | 1.1e-07 | 20      |
| 1   | {dataset=>"euro",participant=>"Text::Levenshtein::fastdistance"}              | 18669    | 53.565μs | 9.6e-10 | 20      |
| 0   | {dataset=>"euro",participant=>"PERLANCAR::Text::Levenshtein::editdist"}       | 3.13e+04 | 31.9μs   | 1.3e-08 | 20      |
| 5   | {dataset=>"euro",participant=>"Text::Levenshtein::Damerau::XS::xs_edistance"} | 3.6e+05  | 2.8μs    | 1e-08   | 20      |
| 2   | {dataset=>"euro",participant=>"Text::Levenshtein::XS::distance"}              | 3.8e+05  | 2.63μs   | 4.2e-09 | 20      |
+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+

Or just remove Text::LevenshteinXS completely from the participants.

BTW, it seems that the expected results are wrong --- the expected result for the euro dataset is listed as 2, but it should be 1.
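The check proposed above (compare each participant's result with the expected one, and flag the row instead of timing it) can be sketched as follows. This is a Python illustration of the idea, not bencher's actual implementation or options. `bench_with_check`, `byte_levenshtein` and the `WRONG` marker are hypothetical names, and the byte-comparing participant merely stands in for a module without Unicode support.

```python
import timeit


def levenshtein(a, b):
    """Edit distance over two sequences (str compares characters, bytes compares bytes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def byte_levenshtein(a, b):
    """Hypothetical non-Unicode-aware participant: compares UTF-8 bytes."""
    return levenshtein(a.encode("utf-8"), b.encode("utf-8"))


WRONG = "-- wrong result --"


def bench_with_check(participants, args, expected, number=1000):
    """Time each participant, but only if it returns the expected result."""
    rows = []
    for name, func in participants.items():
        if func(*args) != expected:
            rows.append((name, WRONG))      # flag the row, show no numbers
            continue
        seconds = timeit.timeit(lambda: func(*args), number=number)
        rows.append((name, f"{number / seconds:.3g}/s"))
    # Stable sort: flagged rows move to the top, the rest keep their order.
    rows.sort(key=lambda row: row[1] != WRONG)
    return rows


rows = bench_with_check(
    {"levenshtein": levenshtein, "byte_levenshtein": byte_levenshtein},
    ("Euro", "\u20acuro"),
    expected=1,
    number=200,
)
for name, result in rows:
    print(f"{name:<20} {result}")
```

Flagging the row rather than dropping it keeps the participant visible in the table while making it clear that its timing should not be compared with the others.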
Subject: Re: [rt.cpan.org #111029] Wishlist: add dataset with unicode strings
Date: Sun, 10 Jan 2016 22:33:54 +0700
To: bug-Bencher-Scenario-LevenshteinModules [...] rt.cpan.org
From: Perl Ancar <perlancar [...] gmail.com>
Fixed the expected result and excluded Text::LevenshteinXS from the Unicode test, sorry about that.

I do plan to add an option to let bencher die (or skip, or warn, or ignore) when an item's result is not as expected.

On Sun, Jan 10, 2016 at 5:23 PM, Slaven_Rezic via RT <bug-Bencher-Scenario-LevenshteinModules@rt.cpan.org> wrote:
> Queue: Bencher-Scenario-LevenshteinModules
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=111029 >
RT-Send-CC: perlancar [...] gmail.com
A final thing: rêves <-> reve is one substitution and one deletion, hence edit distance 2, not 3 (fixing this would remove all of the notes for dataset=reve).

On 2016-01-10 10:34:13, perlancar@gmail.com wrote:
> Fixed the expected result and excluding Text::LevenshteinXS from the
> Unicode test, sorry about that.
>
> I do plan to add an option to let bencher die (or skip, or warn, or
> ignore) when an item's result is not as expected.
On Tue, 12 Jan 2016 21:56:58 GMT, SREZIC wrote:
> A final thing: rêves <-> reve is one substitution and one deletion,
> hence edit distance 2, not 3 (fixing this would remove all of the
> notes for dataset=reve).
Fixed, closing ticket now.
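The rêves <-> reve correction can be checked with a short Python sketch. `edit_distance` is an illustrative memoized recursion, not one of the benchmarked modules, and the sketch assumes "ê" is the single precomposed code point U+00EA. That the earlier value of 3 came from byte counting is only a guess; the ticket does not say where it originated.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance via memoized recursion over prefixes of a and b."""

    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j                      # insert the first j items of b
        if j == 0:
            return i                      # delete the first i items of a
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,       # deletion
                   d(i, j - 1) + 1,       # insertion
                   d(i - 1, j - 1) + cost)  # substitution or match

    return d(len(a), len(b))


reves = "r\u00eaves"  # "rêves" with a precomposed ê (assumption)

# Characters: substitute ê -> e, delete the trailing s.
char_distance = edit_distance(reves, "reve")

# UTF-8 bytes: ê is two bytes, so one extra deletion is needed.
byte_distance = edit_distance(reves.encode("utf-8"), "reve".encode("utf-8"))

print(char_distance, byte_distance)
```

The character-level result agrees with the distance of 2 stated in the ticket, while the byte-level result is 3.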