Bug #111029 for Bencher-Scenario-LevenshteinModules: Wishlist: add dataset with unicode strings

Fri Jan 08 17:15:03 2016 SREZIC [...] cpan.org - Ticket created

Subject: Wishlist: add dataset with unicode strings

It would be nice if there was a dataset with characters >= \x{0100}. Especially it would also be interesting if unicode is supported at all by the participants.

Fri Jan 08 22:33:31 2016 PERLANCAR [...] cpan.org - Correspondence added

On Fri, 8 Jan 2016 22:15:03 GMT, SREZIC wrote:

> It would be nice if there was a dataset with characters >= \x{0100}.
> Especially it would also be interesting if unicode is supported at all
> by the participants.

Yes it would. BTW, could you think of a pair of Unicode texts that might give different distance answer when fed to a Unicode-supporting vs non-Unicode-supporting module?

Fri Jan 08 22:33:31 2016 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 09 02:46:30 2016 SREZIC [...] cpan.org - Correspondence added

On 2016-01-08 22:33:31, PERLANCAR wrote:

> Yes it would. BTW, could you think of a pair of Unicode texts that
> might give different distance answer when fed to a Unicode-supporting
> vs non-Unicode-supporting module?

It seems that Text::LevenshteinXS does not support Unicode correctly. The correct answer would be 1 here:

$ perl5.22.1 -MText::LevenshteinXS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
3 at -e line 1.
$ perl5.22.1 -MText::Levenshtein::XS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
1 at -e line 1.
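A minimal sketch of one likely explanation for the 3 versus 1 above, assuming the wrong answer comes from comparing UTF-8 bytes instead of characters (the ticket does not confirm how Text::LevenshteinXS works internally). The `lev` helper is a hypothetical plain-Perl Levenshtein written for illustration, not code from either module:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: classic dynamic-programming Levenshtein distance
# over whatever units split // yields, i.e. characters for a decoded
# string and bytes for a UTF-8 encoded one.
sub lev {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. @t);              # distances from the empty prefix
    for my $i (1 .. @s) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,             # deletion
                $cur[$j - 1] + 1,          # insertion
                $prev[$j - 1] + $cost,     # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $chars = "\x{20ac}uro";     # 4 characters, starting with U+20AC
my $bytes = $chars;
utf8::encode($bytes);          # 6 bytes: E2 82 AC u r o

print lev("Euro", $chars), "\n";   # character-wise: 1
print lev("Euro", $bytes), "\n";   # byte-wise: 3
```

Under that assumption the euro sign counts as three units, so replacing "E" costs one substitution plus two insertions.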

Sat Jan 09 22:08:38 2016 PERLANCAR [...] cpan.org - Correspondence added

On Sat, 9 Jan 2016 07:46:30 GMT, SREZIC wrote:

> It seems that Text::LevenshteinXS does not support Unicode correctly.
> The correct answer would be 1 here:
>
> $ perl5.22.1 -MText::LevenshteinXS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
> 3 at -e line 1.
> $ perl5.22.1 -MText::Levenshtein::XS=distance -e 'warn distance("Euro", "\x{20ac}uro")'
> 1 at -e line 1.

Thanks, added.

Sat Jan 09 22:08:39 2016 PERLANCAR [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Jan 10 05:23:08 2016 SREZIC [...] cpan.org - Correspondence added

On 2016-01-09 22:08:38, PERLANCAR wrote:

> Thanks, added.

Well, I have to re-open this ticket. I would have expected that the euro+Text::LevenshteinXS line wouldn't appear in the table, because the result is wrong with this module and thus the benchmark results are misleading. I see that the expected result is included in the scenario description --- how about checking whether the actual result matches the expected result and marking the row specially? E.g. it could look like this (the wrong result moved to the top, and no benchmark numbers shown):

+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+
| seq | name                                                                          | rate     | time     | errors  | samples |
+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+
| 3   | {dataset=>"euro",participant=>"Text::LevenshteinXS::distance"}                | -- wrong result --                      |
| 4   | {dataset=>"euro",participant=>"Text::Levenshtein::Damerau::PP::pp_edistance"} | 1.71e+04 | 58.4μs   | 1.1e-07 | 20      |
| 1   | {dataset=>"euro",participant=>"Text::Levenshtein::fastdistance"}              | 18669    | 53.565μs | 9.6e-10 | 20      |
| 0   | {dataset=>"euro",participant=>"PERLANCAR::Text::Levenshtein::editdist"}       | 3.13e+04 | 31.9μs   | 1.3e-08 | 20      |
| 5   | {dataset=>"euro",participant=>"Text::Levenshtein::Damerau::XS::xs_edistance"} | 3.6e+05  | 2.8μs    | 1e-08   | 20      |
| 2   | {dataset=>"euro",participant=>"Text::Levenshtein::XS::distance"}              | 3.8e+05  | 2.63μs   | 4.2e-09 | 20      |
+-----+-------------------------------------------------------------------------------+----------+----------+---------+---------+

Or just remove Text::LevenshteinXS completely from the participants.

BTW, it seems that the expected results are wrong --- the expected result for the euro dataset is listed as 2, but it should be 1.
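The suggested check could be sketched roughly as follows. This is only an illustration under stated assumptions: the `@participants` structure and the `check_results` helper are hypothetical and are not part of Bencher's actual interface, and the code refs are stand-ins that return the values seen in the one-liners earlier in this ticket rather than calling the real modules:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for benchmark participants: each code ref
# returns the distance reported for the "euro" dataset. No real module
# is called here.
my @participants = (
    { name => 'Text::LevenshteinXS::distance',   code => sub { 3 } },
    { name => 'Text::Levenshtein::XS::distance', code => sub { 1 } },
);
my $expected = 1;    # expected result for dataset "euro"

# Split the items into wrong and correct results, so that only the
# correct ones would go on to be timed.
sub check_results {
    my ($expected, @items) = @_;
    my (@wrong, @ok);
    for my $item (@items) {
        my $got = $item->{code}->();
        if ($got == $expected) { push @ok,    $item->{name} }
        else                   { push @wrong, $item->{name} }
    }
    return (\@wrong, \@ok);
}

my ($wrong, $ok) = check_results($expected, @participants);
print "$_: -- wrong result --\n" for @$wrong;   # listed first, no timings
print "$_: ok to benchmark\n"    for @$ok;
```

Running the check once before timing keeps a fast but incorrect participant from appearing to win the comparison.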

Sun Jan 10 05:23:14 2016 SREZIC [...] cpan.org - Status changed from 'resolved' to 'open'

Sun Jan 10 10:34:13 2016 perlancar [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #111029] Wishlist: add dataset with unicode strings
Date:	Sun, 10 Jan 2016 22:33:54 +0700
To:	bug-Bencher-Scenario-LevenshteinModules [...] rt.cpan.org
From:	Perl Ancar <perlancar [...] gmail.com>

Fixed the expected result and excluded Text::LevenshteinXS from the Unicode test; sorry about that.

I do plan to add an option to let bencher die (or skip, or warn, or ignore) when an item's result is not as expected.

Tue Jan 12 16:56:58 2016 SREZIC [...] cpan.org - Correspondence added

RT-Send-CC: perlancar [...] gmail.com

A final thing: rêves <-> reve is one substitution and one deletion, hence edit distance 2, not 3 (fixing this would remove all of the notes for dataset=reve).

Tue Jan 12 22:46:44 2016 PERLANCAR [...] cpan.org - Correspondence added

On Tue, 12 Jan 2016 21:56:58 GMT, SREZIC wrote:

> A final thing: rêves <-> reve is one substitution and one deletion,
> hence edit distance 2, not 3 (fixing this would remove all of the
> notes for dataset=reve).

Fixed, closing ticket now.

Tue Jan 12 22:46:45 2016 PERLANCAR [...] cpan.org - Status changed from 'open' to 'resolved'