Bug #122096 for Unicode-Security: should use Script_Extensions property for mixed_script

Wed Jun 14 21:59:15 2017 rjbs [...] cpan.org - Ticket created

Subject:

should use Script_Extensions property for mixed_script

Mixed script detection should use Script_Extensions, not Script (charscript), to compute the set of script sets (SOSS), and thus mixed script-y-ness.

Some characters are in multiple scripts. For example, this string should not be mixed script:

qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})

The first three characters are Sylo, followed by a space (Common). The last one has script Bengali, but its Script_Extensions include Sylo, and so the string can be construed as entirely Sylo + Common.

I could write a patch if you're no longer working on this.

See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

-- rjbs
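The difference between the two properties for U+09EA can be checked with a short sketch like the one below. It is only an illustration, not code from Unicode::Security. It assumes a perl whose Unicode data lists Sylo in the Script_Extensions of the Bengali digits, and it uses the compound forms \p{sc=...} and \p{scx=...} so that the property being tested is explicit.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# BENGALI DIGIT FOUR, the last character of the example string.
my $digit = "\x{09EA}";

# Script property: a single value per code point.
my $script = charscript(0x09EA);

# Compound forms make it explicit which property is consulted.
my $sc_is_sylo  = $digit =~ /\p{sc=Sylo}/  ? 1 : 0;   # Script only
my $scx_is_sylo = $digit =~ /\p{scx=Sylo}/ ? 1 : 0;   # Script_Extensions

printf "Script=%s sc=Sylo:%d scx=Sylo:%d\n",
    $script, $sc_is_sylo, $scx_is_sylo;
```

If the assumption about the Unicode data holds, the Script test fails for Sylo while the Script_Extensions test succeeds, which is the mismatch this ticket describes.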

Thu Jun 15 12:28:02 2017 gray [...] cpan.org - Correspondence added

On Wed Jun 14 21:59:15 2017, RJBS wrote:

> Mixed script detection should be using Script_Extensions, not Script
> (charscript) to compute soss, and thus mixed script-y-ness.
>
> Some characters are in multiple scripts. For example, this script
> should not be mixed script:
>
> qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})
>
> The first four characters are Sylo. The last one has script Bengali,
> but has Script_Extensions=Sylo, and so the string can be construed as
> entirely Sylo + Common.
>
> I could write a patch if you're no longer working on this.
>
> See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

Thanks for the report. It looks like Script_Extensions has only been available in perl since 5.21.9, which explains why I didn't implement this correctly from the start. I'll work on a fix, but I don't think it will be trivial.

Thu Jun 15 12:28:03 2017 The RT System itself - Status changed from 'new' to 'open'

Thu Jun 15 13:46:33 2017 gray [...] cpan.org - Correspondence added

On Thu Jun 15 12:28:02 2017, GRAY wrote:

> On Wed Jun 14 21:59:15 2017, RJBS wrote:
> > Mixed script detection should be using Script_Extensions, not Script
> > (charscript) to compute soss, and thus mixed script-y-ness.
> >
> > Some characters are in multiple scripts. For example, this script
> > should not be mixed script:
> >
> > qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})
> >
> > The first four characters are Sylo. The last one has script Bengali,
> > but has Script_Extensions=Sylo, and so the string can be construed as
> > entirely Sylo + Common.
> >
> > I could write a patch if you're no longer working on this.
> >
> > See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection
>
> Thanks for the report. It looks like Script_Extensions were only made
> available in perl since 5.21.9, so that explains why I didn't
> implement this correctly from the start. I'll work on a fix, but I
> don't think it will be trivial.

Wow, performance is going to take a serious hit. Calling Unicode::UCD::charscript($c) on every Unicode character takes 0.3 seconds, whereas calling Unicode::UCD::charprop($c, 'Script_Extensions') takes 223 seconds.
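A comparison along these lines can be sketched as follows. This is not the script used for the numbers above. The helper name time_lookups and the small code point range are assumptions chosen so the sketch finishes quickly, so the absolute timings will not match a run over every code point.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use Unicode::UCD qw(charscript charprop);

# Hypothetical helper: time one lookup function over a code point range.
sub time_lookups {
    my ($lookup, $first, $last) = @_;
    my $start = time;
    $lookup->($_) for $first .. $last;
    return time - $start;
}

# Assumption: a small range instead of every Unicode code point.
my ($first, $last) = (0x0000, 0x02FF);

my $t_script = time_lookups(sub { charscript($_[0]) }, $first, $last);
my $t_scx    = time_lookups(
    sub { charprop($_[0], 'Script_Extensions') }, $first, $last);

printf "charscript: %.4fs\ncharprop:   %.4fs\n", $t_script, $t_scx;
```

The first call to each function may include one-time table loading, so a longer range gives a fairer picture of the per-character cost.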

Fri Jun 16 10:44:45 2017 rjbs [...] cpan.org - Correspondence added

I haven't tested this, but it popped in my head and I wanted to write it down before I forgot:

What if you used this algorithm:

    my @candidate-scripts

    for each char c in string:
      next if c is common or inherited

      unless @candidate-scripts:
        @candidate-scripts = split /,/, charprop(c, "Scx")
        next

      candidate-scripts = grep { c =~ /\p{$_}/ } @candidate-scripts
      return 1 unless candidate-scripts

    return 0

So: instead of getting every property for every character in the string, you get them all for the first relevant character, then check each subsequent character against that list, which never gets larger.

This might be slower, but maybe not.

-- rjbs
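The pseudocode above could be written as runnable Perl roughly as follows. This is an untested-in-production sketch, not the module's implementation. The function name is_mixed_script is hypothetical, and the sketch assumes perl 5.22 or later for charprop. Because the exact separator in charprop's Script_Extensions result is not relied on here, the sketch splits on whitespace or commas. It also uses \p{scx=...} explicitly so each test is against Script_Extensions rather than Script.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprop);

# Hypothetical helper: returns 1 if $string is mixed script, 0 otherwise.
sub is_mixed_script {
    my ($string) = @_;
    my @candidates;
    my $seeded = 0;

    for my $c (split //, $string) {
        # Characters usable with any script place no constraint.
        next if $c =~ /[\p{scx=Common}\p{scx=Inherited}]/;

        if (!$seeded) {
            # Seed the list from the first relevant character only.
            my $scx = charprop(ord($c), 'Script_Extensions');
            $scx = '' unless defined $scx;
            # Assumption: the names are separated by spaces or commas.
            @candidates = grep { length } split /[\s,]+/, $scx;
            $seeded = 1;
            next;
        }

        # Keep only the scripts this character can also belong to.
        @candidates = grep { $c =~ /\p{scx=$_}/ } @candidates;
        return 1 unless @candidates;
    }

    return 0;
}

my $sylo_only = is_mixed_script("\x{a81b}\x{a80d}\x{a80e} \x{09EA}");
my $lat_cyr   = is_mixed_script("a\x{0430}");
print "Sylo example mixed: $sylo_only, Latin+Cyrillic mixed: $lat_cyr\n";
```

The candidate list only ever shrinks, so the expensive charprop call happens once per string and every later character costs at most one regex match per remaining candidate.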

Fri Jun 16 11:57:13 2017 gray [...] cpan.org - Correspondence added

On Fri Jun 16 10:44:45 2017, RJBS wrote:

> I haven't tested this, but it popped in my head and I wanted to write
> it down before I forgot:
>
> What if you used this algorithm:
>
>     my @candidate-scripts
>
>     for each char c in string:
>       next if c is common or inherited
>
>       unless @candidate-scripts:
>         @candidate-scripts = split /,/, charprop(c, "Scx")
>         next
>
>       candidate-scripts = grep { c =~ /\p{$_}/ } @candidate-scripts
>       return 1 unless candidate-scripts
>
>     return 0
>
> So: instead of getting every property for every character in the
> string, you get them all for the first relevant character, then check
> each subsequent character against that list, which never gets larger.
>
> This might be slower, but maybe not.

I like that in some cases your algorithm can be faster, but I need access to the set of script sets (SOSS) to implement other algorithms. I'll keep this in mind for potential future optimizations.
