
This queue is for tickets about the Unicode-Security CPAN distribution.

Report information
The Basics
Id: 122096
Status: open
Priority: 0/
Queue: Unicode-Security

People
Owner: Nobody in particular
Requestors: rjbs [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: should use Script_Extensions property for mixed_script
Mixed script detection should be using Script_Extensions, not Script (charscript), to compute the SOSS, and thus mixed script-y-ness.

Some characters are in multiple scripts. For example, this string should not be mixed script:

    qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})

The first three characters are Sylo, and the fourth is a space (Common). The last one has script Bengali, but has Script_Extensions=Sylo, and so the string can be construed as entirely Sylo + Common.

I could write a patch if you're no longer working on this.

See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

-- 
rjbs
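The difference between the two properties for the example string can be checked directly with Perl's regex properties. This is only a small sketch: `script_membership` is a hypothetical helper name, not part of Unicode-Security, and the results depend on the Unicode version bundled with the perl in use.

```perl
use strict;
use warnings;

# The example string from the report: three Syloti Nagri letters,
# a space, and BENGALI DIGIT FOUR (U+09EA).
my $example = "\x{a81b}\x{a80d}\x{a80e} \x{09EA}";

# Hypothetical helper: for each character, record whether it matches
# the given script under Script (sc) and under Script_Extensions (scx).
sub script_membership {
    my ($string, $script) = @_;
    my @rows;
    for my $char (split //, $string) {
        push @rows, {
            codepoint => sprintf('U+%04X', ord $char),
            sc        => ($char =~ /\p{Script=$script}/            ? 1 : 0),
            scx       => ($char =~ /\p{Script_Extensions=$script}/ ? 1 : 0),
        };
    }
    return @rows;
}

for my $row (script_membership($example, 'Syloti_Nagri')) {
    printf "%s  sc=%d  scx=%d\n", @{$row}{qw(codepoint sc scx)};
}
```

On a perl whose Unicode data lists Sylo in the Script_Extensions of U+09EA, the last row should show sc=0 but scx=1, which is the discrepancy this ticket is about.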
On Wed Jun 14 21:59:15 2017, RJBS wrote:
> Mixed script detection should be using Script_Extensions, not Script
> (charscript) to compute soss, and thus mixed script-y-ness.
>
> Some characters are in multiple scripts. For example, this script
> should not be mixed script:
>
>   qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})
>
> The first four characters are Sylo. The last one has script Bengali,
> but has Script_Extensions=Sylo, and so the string can be construed as
> entirely Sylo + Common.
>
> I could write a patch if you're no longer working on this.
>
> See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

Thanks for the report. It looks like Script_Extensions has only been available in perl since 5.21.9, which explains why I didn't implement this correctly from the start. I'll work on a fix, but I don't think it will be trivial.
On Thu Jun 15 12:28:02 2017, GRAY wrote:
> On Wed Jun 14 21:59:15 2017, RJBS wrote:
> > Mixed script detection should be using Script_Extensions, not Script
> > (charscript) to compute soss, and thus mixed script-y-ness.
> >
> > Some characters are in multiple scripts. For example, this script
> > should not be mixed script:
> >
> >   qq(\x{a81b}\x{a80d}\x{a80e} \x{09EA})
> >
> > The first four characters are Sylo. The last one has script Bengali,
> > but has Script_Extensions=Sylo, and so the string can be construed as
> > entirely Sylo + Common.
> >
> > I could write a patch if you're no longer working on this.
> >
> > See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection
>
> Thanks for the report. It looks like Script_Extensions were only made
> available in perl since 5.21.9, so that explains why I didn't
> implement this correctly from the start. I'll work on a fix, but I
> don't think it will be trivial.

Wow, performance is going to take a serious hit. Calling Unicode::UCD::charscript($c) on every Unicode character took 0.3 seconds, whereas calling Unicode::UCD::charprop($c, 'Script_Extensions') took 223 seconds.
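A comparison along these lines could be reproduced with something like the sketch below. It is not the script that produced the numbers above: `time_lookup` is a hypothetical helper, and the small sample range is an assumption chosen to keep the run short, so the absolute timings will differ.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use Unicode::UCD qw(charscript charprop);   # charprop needs a perl that ships it (5.22 or later)

# Hypothetical helper: time one lookup function over a code point range
# and return the elapsed wall-clock seconds.
sub time_lookup {
    my ($lookup, $first, $last) = @_;
    my $start = time;
    $lookup->($_) for $first .. $last;
    return time - $start;
}

# Sample range (an assumption); widen it to cover more of Unicode.
my ($first, $last) = (0x0900, 0x0AFF);

my $sc_seconds  = time_lookup(sub { charscript($_[0]) }, $first, $last);
my $scx_seconds = time_lookup(sub { charprop($_[0], 'Script_Extensions') }, $first, $last);

printf "charscript: %.4fs\n", $sc_seconds;
printf "charprop:   %.4fs\n", $scx_seconds;
```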
I haven't tested this, but it popped in my head and I wanted to write it down before I forgot:

What if you used this algorithm:

    my @candidate-scripts

    for each char c in string:
      next if c is common or inherited

      unless @candidate-scripts:
        @candidate-scripts = split /,/, charprop(c, "Scx")
        next

      candidate-scripts = grep { c =~ /\p{$_}/ } @candidate-scripts
      return 1 unless candidate-scripts

    return 0

So: instead of getting every property for every character in the string, you get them all for the first relevant character, then check each subsequent character against that list, which never gets larger.

This might be slower, but maybe not.

-- 
rjbs
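For reference, the pseudocode above could be turned into runnable Perl roughly as follows. This is a sketch under assumptions rather than a proposed patch: `is_mixed_script_sketch` is a hypothetical name, the property tests are written as explicit `\p{Script_Extensions=...}` matches, and the list returned by `charprop` is split on commas or whitespace because its exact separator is assumed here.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprop);   # charprop needs perl 5.22 or later

# Hypothetical name. Returns 1 if the string looks mixed-script under
# the candidate-narrowing idea, 0 otherwise.
sub is_mixed_script_sketch {
    my ($string) = @_;
    my @candidates;
    my $seeded = 0;

    for my $char (split //, $string) {
        # Characters whose Script_Extensions is Common or Inherited
        # never constrain the result.
        next if $char =~ /\p{Script_Extensions=Common}|\p{Script_Extensions=Inherited}/;

        if (!$seeded) {
            # Seed the candidates from the first relevant character.
            # Splitting on commas or whitespace is an assumption about
            # how charprop formats a multi-script value.
            @candidates = grep { length }
                split /[\s,]+/, charprop(ord($char), 'Script_Extensions');
            $seeded = 1;
            next;
        }

        # Keep only the candidates this character also belongs to.
        @candidates = grep { $char =~ /\p{Script_Extensions=$_}/ } @candidates;
        return 1 unless @candidates;
    }
    return 0;
}
```

The candidate list can only shrink, so `charprop` is called once per string and every later character costs one regex match per remaining candidate.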
On Fri Jun 16 10:44:45 2017, RJBS wrote:
> I haven't tested this, but it popped in my head and I wanted to write
> it down before I forgot:
>
> What if you used this algorithm:
>
>   my @candidate-scripts
>
>   for each char c in string:
>     next if c is common or inherited
>
>     unless @candidate-scripts:
>       @candidate-scripts = split /,/, charprop(c, "Scx")
>       next
>
>     candidate-scripts = grep { c =~ /\p{$_}/ } @candidate-scripts
>     return 1 unless candidate-scripts
>
>   return 0
>
> So: instead of getting every property for every character in the
> string, you get them all for the first relevant character, then check
> each subsequent character against that list, which never gets larger.
>
> This might be slower, but maybe not.

I like that in some cases your algorithm can be faster, but I need access to the set of script sets (SOSS) to implement other algorithms. I'll keep this in mind for potential future optimizations.
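To make the SOSS requirement concrete, here is one way a set of script sets could be built from Script_Extensions. It is a sketch only, not the module's implementation: all three function names are hypothetical, the `charprop` value is split on commas or whitespace as an assumption about its format, and script names are normalized through `prop_value_aliases` in case short and long forms are mixed.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprop prop_value_aliases);   # charprop needs perl 5.22 or later

# Hypothetical helper: map any script alias to its full name.
sub full_script_name {
    my ($name) = @_;
    my (undef, $full) = prop_value_aliases('Script', $name);
    return defined $full ? $full : $name;
}

# Hypothetical helper: the distinct script sets in a string, one per
# relevant character, each as a sorted array reference.
sub script_sets_sketch {
    my ($string) = @_;
    my (%seen, @sets);
    for my $char (split //, $string) {
        my @scripts = sort map { full_script_name($_) } grep { length }
            split /[\s,]+/, charprop(ord($char), 'Script_Extensions');
        # Common and Inherited characters are compatible with any script.
        next if "@scripts" eq 'Common' || "@scripts" eq 'Inherited';
        push @sets, \@scripts unless $seen{"@scripts"}++;
    }
    return @sets;
}

# Scripts shared by every set. An empty result suggests mixed script.
sub common_scripts_sketch {
    my @sets = @_;
    my %count;
    $count{$_}++ for map { @$_ } @sets;
    my @shared = sort grep { $count{$_} == @sets } keys %count;
    return @shared;
}
```

Unlike the candidate-narrowing approach, this keeps every script set around, at the cost of one `charprop` call per character.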