Subject: Needs updating for TR29 revision 29
Unicode 9.0's TR29 has changed the rules for extended grapheme clusters, and this module no longer complies.
Most specifically:
> Revised rule GB10 (as of Revision 28) to handle characters of class Extend,
> such as variation selectors, in emoji modifier sequences, as may be found in
> existing data.
There is a new Grapheme_Cluster_Break property value, ZWJ, for ZERO WIDTH JOINER, and §3.1.1 details its use.
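As a quick sanity check, the new property value can be queried directly from a regex. This is only a minimal sketch, assuming perl v5.26.0 or later, whose Unicode tables know Grapheme_Cluster_Break=ZWJ; the variable name $is_gcb_zwj is illustrative and not part of any module.

```perl
use v5.26.0;
use warnings;

# Assumption: this perl ships Unicode 9.0 or later tables (v5.26.0+),
# so the Grapheme_Cluster_Break value ZWJ is known to the regex engine.
my $ZWJ = "\N{ZERO WIDTH JOINER}";

# 1 when the character's Grapheme_Cluster_Break property is ZWJ, else 0.
my $is_gcb_zwj = $ZWJ =~ /\A\p{GCB=ZWJ}\z/ ? 1 : 0;

say "ZWJ has GCB=ZWJ: ", ($is_gcb_zwj ? "yes" : "no");
```

This only confirms that perl itself knows the property value; it says nothing about the tables bundled with Unicode::GCString.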
Consider this program:
use v5.26.0;
use warnings;
use Unicode::GCString;
my $ZWJ = "\N{ZERO WIDTH JOINER}";
my $dude = "\N{MAN}";
my $love = "\N{HEAVY BLACK HEART}\N{U+0FE0F}";
my $kiss = "\N{KISS MARK}";
my $in_love = join $ZWJ, $dude, $love, $kiss, $dude;
say length $in_love;
my @matches = $in_love =~ /\G(\X)/g;
say "M: $_" for @matches;
my @gc = split /\b{gcb}/, $in_love;
say "S: $_" for @gc;
say "GCB: " . @gc;
say "COL: " . Unicode::GCString->new($in_love)->columns;
say "LEN: " . Unicode::GCString->new($in_love)->length;
----
This demonstrates both the new, correct behavior (from Unicode 9.0, implemented in perl v5.26.0) of treating a ZWJ-joined emoji sequence as a single display cluster, and this module's outdated behavior. perl treats the string as one extended grapheme cluster (EGC), but Unicode::GCString treats it as four. Presumably, if it treated it as one EGC, it would also count it as one column.
--
rjbs