Subject: Regex looks wrong
In the following regex:
our $valid_utf8_regexp = <<'REGEX' ;
[\x{00}-\x{7f}]
| [\x{c2}-\x{df}][\x{80}-\x{bf}]
| \x{e0} [\x{a0}-\x{bf}][\x{80}-\x{bf}]
| [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
| \x{ed} [\x{80}-\x{9f}][\x{80}-\x{bf}]
| [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
| \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}]
| [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
| \x{f4} [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
REGEX
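the line starting with \x{f0} is wrong: it lists only two continuation bytes. If \x{f0} is the first byte, then three more bytes follow, so the line should presumably read:
| \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]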
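A minimal sketch of the problem, assuming the regex is applied to raw bytes under the /x modifier (the names $posted, $corrected and $bytes are made up for this illustration and are not from the module):

```perl
use strict;
use warnings;

# The \x{f0} alternative as posted: lead byte plus two continuation bytes.
my $posted    = qr/\A \x{f0} [\x{90}-\x{bf}] [\x{80}-\x{bf}] \z/x;

# The same alternative with the third continuation byte added.
my $corrected = qr/\A \x{f0} [\x{90}-\x{bf}] [\x{80}-\x{bf}] [\x{80}-\x{bf}] \z/x;

# U+1033C GOTHIC LETTER MANNA as raw UTF-8 bytes (a four-byte sequence).
my $bytes = "\xF0\x90\x8C\xBC";

print "posted:    ", ($bytes =~ $posted    ? "match" : "no match"), "\n";
print "corrected: ", ($bytes =~ $corrected ? "match" : "no match"), "\n";
```

With the anchors in place, the posted alternative should fail on this sequence because one byte is left over, while the four-byte version should match.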
Here are some UTF-8 bytes to test it on:
𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
This is from
http://www.columbia.edu/~fdc/utf8/
See also here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
Quote:
"U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"
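As a rough illustration of that bit layout, the sketch below packs a code point into the four bytes by hand. The helper name encode_four_bytes is made up for this example, and it does no range checking:

```perl
use strict;
use warnings;

# Encode a code point using the quoted layout
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (no range checking in this sketch).
sub encode_four_bytes {
    my ($cp) = @_;
    return (
        0xF0 | (($cp >> 18) & 0x07),   # lead byte: top 3 bits
        0x80 | (($cp >> 12) & 0x3F),   # first continuation byte
        0x80 | (($cp >> 6)  & 0x3F),   # second continuation byte
        0x80 | ( $cp        & 0x3F),   # third continuation byte
    );
}

# U+1033C GOTHIC LETTER MANNA
printf "%02X %02X %02X %02X\n", encode_four_bytes(0x1033C);
# prints "F0 90 8C BC"
```

The lead byte \x{f0} is followed by three continuation bytes, which is what the \x{f0} line of the regex needs to allow for.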