
This queue is for tickets about the Test-utf8 CPAN distribution.

Report information
The Basics
Id: 91537
Status: new
Priority: 0/
Queue: Test-utf8

People
Owner: Nobody in particular
Requestors: bkb [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Regex looks wrong
In the following regex:

    our $valid_utf8_regexp = <<'REGEX' ;
            [\x{00}-\x{7f}]
          | [\x{c2}-\x{df}][\x{80}-\x{bf}]
          |         \x{e0} [\x{a0}-\x{bf}][\x{80}-\x{bf}]
          | [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{ed} [\x{80}-\x{9f}][\x{80}-\x{bf}]
          | [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}]
          | [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{f4} [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    REGEX

the line starting with \x{f0} is wrong: it matches only two continuation bytes. If \x{f0} is the first byte, then there are three more bytes.

Here are some UTF-8 bytes to test it on:

    𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.

This is from http://www.columbia.edu/~fdc/utf8/

See also here: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Quote: "U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"