Bug #91537 for Test-utf8: Regex looks wrong

Subject:

Regex looks wrong

In the following regex:

    our $valid_utf8_regexp = <<'REGEX' ;
            [\x{00}-\x{7f}]
          | [\x{c2}-\x{df}][\x{80}-\x{bf}]
          |         \x{e0} [\x{a0}-\x{bf}][\x{80}-\x{bf}]
          | [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{ed} [\x{80}-\x{9f}][\x{80}-\x{bf}]
          | [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}]
          | [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
          |         \x{f4} [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    REGEX

the line starting with \x{f0} is wrong: it allows only two continuation bytes. If \x{f0} is the first byte, then three more bytes must follow, so a third [\x{80}-\x{bf}] is missing from that line.

Here is some UTF-8 text to test it on:

    𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.

This is from http://www.columbia.edu/~fdc/utf8/

See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Quote: "U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"
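
A minimal sketch of the fix I have in mind, not a patch against the module itself. It copies the alternatives from the regex quoted in this report and adds a third continuation byte to the \x{f0} line. The names $utf8_char and is_valid_utf8_bytes are made up for this example and are not part of Test-utf8; how the module actually applies $valid_utf8_regexp may differ.

```perl
use strict;
use warnings;

# Assumed correction: the \x{f0} alternative gets three continuation bytes,
# matching the "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx" layout quoted above.
my $utf8_char = qr/
      [\x{00}-\x{7f}]
    | [\x{c2}-\x{df}][\x{80}-\x{bf}]
    |         \x{e0} [\x{a0}-\x{bf}][\x{80}-\x{bf}]
    | [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    |         \x{ed} [\x{80}-\x{9f}][\x{80}-\x{bf}]
    | [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    |         \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    | [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
    |         \x{f4} [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
/x;

# Hypothetical helper: true if the byte string is a sequence of
# well-formed UTF-8 characters according to the pattern above.
sub is_valid_utf8_bytes {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$utf8_char)*\z/ ? 1 : 0;
}

# U+1033C (first letter of the sample text) as raw UTF-8 bytes.
my $four_bytes = "\xF0\x90\x8C\xBC";
# The same sequence with its last byte cut off; this must not validate.
my $truncated  = "\xF0\x90\x8C";

print is_valid_utf8_bytes($four_bytes) ? "ok\n" : "not ok\n";
print is_valid_utf8_bytes($truncated)  ? "not ok\n" : "ok\n";
```

With the \x{f0} line as it currently stands, the truncated three-byte string would be accepted and the full four-byte character would be rejected, which is the wrong way round.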