Subject: Entity name match should allow dots and hyphens in download-entities.pl
Hi,
I was regenerating the Data module against the entity sets at https://www.w3.org/2003/entities/2007/ and noticed some entities were missing.
Most notably, the isogrk4 set came out empty, although I expected it to contain entity mappings.
Looking through the source of download-entities.pl, the cause is that the entity-name regex in parse_ent only accepts word characters (\w), so it never matches names that contain a dot or a hyphen. The attached patch fixes this by allowing both characters in entity names.
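To illustrate the difference, here is a minimal standalone sketch that runs the old and the patched patterns over a few sample declarations. The sample input is made up for the demonstration (it only imitates the declaration syntax of the W3C entity files, and not-a-real-name is a hypothetical name); the patterns themselves are the ones from parse_ent.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the entity declaration syntax.
my $sample = <<'END';
<!ENTITY alpha "&#x003B1;" >
<!ENTITY b.alpha "&#x1D6C2;" >
<!ENTITY not-a-real-name "&#x00021;" >
END

# Old pattern: the name must consist of word characters only.
my @old = $sample =~ /(?<=<!ENTITY) \s* \w+ \s+ "&[^"]+" (?=\s*>)/sgx;

# Patched pattern: the name may also contain dots and hyphens.
my @new = $sample =~ /(?<=<!ENTITY) \s* [\w\.\-]+ \s+ "&[^"]+" (?=\s*>)/sgx;

# Pull the entity name out of each raw definition.
my @old_names = map { /(\w+) \s* "&\#/x         ? $1 : () } @old;
my @new_names = map { /(\w[\w\.\-]*) \s* "&\#/x ? $1 : () } @new;

# The second regex needs the same fix: the old name pattern applied to a
# definition found by the patched pattern would capture only the tail.
my ($truncated) = $new[1] =~ /(\w+) \s* "&\#/x;

print "old: @old_names\n";        # names found before the patch
print "new: @new_names\n";        # names found after the patch
print "truncated: $truncated\n";  # what the unpatched name regex would return
```

With this input the old pattern finds only alpha, while the patched one finds all three names. The last line shows why both regexes in parse_ent have to change together: with only the first one patched, b.alpha would be recorded as alpha.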
Subject: download-entities.patch
diff --git a/bin/download-entities.pl b/bin/download-entities.pl
index 97ce11d..2336d41
--- a/bin/download-entities.pl
+++ b/bin/download-entities.pl
@@ -158,8 +158,8 @@ sub report_error {
sub parse_ent {
my ($ent_file_ref) = @_;
if (not ref $ent_file_ref) { $ent_file_ref = \$ent_file_ref }
- my @raw_defs = $$ent_file_ref =~ /(?<=<!ENTITY) \s* \w+ \s+ "&[^"]+" (?=\s*>)/sgx;
- my @name_value_pairs = map {my ($n, $v) = /(\w+) \s* "&\# ([^"]+) "/sx; [$n, $v]} @raw_defs;
+ my @raw_defs = $$ent_file_ref =~ /(?<=<!ENTITY) \s* [\w\.\-]+ \s+ "&[^"]+" (?=\s*>)/sgx;
+ my @name_value_pairs = map {my ($n, $v) = /(\w[\w\.\-]*) \s* "&\# ([^"]+) "/sx; [$n, $v]} @raw_defs;
for (@name_value_pairs) {
my $v = $$_[1];
# For some reason, some entities like &lt; are defined like &#38;#60; instead of &#60; - just get rid of 38;#