Subject: Feature request + patch: return first attribute occurrence instead of last
Hi,
This is more of a behavior-change request than a bug report.
When the same HTML attribute is specified multiple times in a single element, Internet Explorer and Mozilla both honor the first occurrence, but HTML::Parser honors the last.
For example, suppose a spammer specifies "<body background=white text=white text=black>random garbage<font color=black>advertisement</font></body>" in an HTML-formatted email message. The browsers use text=white, so most Windows users won't see the random garbage. HTML::Parser reports text=black, so my Perl-based anti-spam filter will.
The attached patch emulates IE/Mozilla behavior by storing the first rather than the last attribute in the hash passed as the "attr" argument to event handlers.
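To illustrate the difference outside of the Perl internals, here is a minimal, self-contained C sketch of the two storage policies. Everything in it is hypothetical and mine for illustration only: the attr_table type and the attr_find, attr_store_last and attr_store_first helpers are not part of HTML::Parser or the Perl API, and the fixed-size array merely stands in for the hash passed as "attr". Attribute names are assumed to be lowercased already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ATTRS 16

/* Hypothetical stand-in for the "attr" hash; not HTML::Parser's real data structure. */
struct attr {
    const char *name;   /* assumed to be lowercased by the caller */
    const char *value;
};

struct attr_table {
    struct attr items[MAX_ATTRS];
    size_t count;
};

/* Return the entry for name, or NULL if that attribute has not been stored. */
struct attr *attr_find(struct attr_table *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->items[i].name, name) == 0)
            return &t->items[i];
    }
    return NULL;
}

/* Append a new entry; returns 1 on success, 0 if the table is full. */
static int attr_append(struct attr_table *t, const char *name, const char *value)
{
    assert(t != NULL && name != NULL);
    if (t->count == MAX_ATTRS)
        return 0;
    t->items[t->count].name = name;
    t->items[t->count].value = value;
    t->count++;
    return 1;
}

/* Last occurrence wins: a repeated attribute overwrites the stored value. */
int attr_store_last(struct attr_table *t, const char *name, const char *value)
{
    struct attr *existing = attr_find(t, name);
    if (existing != NULL) {
        existing->value = value;
        return 1;
    }
    return attr_append(t, name, value);
}

/* First occurrence wins: a repeated attribute is ignored.
 * Returns 1 if the value was stored, 0 if it was skipped or the table is full. */
int attr_store_first(struct attr_table *t, const char *name, const char *value)
{
    if (attr_find(t, name) != NULL)
        return 0;
    return attr_append(t, name, value);
}
```

With the attributes from the example above, storing text=white and then text=black leaves "white" in a table filled through attr_store_first and "black" in one filled through attr_store_last.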
Incidentally, I didn't find any mention of this ambiguity in a quick scan of the HTML 4.01 spec.
Thanks!
Nick Duffek
html-parser@duffek.com
diff -r -u -p HTML-Parser-3.35.orig/hparser.c HTML-Parser-3.35/hparser.c
--- HTML-Parser-3.35.orig/hparser.c 2003-10-27 16:14:24.000000000 -0500
+++ HTML-Parser-3.35/hparser.c 2004-02-27 14:20:59.000000000 -0500
@@ -414,7 +414,8 @@ report_event(PSTATE* p_state,
sv_lower(aTHX_ attrname);
if (argcode == ARG_ATTR) {
- if (!hv_store_ent(hv, attrname, attrval, 0)) {
+ if (hv_exists_ent(hv, attrname, 0) ||
+ !hv_store_ent(hv, attrname, attrval, 0)) {
SvREFCNT_dec(attrval);
}
SvREFCNT_dec(attrname);