Subject: Disambiguate wiki characters from text nodes
The text-node code in 0.21 does not escape literal characters in text nodes that can be confused with wiki markup. For instance, in MediaWiki markup the text "[unintelligable]" in a web page is indistinguishable from a wiki-encoded link. Also, a &lt; in the HTML turns into a literal '<' in the text node, which can leave unintended embedded <tags> in the output text.
So, it looks like the code in _wikify() needs to call a member function to have each wiki format escape special wiki characters in a manner compatible with that wiki's escaping rules.
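To make the suggestion concrete, here is a minimal sketch of such a per-format hook. The package names, the escape_text method, and the wikify_text stand-in are all hypothetical and are not part of HTML::WikiConverter; the sketch only shows the dispatch pattern, with a default that passes text through and a MediaWiki override.

```perl
use strict;
use warnings;

# Hypothetical base class: by default, text passes through unchanged.
package My::Dialect::Base;
sub new { my ($class) = @_; return bless {}, $class }
sub escape_text { my ($self, $text) = @_; return $text }

# Hypothetical MediaWiki dialect: overrides the hook with its own rules.
package My::Dialect::MediaWiki;
our @ISA = ('My::Dialect::Base');
sub escape_text {
    my ($self, $text) = @_;
    $text =~ s/</&lt;/g;      # keep a literal '<' from opening a tag
    $text =~ s/\[/&#91;/g;    # keep a literal '[' from starting a link
    return $text;
}

package main;

# Stand-in for the text-node branch of _wikify(): defer to the dialect.
sub wikify_text {
    my ($dialect, $text) = @_;
    return $dialect->escape_text($text);
}

print wikify_text(My::Dialect::MediaWiki->new, "a < b [note]"), "\n";
```

With this shape, a format that needs no escaping inherits the pass-through default, and each format keeps its own rules in one place.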
I've just kluged a fix directly into WikiConverter.pm on my system, since I only use the MediaWiki format. My current fix looks like this:
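my $output = $node->attr('text');
# Escape '&' only when it starts something that parses as an entity.
$output =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;
# Keep a literal '<' from being read as the start of a tag.
$output =~ s/</&lt;/g;
# Keep a literal '[' from being read as the start of a link.
$output =~ s/\[/&#91;/g;
# Keep doubled apostrophes from being read as italic/bold markup.
$output =~ s/''/&#39;&#39;/g;
return $output;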
} else {
This uses MediaWiki's support for HTML entities to encode some literal characters in the text. I don't want every ampersand turned into "&amp;", so the code escapes a '&' only when it is followed by something that can be parsed as a valid HTML entity. Then it escapes '<' as "&lt;", '[' as "&#91;", and every double apostrophe as "&#39;&#39;". These are the biggest potential problems, in my mind. Other start-of-line characters could be escaped too, if the code were improved to know when the text node was going to be rendered at the start of a line, along with whatever else I've neglected.
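As a rough illustration of the start-of-line idea, the sketch below wraps the same substitutions in a standalone function and adds one extra step. The function name escape_mediawiki_text and the $at_line_start flag are hypothetical; the sketch assumes the caller can somehow tell whether the text node begins a line, which is exactly the piece the current code lacks. The set of leading characters handled (* # : ; =) is an assumption about the most common cases, not a complete list.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::WikiConverter.
# $at_line_start is assumed to be supplied by the caller.
sub escape_mediawiki_text {
    my ($text, $at_line_start) = @_;

    # Escape '&' only where it begins something that parses as an entity.
    $text =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;
    $text =~ s/</&lt;/g;
    $text =~ s/\[/&#91;/g;
    $text =~ s/''/&#39;&#39;/g;

    # At the start of a line, also neutralise one leading list, indent,
    # definition, or heading character by writing it as a numeric entity.
    if ($at_line_start) {
        $text =~ s/\A([*#:;=])/'&#' . ord($1) . ';'/e;
    }
    return $text;
}

print escape_mediawiki_text("* not a list item", 1), "\n";
print escape_mediawiki_text("AT&amp;T's <b> [x]", 0), "\n";
```

The leading-character step runs last, so entities introduced by the earlier substitutions are not escaped a second time.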