Bug #12303 for HTML-WikiConverter: Disambiguate wiki characters from text nodes

Fri Apr 15 15:39:03 2005 Guest - Ticket created

Subject:

Disambiguate wiki characters from text nodes

The text-node code in 0.21 does not escape literal characters returned in text nodes that can be confused with wiki markup. For instance, in MediaWiki markup the text "[unintelligable]" in a web page will be indistinguishable from a wiki-encoded link. Also, a &lt; in the HTML turns into a literal '<' in the text node, and this can result in unintended embedded <tags> ending up in the output text.

So, it looks like the code in _wikify() needs to call a member function to have each wiki format escape special wiki characters in a manner compatible with that wiki's escaping rules. I've just kluged a fix directly into WikiConverter.pm on my system because I only use the MediaWiki format. My current fix looks like this:

      my $output = $node->attr('text');
      $output =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;
      $output =~ s/</&lt;/g;
      $output =~ s/\[/&#91;/g;
      $output =~ s/''/&#39;&#39;/g;
      return $output;
    } else {

This uses MediaWiki's support for HTML entities to encode some literal characters in the text. I don't wish all ampersands to be turned into "&amp;", so the code only escapes the '&' if it is in front of something that can be parsed as a valid HTML entity. Then, it escapes '<' as "&lt;", '[' as "&#91;", and all double-apostrophes as "&#39;&#39;". These are the biggest potential problems, in my mind. Other start-of-line characters could be escaped too, if the code were improved to know when the text node was going to be rendered at the start of a line, as well as whatever else I've neglected.
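The entity-based escaping described in this report can be tried outside the module. The following is a minimal sketch, assuming a standalone helper named escape_mediawiki_text (a hypothetical name, not part of HTML::WikiConverter) that takes the text of one node and returns it with MediaWiki-significant characters encoded as HTML entities. It covers only the four cases the reporter lists and makes no attempt at start-of-line markup.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::WikiConverter): encode characters
# in a text node that MediaWiki would otherwise treat as markup.
sub escape_mediawiki_text {
    my ($text) = @_;

    # Escape '&' only where it begins something that parses as an entity,
    # so a bare ampersand such as the one in "R&D" is left alone.
    $text =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;

    $text =~ s/</&lt;/g;         # a literal '<' must not open a tag
    $text =~ s/\[/&#91;/g;       # '[' must not open a link
    $text =~ s/''/&#39;&#39;/g;  # '' must not start italics or bold

    return $text;
}

print escape_mediawiki_text(q{Put <u> before ''this'' & see [[foo]]}), "\n";
```

The ampersand substitution runs first on purpose: the later substitutions introduce new entities such as "&lt;", and escaping ampersands afterwards would encode those a second time.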

Mon Apr 18 15:13:17 2005 diberri [...] cpan.org - Correspondence added


> The text-node code in 0.21 does not escape literal characters returned
> in text nodes that can be confused with wiki markup. For instance,
> in MediaWiki markup the text "[unintelligable]" in a web page will
> be indistinguishable from wiki-encoded link.

MediaWiki's parsing of bracketed content would prevent your example from being converted into an HTML link; the start of the bracketed content must look like a URL for it to be turned into a link. A more relevant example is "[http://example.org]", which *would* be converted to a link.

But simply substituting "&#91;" for "[" doesn't seem like the right fix; that would yield the awkward "&#91;http://example.org]". IMO, this should be converted to "<nowiki>[http://example.org]</nowiki>".

I've implemented a fix for MediaWiki using preprocess_node (see attached patch). Any bracketed expression that would be recognized by MediaWiki as an ext. link reference (based on [1]) is escaped using <nowiki> and </nowiki>. This will appear in 0.22, but I've attached a patch for 0.21.

> Also, a &lt; in the
> HTML turns into a literal '<' in the text node, and this can result
> in unintended embedded <tags> ending up in the output text.

Can you please provide an example of this? Thanks, David

Index: MediaWiki.pm
===================================================================
RCS file: /usr/local/cvsroot/html2wiki/HTML/WikiConverter/MediaWiki.pm,v
retrieving revision 1.32
diff -u -r1.32 MediaWiki.pm
--- MediaWiki.pm 18 Mar 2005 19:55:33 -0000 1.32
+++ MediaWiki.pm 18 Apr 2005 18:17:08 -0000
@@ -167,6 +167,18 @@
   my $tag = $node->tag || '';
   $pkg->_strip_extra($wc, $node);
   $pkg->_strip_aname($wc, $node) if $tag eq 'a';
+  $pkg->_fix_extlinks_in_text($wc, $node) if $tag eq '~text';
+}
+
+my $URL_PROTOCOLS = 'http|https|ftp|irc|gopher|news|mailto';
+my $EXT_LINK_URL_CLASS = '[^]<>"\\x00-\\x20\\x7F]';
+my $EXT_LINK_TEXT_CLASS = '[^\]\\x00-\\x1F\\x7F]';
+
+sub _fix_extlinks_in_text {
+  my( $pkg, $wc, $node ) = @_;
+  my $text = $node->attr('text') || '';
+  $text =~ s~(\[\b(?:$URL_PROTOCOLS):$EXT_LINK_URL_CLASS+ *$EXT_LINK_TEXT_CLASS*?\])~<nowiki>$1</nowiki>~go;
+  $node->attr( text => $text );
 }
 
 sub _strip_aname {
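The substitution in the patch can be exercised without the rest of the module. The sketch below copies the three pattern strings from the patch and wraps the substitution in a plain function, nowiki_extlinks, which is a hypothetical name chosen here for illustration; the real code operates on a node's 'text' attribute rather than on a bare string.

```perl
use strict;
use warnings;

# Pattern strings copied from the patch so the snippet runs on its own.
my $URL_PROTOCOLS       = 'http|https|ftp|irc|gopher|news|mailto';
my $EXT_LINK_URL_CLASS  = '[^]<>"\\x00-\\x20\\x7F]';
my $EXT_LINK_TEXT_CLASS = '[^\]\\x00-\\x1F\\x7F]';

# Hypothetical standalone wrapper: takes a string instead of a tree node.
sub nowiki_extlinks {
    my ($text) = @_;

    # Wrap any bracketed expression that starts with a known protocol,
    # i.e. anything MediaWiki would render as an external link.
    $text =~ s~(\[\b(?:$URL_PROTOCOLS):$EXT_LINK_URL_CLASS+ *$EXT_LINK_TEXT_CLASS*?\])~<nowiki>$1</nowiki>~g;

    return $text;
}

print nowiki_extlinks('[unintelligable]'), "\n";      # no protocol, left as-is
print nowiki_extlinks('[http://example.org]'), "\n";  # wrapped in <nowiki>
```

Bracketed text without a recognized protocol passes through unchanged, which matches the point made in the reply that only URL-like bracket content becomes a link.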

Mon Apr 18 15:13:18 2005 diberri [...] cpan.org - Status changed from 'new' to 'open'

Mon Apr 18 15:14:40 2005 diberri [...] cpan.org - Taken

Sat Apr 23 12:17:29 2005 Guest - Correspondence added


> > Also, a &lt; in the
> > HTML turns into a literal '<' in the text node, and this can result
> > in unintended embedded <tags> ending up in the output text.

> > Can you please provide an example of this?

Sure, let's say an HTML file is talking about tags:

  To underline, put the &lt;u&gt; tag prior to the text & the &lt;/u&gt; tag after it.

Sat Apr 23 12:27:17 2005 Guest - Correspondence added

[...Sorry, I accidentally pressed TAB followed by a space, which submitted my last comment before it was done...]

The example in my partial reply shows how talking about <u>...</u> tags can turn into actual underlined text. The same thing could happen if the text was talking about an entity:

  To enter a '&lt;' in your input, use "&amp;lt;"

That would change the latter entity into "&lt;", which MediaWiki would then convert into a literal '<'.

Wed Apr 27 14:39:53 2005 Guest - Correspondence added


> MediaWiki's parsing of bracketed content would prevent your example from
> being converted into an HTML link; the start of the bracketed content
> must look like a URL for it to be turned into a link.

Yes, my example was incorrect. However, don't forget about [[foo]] links that could be in the literal HTML data. I want the converter to be able to handle any HTML file, even if it is talking about Wiki formatting in its text, so I also want things like double-apostrophes in the text to be escaped (as mentioned before). If you don't want to add additional escaping, perhaps you would support the ability for the caller to add a call-back function to tweak the text nodes (and have it called before any internal tweaking happens, such as the new link-escaping code for 0.22).

Thu Apr 28 17:17:37 2005 diberri [...] cpan.org - Correspondence added


> Yes, my example was incorrect. However, don't forget about [[foo]]
> links that could be in the literal HTML data. I want the converter to
> be able to handle any HTML file, even if it is talking about Wiki
> formatting in its text, so I also want things like double-apostrophes in
> the text to be escaped (as mentioned before).

I absolutely agree. My concern is how to properly escape these wiki strings; I'm not sure that literal wiki markup in HTML should be replaced with entities. E.g. I don't really want to turn this

  A pair of quotes makes ''italics''.

into this

  A pair of quotes makes &#39;&#39;italics&#39;&#39;.

IMO, that's pretty ugly. Plus, that should really be converted into

  A pair of quotes makes <nowiki>''italics''</nowiki>.

I first thought to add a preprocess_node() step in H::WC::MW that would use Text::Balanced to extract wiki-like markup from the incoming HTML. But unfortunately T::B's extract_delimited() only allows single-character delimiters, so e.g. "''" for italics wouldn't work. And apparently I don't understand extract_tagged() fully yet. (I can post details if you're interested.) I know I could use regexes instead of T::B, but you know, reinventing the wheel and all...

My other idea was to preprocess each text node, and if it contained any wiki-like markup, then the text node would be enveloped in <nowiki> and </nowiki>. Which would turn the above example into

  <nowiki>A pair of quotes makes ''italics''.</nowiki>

It's not as pretty as the T::B solution would be, but it's better than just replacing the quotes with ugly HTML entities, IMO. And that's the direction I'm heading in right now (until I figure out T::B). Currently the <nowiki> tag will be placed around text nodes whose content matches any of these patterns:

  my @wikitext_patterns = (
    qr/''/,
    qr/^(?:\*|\#|\;|\:)/m,
    qr/^----/m,
    qr/^\{\|/m,
    qr/\[\[/m,
    qr/{{/m
  );

The web interface is running this development version, so feel free to try it out [http://diberri.dyndns.org/html2wiki.html].
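The whole-node approach described in this message can be sketched as follows. This is only an illustration under stated assumptions: nowiki_if_wikilike is a hypothetical function name, it works on a plain string rather than a tree node, and the last pattern is written with escaped braces, which is an adjustment made here so that newer Perl versions accept it without complaint. The pattern list is otherwise the one quoted in the message.

```perl
use strict;
use warnings;

# Patterns quoted from the message; braces in the last one are escaped
# here (an adjustment for this sketch, not the module's own spelling).
my @wikitext_patterns = (
    qr/''/,                  # italics or bold
    qr/^(?:\*|\#|\;|\:)/m,   # list and indent markers at line start
    qr/^----/m,              # horizontal rule
    qr/^\{\|/m,              # table start
    qr/\[\[/m,               # internal link
    qr/\{\{/m,               # template
);

# Hypothetical helper: wrap the whole text in <nowiki> if any pattern hits.
sub nowiki_if_wikilike {
    my ($text) = @_;
    for my $re (@wikitext_patterns) {
        return "<nowiki>$text</nowiki>" if $text =~ $re;
    }
    return $text;
}

print nowiki_if_wikilike("A pair of quotes makes ''italics''."), "\n";
print nowiki_if_wikilike("Nothing special here."), "\n";
```

Wrapping the entire node is coarser than escaping only the matched span, which is the trade-off the message itself acknowledges when comparing this with a Text::Balanced solution.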

> If you don't want to add additional escaping, perhaps you would support
> the ability for the caller to add a call-back function to tweak the text
> nodes (and have it called before any internal tweaking happens, such as
> the new link-escaping code for 0.22).

I'd like to support it at the dialect level (i.e. in H::WC::MW), not relying on the client to manually escape things. -- David Iberri

Tue May 24 12:15:58 2005 Guest - Ticket #12944: Ticket created

Subject:

doesn't handle non-breaking space correctly

Running HTML-WikiConverter-0.23 with Perl v5.8.6 on FreeBSD 5.3-RELEASE. In older versions of this module, non-breaking spaces (&nbsp;) passed into the output unmodified; now they get converted into a character of some kind. Here is a test case:

  use HTML::WikiConverter;
  use Test::More tests => 1;
  my $wiki = new HTML::WikiConverter(dialect => "MediaWiki", wrap_in_html => 1);
  is ($wiki->html2wiki("&nbsp;"), "&nbps;", "non-breaking space");

This test fails.

  not ok 1 - non-breaking space
  #     Failed test (./nbsp.pl at line 9)
  #          got: '\uffff'
  #     expected: '&nbps;'
  # Looks like you failed 1 test of 1.

Tue May 24 14:31:07 2005 diberri [...] cpan.org - Ticket #12944: Taken

Tue May 24 14:31:12 2005 diberri [...] cpan.org - Ticket #12944: Status changed from 'new' to 'open'

Tue May 24 14:37:10 2005 diberri [...] cpan.org - Ticket #12944: Broken in 0.21 added

Tue May 24 14:37:10 2005 diberri [...] cpan.org - Ticket #12944: Severity Normal added

Tue May 24 14:37:11 2005 diberri [...] cpan.org - Ticket #12944: Ticket 12944 MergedInto ticket 12303.

Tue May 24 14:46:52 2005 diberri [...] cpan.org - TimeWorked changed from (no value) to '120'

Tue May 24 14:46:53 2005 diberri [...] cpan.org - Fixed in 0.21 added

Tue May 24 14:46:53 2005 diberri [...] cpan.org - Status changed from 'open' to 'resolved'