This queue is for tickets about the HTML-WikiConverter CPAN distribution.

Report information
The Basics
Id: 12303
Status: resolved
Worked: 2 hours (120 min)
Priority: 0/
Queue: HTML-WikiConverter

People
Owner: diberri [...] cpan.org
Requestors: bugzilla [...] blorf.net
edward [...] debian.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.21
Fixed in: 0.21



Subject: Disambiguate wiki characters from text nodes
The text-node code in 0.21 does not escape literal characters returned in text nodes that can be confused with wiki markup. For instance, in MediaWiki markup the text "[unintelligable]" in a web page will be indistinguishable from a wiki-encoded link. Also, a &lt; in the HTML turns into a literal '<' in the text node, and this can result in unintended embedded <tags> ending up in the output text.

So, it looks like the code in _wikify() needs to call a member function to have each wiki format escape special wiki characters in a manner compatible with that wiki's escaping rules. I've just kludged a fix directly into WikiConverter.pm on my system because I only use the MediaWiki format. My current fix looks like this:

    my $output = $node->attr('text');
    $output =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;
    $output =~ s/</&lt;/g;
    $output =~ s/\[/&#91;/g;
    $output =~ s/''/&#39;&#39;/g;
    return $output;
    } else {

This uses MediaWiki's support for HTML entities to encode some literal characters in the text. I don't want all ampersands to be turned into "&amp;", so the code only escapes the '&' if it is in front of something that can be parsed as a valid HTML entity. Then, it escapes '<' as "&lt;", '[' as "&#91;", and all double-apostrophes as "&#39;&#39;". These are the biggest potential problems, in my mind.

Other start-of-line characters could be escaped too, if the code were improved to know when the text node was going to be rendered at the start of a line, as well as whatever else I've neglected.
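The escaping steps described above can be tried outside the module. The following is only a sketch: the function name escape_mediawiki_text and the sample string are hypothetical and not part of HTML::WikiConverter; the four substitutions are the ones from the fix quoted in this report. The ampersand step runs first so that the entities introduced by the later steps are not escaped a second time.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::WikiConverter): escape characters
# in a decoded text node so that MediaWiki renders them literally.
sub escape_mediawiki_text {
    my ($text) = @_;

    # Escape '&' only when it begins something that parses as an entity.
    # This must run before the steps below, which add entities of their own.
    $text =~ s/&(#\d+;|[a-z]+;)/&amp;$1/g;

    $text =~ s/</&lt;/g;               # literal '<' would otherwise start a tag
    $text =~ s/\[/&#91;/g;             # literal '[' could start a link
    $text =~ s/''/&#39;&#39;/g;        # '' would otherwise start italics
    return $text;
}

# Sample input is an assumption made up for illustration.
print escape_mediawiki_text(q{Put the <u> tag before ''text'' & use "&lt;" for [x]}), "\n";
```

Note that a bare '&' followed by a space is left alone, which matches the stated goal of not turning every ampersand into "&amp;".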
> The text-node code in 0.21 does not escape literal characters returned in text nodes that can be confused with wiki markup. For instance, in MediaWiki markup the text "[unintelligable]" in a web page will be indistinguishable from wiki-encoded link.
MediaWiki's parsing of bracketed content would prevent your example from being converted into an HTML link; the start of the bracketed content must look like a URL for it to be turned into a link. A more relevant example is "[http://example.org]", which *would* be converted to a link.

But simply replacing "[" with "&#91;" doesn't seem like the right fix; that would yield the awkward "&#91;http://example.org]". IMO, this should be converted to "<nowiki>[http://example.org]</nowiki>".

I've implemented a fix for MediaWiki using preprocess_node (see attached patch). Any bracketed expression that would be recognized by MediaWiki as an ext. link reference (based on [1]) is escaped using <nowiki> and </nowiki>. This will appear in 0.22, but I've attached a patch for 0.21.
> Also, a &lt; in the HTML turns into a literal '<' in the text node, and this can result in unintended embedded <tags> ending up in the output text.
Can you please provide an example of this? Thanks, David
Index: MediaWiki.pm
===================================================================
RCS file: /usr/local/cvsroot/html2wiki/HTML/WikiConverter/MediaWiki.pm,v
retrieving revision 1.32
diff -u -r1.32 MediaWiki.pm
--- MediaWiki.pm 18 Mar 2005 19:55:33 -0000 1.32
+++ MediaWiki.pm 18 Apr 2005 18:17:08 -0000
@@ -167,6 +167,18 @@
   my $tag = $node->tag || '';
   $pkg->_strip_extra($wc, $node);
   $pkg->_strip_aname($wc, $node) if $tag eq 'a';
+  $pkg->_fix_extlinks_in_text($wc, $node) if $tag eq '~text';
+}
+
+my $URL_PROTOCOLS = 'http|https|ftp|irc|gopher|news|mailto';
+my $EXT_LINK_URL_CLASS = '[^]<>"\\x00-\\x20\\x7F]';
+my $EXT_LINK_TEXT_CLASS = '[^\]\\x00-\\x1F\\x7F]';
+
+sub _fix_extlinks_in_text {
+  my( $pkg, $wc, $node ) = @_;
+  my $text = $node->attr('text') || '';
+  $text =~ s~(\[\b(?:$URL_PROTOCOLS):$EXT_LINK_URL_CLASS+ *$EXT_LINK_TEXT_CLASS*?\])~<nowiki>$1</nowiki>~go;
+  $node->attr( text => $text );
 }
 
 sub _strip_aname {
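To see what the patch's substitution does and does not match, the same pattern can be run on plain strings. This is a sketch only: the wrapper name nowiki_extlinks and the sample inputs are hypothetical, while the three pattern variables and the substitution are copied from the patch. Only bracketed text that starts with one of the listed protocols followed by a colon is wrapped.

```perl
use strict;
use warnings;

# Pattern pieces copied from the patch above.
my $URL_PROTOCOLS       = 'http|https|ftp|irc|gopher|news|mailto';
my $EXT_LINK_URL_CLASS  = '[^]<>"\\x00-\\x20\\x7F]';
my $EXT_LINK_TEXT_CLASS = '[^\]\\x00-\\x1F\\x7F]';

# Hypothetical standalone wrapper: applies the patch's substitution to a
# plain string instead of a tree node's 'text' attribute.
sub nowiki_extlinks {
    my ($text) = @_;
    $text =~ s~(\[\b(?:$URL_PROTOCOLS):$EXT_LINK_URL_CLASS+ *$EXT_LINK_TEXT_CLASS*?\])~<nowiki>$1</nowiki>~go;
    return $text;
}

# Sample inputs are assumptions chosen for illustration.
for my $sample ( '[unintelligable]',
                 '[http://example.org]',
                 '[http://example.org Example site]' ) {
    print nowiki_extlinks($sample), "\n";
}
```

The first sample passes through unchanged because it does not begin with a URL protocol; the other two are wrapped in <nowiki> tags, with or without link text after the URL.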
> > Also, a &lt; in the HTML turns into a literal '<' in the text node, and this can result in unintended embedded <tags> ending up in the output text.
> > Can you please provide an example of this?
Sure, let's say an HTML file is talking about tags:

    To underline, put the &lt;u> tag prior to the text &amp; the &lt;/u> tag after it.
[...Sorry, I accidentally pressed TAB followed by a space, which submitted my last comment before it was done...]

The example in my partial reply shows how talking about <u>...</u> tags can turn into actual underlined text. The same thing could happen if the text was talking about an entity:

    To enter a '&lt;' in your input, use "&amp;lt;"

That would change the latter entity into "&lt;", which MediaWiki would then convert into a literal '<'.
> MediaWiki's parsing of bracketed content would prevent your example from being converted into an HTML link; the start of the bracketed content must look like a URL for it to be turned into a link.
Yes, my example was incorrect. However, don't forget about [[foo]] links that could be in the literal HTML data. I want the converter to be able to handle any HTML file, even if it is talking about Wiki formatting in its text, so I also want things like double-apostrophes in the text to be escaped (as mentioned before). If you don't want to add additional escaping, perhaps you would support the ability for the caller to add a call-back function to tweak the text nodes (and have it called before any internal tweaking happens, such as the new link-escaping code for 0.22).
> Yes, my example was incorrect. However, don't forget about [[foo]] links that could be in the literal HTML data. I want the converter to be able to handle any HTML file, even if it is talking about Wiki formatting in its text, so I also want things like double-apostrophes in the text to be escaped (as mentioned before).
I absolutely agree. My concern is how to properly escape these wiki strings; I'm not sure that literal wiki markup in HTML should be replaced with entities. E.g. I don't really want to turn this

    <p> A pair of quotes makes ''italics''. </p>

into this

    A pair of quotes makes &#39;&#39;italics&#39;&#39;.

IMO, that's pretty ugly. Plus, that should really be converted into

    A pair of quotes makes <nowiki>''italics''</nowiki>.

I first thought to add a preprocess_node() step in H::WC::MW that would use Text::Balanced to extract wiki-like markup from the incoming HTML. But unfortunately T::B's extract_delimited() only allows single-character delimiters so e.g. "''" for italics wouldn't work. And apparently I don't understand extract_tagged() fully yet. (I can post details if you're interested.) I know I could use regexes instead of T::B, but you know, reinventing the wheel and all...

My other idea was to preprocess each text node, and if it contained any wiki-like markup, then the text node would be enveloped in <nowiki> and </nowiki>. Which would turn the above example into

    <nowiki>A pair of quotes makes ''italics''.</nowiki>

It's not as pretty as the T::B solution would be, but it's better than just replacing the quotes with ugly HTML entities, IMO. And that's the direction I'm heading in right now (until I figure out T::B).

Currently the <nowiki> tag will be placed around text nodes whose content matches any of these patterns:

    my @wikitext_patterns = (
      qr/''/,
      qr/^(?:\*|\#|\;|\:)/m,
      qr/^----/m,
      qr/^\{\|/m,
      qr/\[\[/m,
      qr/{{/m
    );

The web interface is running this development version, so feel free to try it out [http://diberri.dyndns.org/html2wiki.html].
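The whole-text-node approach described above can be sketched in a few lines. This is an illustration under assumptions, not the code that shipped: the function name nowiki_if_wikitext is hypothetical, and it works on a plain string rather than on a tree node. The pattern list is the one quoted above, except that the braces in the last pattern are escaped here because newer Perl versions may warn about or reject unescaped left braces in a regex.

```perl
use strict;
use warnings;

# Pattern list from the message above; only the last entry differs
# (braces escaped for the benefit of newer Perl versions).
my @wikitext_patterns = (
    qr/''/,                    # italics/bold quotes
    qr/^(?:\*|\#|\;|\:)/m,     # list and indent markers at start of line
    qr/^----/m,                # horizontal rule
    qr/^\{\|/m,                # table start
    qr/\[\[/m,                 # internal link
    qr/\{\{/m,                 # template
);

# Hypothetical helper: wrap the whole text in <nowiki> if any pattern matches.
sub nowiki_if_wikitext {
    my ($text) = @_;
    for my $pattern (@wikitext_patterns) {
        return "<nowiki>$text</nowiki>" if $text =~ $pattern;
    }
    return $text;
}

print nowiki_if_wikitext(q{A pair of quotes makes ''italics''.}), "\n";
print nowiki_if_wikitext(q{Nothing special here.}), "\n";
```

Wrapping the entire node keeps the logic simple at the cost of coarser output; wrapping only the matched markup would need the kind of delimiter extraction discussed above.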
> If you don't want to add additional escaping, perhaps you would support the ability for the caller to add a call-back function to tweak the text nodes (and have it called before any internal tweaking happens, such as the new link-escaping code for 0.22).
I'd like to support it at the dialect level (i.e. in H::WC::MW), not relying on the client to manually escape things. -- David Iberri
Subject: doesn't handle non-breaking space correctly
Running HTML-WikiConverter-0.23 with Perl v5.8.6 on FreeBSD 5.3-RELEASE. Older versions of this module passed non-breaking spaces (&nbsp;) into the output unmodified; now they get converted into a character of some kind. Here is a test case:

    use HTML::WikiConverter;
    use Test::More tests => 1;

    my $wiki = new HTML::WikiConverter(dialect => "MediaWiki", wrap_in_html => 1);
    is ($wiki->html2wiki("&nbsp;"), "&nbps;", "non-breaking space");

This test fails.

    not ok 1 - non-breaking space
    # Failed test (./nbsp.pl at line 9)
    # got: '\uffff'
    # expected: '&nbps;'
    # Looks like you failed 1 test of 1.