
This queue is for tickets about the Parse-MediaWikiDump CPAN distribution.

Report information
The Basics
Id: 16981
Status: resolved
Priority: 0/
Queue: Parse-MediaWikiDump

People
Owner: Nobody in particular
Requestors: leurent [...] clipper.ens.fr
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.30
Fixed in: (no value)



Subject: MediaWikiDump does not handle #redirect : [[foo]]
MediaWikiDump handles '#redirect [[foo]]' correctly, but does not know about '#redirect : [[foo]]'. The latter form is not described in the Wikipedia help pages, but it is used on some pages and seems to work in Wikipedia, e.g.: http://fr.wikipedia.org/w/index.php?title=Myrtac%C3%A9es&action=edit

I can provide a patch if you like, but I guess you'd rather write one yourself...
Hello,

I've fixed the bug and have a version ready that should solve the problem. Can you please test the version attached to this ticket? If it works for you I will release that version as the next version of Parse::MediaWikiDump.

Thanks for the bug report!

Tyler Riddle

[guest - Mon Jan 9 19:53:07 2006]:
> MediaWikiDump handles '#redirect [[foo]]' correctly, but does not know
> about '#redirect : [[foo]]'. The latter form is not described in the
> Wikipedia help pages, but it is used on some pages and seems to
> work in Wikipedia, e.g.:
> http://fr.wikipedia.org/w/index.php?title=Myrtac%C3%A9es&action=edit
>
> I can provide a patch if you like, but I guess you'd rather write one
> yourself...
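
For readers following the ticket, here is a minimal Perl sketch of a redirect test that tolerates the optional colon. It is not the code from the attached release, which is not shown in this ticket; the helper name redirect_target and the exact pattern are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::MediaWikiDump): returns the
# redirect target, or undef when the text does not start with a redirect.
sub redirect_target {
    my ($text) = @_;

    # '#redirect', then optional whitespace, an optional colon, optional
    # whitespace again, and finally the [[target]] link. Matching is
    # case-insensitive, and the capture stops at ']' or '|'.
    if ($text =~ m/^\s*#redirect\s*:?\s*\[\[([^\]\|]+)/i) {
        return $1;
    }
    return undef;
}

for my $text ('#redirect [[foo]]', '#REDIRECT : [[foo]]', 'plain text') {
    my $target = redirect_target($text);
    print defined($target) ? "redirect to $target\n" : "not a redirect\n";
}
```

Making the colon optional with `:?` keeps the plain form working while also accepting the form reported above.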
Download Parse-MediaWikiDump-0.31.tar.gz
application/x-gzip 16.4k


From: leurent [...] clipper.ens.fr
[TRIDDLE - Tue Jan 10 14:47:32 2006]:
> Hello,
>
> I've fixed the bug and have a version ready that should solve the
> problem. Can you please test the version attached to this ticket?
Yes, it works fine for me.

In the meantime, I found another bug in MediaWikiDump: the title of the article <URL: http://en.wikipedia.org/w/index.php?title=%C2%A0&redirect=no > is a single no-break space, and it is considered to be empty by your module. I think you should use something like [ \t\n\r] instead of \s in the regexp that matches empty nodes [attached is a patch that does this]. (The XML specification defines whitespace as spaces, tabs, and blank lines, and I added \r in case of some strange Windows-Mac-Unix interaction.)

I don't know exactly which titles are valid for MediaWiki, or whether whitespace should be preserved in the title tag, so maybe you need to do something more clever, but this seems to fix empty titles.
diff -r -u Parse-MediaWikiDump-0.31/lib/Parse/MediaWikiDump.pm Parse-MediaWikiDump-0.31.patch/lib/Parse/MediaWikiDump.pm
--- Parse-MediaWikiDump-0.31/lib/Parse/MediaWikiDump.pm 2006-01-10 20:43:18.000000000 +0100
+++ Parse-MediaWikiDump-0.31.patch/lib/Parse/MediaWikiDump.pm 2006-01-11 20:34:43.000000000 +0100
@@ -687,7 +687,7 @@
     }
 
     if ($ignore_ws_only) {
-        return 1 if $chars =~ m/^\s+$/m;
+        return 1 if $chars =~ m/^[ \t\r\n]+$/m;
     }
 
     push(@$buffer, [T_TEXT, \$chars]);
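
The difference between the two patterns in the patch can be checked in isolation. The sketch below is not taken from the module; it assumes the title reaches the handler as a character string with Unicode semantics, which is simulated here with utf8::upgrade. For reference, the XML 1.0 whitespace production S covers exactly space, tab, carriage return, and line feed, so the explicit class in the patch matches it.

```perl
use strict;
use warnings;

# A title consisting of a single no-break space (U+00A0).
my $title = "\x{a0}";

# Assumption: text delivered by the XML parser is a character string, so
# the regex engine applies Unicode rules. utf8::upgrade simulates that.
utf8::upgrade($title);

# Under Unicode rules \s includes the no-break space ...
my $matches_s = ($title =~ m/^\s+$/) ? 1 : 0;

# ... while the explicit class only lists the four XML whitespace characters.
my $matches_class = ($title =~ m/^[ \t\r\n]+$/) ? 1 : 0;

print "\\s matches the title: $matches_s\n";
print "[ \\t\\r\\n] matches the title: $matches_class\n";
```

Under these assumptions the first pattern treats the title as whitespace-only and the second does not, which is the behaviour described in the report.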
> In the meantime, I found another bug in MediaWikiDump: the title of the
> article <URL:
> http://en.wikipedia.org/w/index.php?title=%C2%A0&redirect=no > is a
> single no-break space, and it is considered to be empty by your module.
> I think you should use something like [ \t\n\r] instead of \s in the
> regexp that matches empty nodes [attached is a patch that does this].
> (The XML specification defines whitespace as spaces, tabs, and blank
> lines, and I added \r in case of some strange Windows-Mac-Unix
> interaction).
I'm not sure that is the right fix. It seems to me that if MediaWiki is going to allow a title to consist only of whitespace, it should mark the title node as whitespace preserving, as the text node is:

    <text xml:space="preserve">{{AprilCalendar}}

There was a similar problem with usernames that contain only spaces in bug #16583. I contacted the MediaWiki developers about that and it was a known problem with the software.

I think the proper solution here is like the other bug: a hack to force the title node to be whitespace preserving. This is done just like the last bug fix, by adding a new if condition right under

    if ($$curent[1] eq 'username') {
        $ignore_ws_only = 0;
    }

Just add

    if ($$curent[1] eq 'title') {
        $ignore_ws_only = 0;
    }

Can you do that and let me know if it solves your problem? I'll file a bug report with MediaWiki as well to verify that it is a bug.

Thanks,

Tyler Riddle
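
The per-element approach suggested above can be sketched as a standalone function. This is an illustration only, not the module's char_handler(): the function name keep_chars, the %preserve_ws table, and the calling convention are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical table of elements whose text is kept even when it
# consists only of whitespace.
my %preserve_ws = map { $_ => 1 } qw(username title);

# Hypothetical stand-in for the decision made in the character handler:
# returns 1 when the text chunk should be kept, 0 when it is dropped.
sub keep_chars {
    my ($element, $chars) = @_;

    # Whitespace-only chunks are ignored unless the element is listed above.
    my $ignore_ws_only = $preserve_ws{$element} ? 0 : 1;

    return 0 if $ignore_ws_only && $chars =~ m/^\s+$/;
    return 1;
}

for my $case (['title', ' '], ['comment', ' '], ['comment', 'text']) {
    my ($element, $chars) = @$case;
    printf "%-8s '%s' kept: %d\n", $element, $chars, keep_chars($element, $chars);
}
```

A lookup table keeps the special cases in one place, at the cost of hard-coding element names that the dump format itself does not mark as whitespace preserving.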
From: leurent [...] clipper.ens.fr
[TRIDDLE - Wed Jan 11 15:50:34 2006]:
> I'm not sure that is the right fix. It seems to me that if MediaWiki
> is going to allow whitespace only to exist in a title name it should
> mark the title node as whitespace preserving, as the text node is:
> <text xml:space="preserve">{{AprilCalendar}}

As far as I understand, a no-break space is not whitespace in XML, and I didn't manage to create a page on Wikipedia whose title is only XML whitespace (i.e. space, tab, and newline, as far as I understand).

> Just add
> if ($$curent[1] eq 'title') {
>     $ignore_ws_only = 0;
> }
>
> Can you do that and let me know if it solves your problem?
Yes, this also solves the problem.
Sorry it took so long to get back to you; rt.cpan.org was down for a while, then I got engrossed in another software project.

I used your original suggestion and changed the regex in char_handler(). Attached is version 0.31 of Parse::MediaWikiDump that I intend to release.

Thanks for your help and for letting me know about the right way to solve the problem. Any more bug reports would be appreciated. =)

Tyler Riddle
Download Parse-MediaWikiDump-0.31.tar.gz
application/x-gzip 12.2k
