Subject: Problem with unicode in article names
Date: Sat, 24 Jul 2010 18:12:41 +0400
To: bug-MediaWiki-API [...] rt.cpan.org
From: Nikolay Shaplov <n [...] shaplov.ru>

I am trying to parse the French Wiktionary using MediaWiki::API.
I've run into a problem with Unicode in article names. Here is an example:
use strict;
use MediaWiki::API;

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url}       = 'http://fr.wiktionary.org/w/api.php';
$mw->{config}->{use_http_get}  = 1;
$mw->{config}->{skip_encoding} = 1;

my $articles = $mw->list( {
        action     => 'query',
        list       => 'categorymembers',
        cmtitle    => 'Catégorie:français',
        cmcontinue => 'campisoliennes !Campisoliennes|',
        cmlimit    => 'max' }, { skip_encoding => 1 } )
    || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};
When the listing reaches cmcontinue = "canaux darrosage ! !canaux d’arrosage|", the script fails:
Can't escape \x{2019}, try uri_escape_utf8() instead at MediaWiki/API.pm line 754
Right now this example reproduces the error in one step, but if the wiki maintainers add more
words to the category, the behavior might change.
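
The failure can be isolated from the network. The following is a minimal sketch, assuming only that URI::Escape is installed; the variable name $title is illustrative. It shows uri_escape() dying on a character string that contains U+2019, and uri_escape_utf8() accepting the same string:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8);

# A character string containing U+2019 (RIGHT SINGLE QUOTATION MARK),
# like the title inside the cmcontinue value above.
my $title = "canaux d\x{2019}arrosage";

# uri_escape() only handles code points up to 255, so it croaks here.
my $escaped = eval { uri_escape($title) };
print "uri_escape failed: $@" unless defined $escaped;

# uri_escape_utf8() encodes the string to UTF-8 first,
# then percent-escapes the resulting bytes.
print uri_escape_utf8($title), "\n";    # canaux%20d%E2%80%99arrosage
```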
To work around this problem, I forced the UTF-8 flag off on each value before URL escaping in _make_querystring:
sub _make_querystring {
    my ($ref) = @_;
    print $ref->{cmcontinue}, "\n";    # debug output
    my @qs = ();
    for my $key ( keys %{$ref} ) {
        my $val = $ref->{$key};
        # Drop the UTF-8 flag so uri_escape() sees the raw UTF-8 bytes
        Encode::_utf8_off($val);
        my $keyval = uri_escape($key) . '=' . uri_escape($val);
        push( @qs, $keyval );
    }
    return '?' . join( '&', @qs );
}
With this patch everything works well.
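
Encode documents _utf8_off as an internal function, so the same effect might be achieved with the core utf8:: functions instead. The following is only a sketch under that assumption, not the module's actual code; the name make_querystring_sketch is hypothetical, and the keys are sorted only to make the output deterministic:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Hypothetical stand-in for _make_querystring (illustration only).
sub make_querystring_sketch {
    my ($ref) = @_;
    my @qs;
    for my $key ( sort keys %{$ref} ) {
        my $val = $ref->{$key};
        # Encode only flagged character strings; values that are already
        # byte strings are left alone so they are not encoded twice.
        utf8::encode($val) if utf8::is_utf8($val);
        push @qs, uri_escape($key) . '=' . uri_escape($val);
    }
    return '?' . join( '&', @qs );
}

my $qs = make_querystring_sketch( { cmcontinue => "canaux d\x{2019}arrosage|" } );
print "$qs\n";    # ?cmcontinue=canaux%20d%E2%80%99arrosage%7C
```

Calling uri_escape_utf8() unconditionally would not be equivalent here: with skip_encoding set, values that are already UTF-8 byte strings would be encoded a second time.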