
This queue is for tickets about the Text-Corpus-Summaries-Wikipedia CPAN distribution.

Report information
The Basics
Id: 83324
Status: resolved
Worked: 8 hours (480 min)
Priority: 0/
Queue: Text-Corpus-Summaries-Wikipedia

People
Owner: KUBINA [...] cpan.org
Requestors: juan-manuel.torres [...] univ-avignon.fr
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug REPPORT
Date: Thu, 14 Feb 2013 18:48:44 +0100
To: bug-Text-Corpus-Summaries-Wikipedia [...] rt.cpan.org
From: Juan-Manuel Torres <juan-manuel.torres [...] univ-avignon.fr>
Text::Corpus::Summaries::Wikipedia 0.21
perl 5, version 14, subversion 2 (v5.14.2)
Linux 3.5.0-24-generic, Ubuntu 12.10

BUG:

./Text-Corpus-Summaries-Wikipedia-0.21$ perl scripts/examples_Text_Corpus_Summaries_Wikipedia.pl
Could not extract corresponding English name of the language codes from page 'http://meta.wikimedia.org/wiki/List_of_Wikipedias_by_language_family'; format change my require a new xpath expression.
Could not extract corresponding English name of the language codes from page 'http://meta.wikimedia.org/wiki/Wikipedia_featured_articles'; format change my require a new xpath expression.
No featured article URLs were extracted from the 'en' Wikipedia featured article pages. Perhaps the formatting of the links has changed.

MERCI! THANKS

--
Juan-Manuel TORRES
Responsable TALNE
Laboratoire Informatique d'Avignon / Université d'Avignon
BP 91228, 84911 Avignon Cedex 9, FRANCE
Téléphone : (+33) 04 90 84 35 68
Télécopie : (+33) 04 90 84 35 01

I don't have time to fix the parsing anytime soon, but you can download a zip archive of the summaries and articles from 2011 here:

http://jeffkubina.org/data/summaries/wikipedia_fa.zip

Subject: Re: [rt.cpan.org #83324] Bug REPPORT
Date: Fri, 15 Feb 2013 10:52:41 +0100
To: bug-Text-Corpus-Summaries-Wikipedia [...] rt.cpan.org
From: Juan-Manuel Torres <juan-manuel.torres [...] univ-avignon.fr>
On 14/02/2013 20:48, Jeff Kubina via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=83324 >
>
> I don't have time to fix the parsing anytime soon but you can download a zip
> archive of the summaries and articles from 2011 from here
>
> http://jeffkubina.org/data/summaries/wikipedia_fa.zip

Thank you very much!

A small question: was this corpus extracted from the Wikipedias only in 2011, or from 2011 until now?

Thanks!

--
Juan-Manuel TORRES
Responsable TALNE
Laboratoire Informatique d'Avignon / Université d'Avignon
BP 91228, 84911 Avignon Cedex 9, FRANCE
Téléphone : (+33) 04 90 84 35 68
Télécopie : (+33) 04 90 84 35 01

> A little question: this corpus, from Wikipedias was extracted (only) 
> from the year 2011 or from 2011 to now ?

The data set contains about 12,000 articles created from the featured Wikipedia articles in May 2010 (not from May 2010 to now) from the following 41 languages: Afrikaans, Arabic, Bulgarian, Catalan, Czech, German, Greek, English, Esperanto, Spanish, Basque, Persian, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Georgian, Korean, Malayalam, Marathi, Malay, Dutch, Norwegian, Norwegian, Polish, Portuguese, Romanian, Serbo-Croatian, Slovak, Slovenian, Serbian, Swedish, Thai, Turkish, Vietnamese, and Chinese.

Updated the XPath queries that extract the links to the featured articles, for all supported languages, and the queries for the article pages. Improved the article parsing code. All updates are incorporated into version 0.22. A corpus generated in February 2013 is available at http://jeffkubina.org/data/wfa/.
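
For readers who hit the same "format change may require a new xpath expression" failure in their own scrapers, here is a minimal sketch of one way to soften it: try several candidate XPath expressions in order and fail with a clear message only when none match. This is not the code from version 0.22. The expressions, the sample pages, and the names CANDIDATE_XPATHS and extract_links are all hypothetical. It uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so real Wikipedia HTML would need a tolerant HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate expressions, tried in order (newest layout first).
CANDIDATE_XPATHS = [
    ".//div[@id='featured']//a",   # assumed newer page layout
    ".//table[@class='fa']//a",    # assumed older page layout
]


def extract_links(xhtml, xpaths=CANDIDATE_XPATHS):
    """Return (xpath_used, hrefs) for the first expression that yields links."""
    root = ET.fromstring(xhtml)  # requires well-formed XML/XHTML
    for xpath in xpaths:
        hrefs = [a.get("href") for a in root.findall(xpath) if a.get("href")]
        if hrefs:
            return xpath, hrefs
    # Nothing matched: report it explicitly rather than returning an empty list.
    raise LookupError(
        "no candidate XPath matched; the page format may have changed"
    )


# Two made-up page layouts used only to exercise the fallback.
OLD_PAGE = (
    "<html><body><table class='fa'><tr><td>"
    "<a href='/wiki/A'>A</a></td></tr></table></body></html>"
)
NEW_PAGE = (
    "<html><body><div id='featured'><ul><li>"
    "<a href='/wiki/B'>B</a></li></ul></div></body></html>"
)

if __name__ == "__main__":
    for page in (OLD_PAGE, NEW_PAGE):
        used, links = extract_links(page)
        print(used, links)
```

Returning the expression that matched makes it easy to log when a fallback was used, which is an early sign that a page layout has changed.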