Skip Menu |

This queue is for tickets about the CPAN2RT CPAN distribution.

Report information
The Basics
Id: 80020
Status: resolved
Priority: 0/
Queue: CPAN2RT

People
Owner: Nobody in particular
Requestors: i.norton [...] shadowcat.co.uk
Cc: mst [...] shadowcat.co.uk
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: mst [...] shadowcat.co.uk
Subject: Importing external bug tracker information into rt.cpan.org
Hi Peeps, Please see the attached git diff for the addition of external bug tracking information to the rt.cpan.org database. Commit message as follows: This patch adds the queue Attribute 'DistributionBugtracker' which is populated using data from the MetaCPAN API. A two parse approach is taken, one to set or update the information from MetaCPAN into the RT database and a second that removes the information from queues that no longer require it. It also includes a fix to correctly parse the skip options at the CLI. Please let me know if you require any further information or changes. This work has been sponsored by Shadowcat Systems. Thanks, Ian.
Subject: cpan2rt-external-bugtracker.patch
diff --git a/bin/cpan2rt b/bin/cpan2rt index b334258..9cc172b 100755 --- a/bin/cpan2rt +++ b/bin/cpan2rt @@ -81,7 +81,7 @@ $commands{ $command }->(); sub cmd_update { my %opt = ( sync => 1, force => 0, debug => 0, skip => [] ); GetOptions( \%opt, 'sync!', 'force!', 'debug!', 'home=s', 'datadir=s', 'mirror=s', 'skip=s@' ); - $opt{'skip'} = { map $_ => 1, @{$opt{'skip'}} }; + $opt{'skip'} = { map { $_ => 1 } @{$opt{'skip'}} }; my $importer = CPAN2RT->new( %opt ); $importer->sync_files( $opt{'mirror'} ) if $opt{'sync'}; @@ -89,6 +89,7 @@ sub cmd_update { $importer->sync_distributions( $opt{'force'} ) unless $opt{'skip'}{'distributions'}; $importer->sync_versions( $opt{'force'} ) unless $opt{'skip'}{'versions'}; $importer->sync_maintainers( $opt{'force'} ) unless $opt{'skip'}{'maintainers'}; + $importer->sync_bugtracker( $opt{'force'} ) unless $opt{'skip'}{'bugtrackers'}; } sub usage { diff --git a/lib/CPAN2RT.pm b/lib/CPAN2RT.pm index 1f39c69..4262e4d 100644 --- a/lib/CPAN2RT.pm +++ b/lib/CPAN2RT.pm @@ -150,6 +150,82 @@ sub fetch_file { return 1; } +=head2 fetch_bugtracker + +Retrieve bugtracker information from the meta CPAN API. + +=cut + +sub fetch_bugtracker { + my $self = shift; + + my $url = "http://api.metacpan.org/v0/"; + require MetaCPAN::API; + my $mcpan = MetaCPAN::API->new( + base_url => $url, + ); + + # Pull the details of distribution bugtrackers + my $result = $mcpan->post( + 'release/_search', { + query => { match_all => {} }, + size => 5000, + fields => [ "distribution" , "resources.bugtracker" ], + filter => { + and => [{ + or => [ + { exists => { field => "resources.bugtracker.email" }}, + { exists => { field => "resources.bugtracker.web" }}, + ]}, + { term => { "release.status" => "latest" }}, + { term => { "release.maturity" => "released" }}, + ], + }, + }, + ); + + unless ( defined($result) ) { + print STDERR "Request to '$url' failed.\n"; + return undef; + } + + debug { "Fetched '$url'\n" }; + + my $data = {}; + + # Iterate the results from MetaCPAN + foreach my $dist (@{$result->{hits}->{hits}}) { + my $distribution = $dist->{fields}->{distribution}; + + # Email based alternative + if(defined($dist->{"fields"}->{"resources.bugtracker"}->{"email"})) { + my $email = $dist->{"fields"}->{"resources.bugtracker"}->{"email"}; + + # We don't care if this is rt.cpan.org + if($email =~ m/rt.cpan.org/) { + next; + } + + $data->{$distribution}->{"email"} = $email; + } + + # Web based alternative + if(defined($dist->{"fields"}->{"resources.bugtracker"}->{"web"})) { + my $web = $dist->{"fields"}->{"resources.bugtracker"}->{"web"}; + + # We don't care if this is rt.cpan.org + if($web =~ m/rt.cpan.org/) { + next; + } + + $data->{$distribution}->{"web"} = $web; + } + } + + return $data; +} + + { my $cache; sub authors { my $self = shift; @@ -293,6 +369,106 @@ sub sync_authors { return (1); } +sub sync_bugtracker { + my $self = shift; + + my $data = $self->fetch_bugtracker(); + $self->_sync_bugtracker_cpan2rt({ data => $data }); + + $self->_sync_bugtracker_rt2cpan({ data => $data }); + +} + +=head2 _sync_bugtracker_cpan2rt + +Sync DistributionBugtracker info from CPAN to RT. +This updates and adds to existing queues. + +=cut + +sub _sync_bugtracker_cpan2rt { + my $self = shift; + my $args = shift; + + my $data = $args->{"data"}; + + # Iterate through the ditributions. + foreach my $dist (keys(%{$data})) { + my $text = ""; + + # Build the text to set in the queue attribute. + foreach my $method (keys(%{$data->{$dist}})) { + my $uri = $data->{$dist}->{$method}; + + if( $method eq "email" ) { + $text .= "<div>Please email the <a href=\"mailto:$uri\">alternative bug tracker</a> to report your issue.</div>"; + } + + elsif( $method eq "web" ) { + $text .= "<div>Please visit the <a href=\"$uri\">alternative bug tracker</a> to report your issue.</div>"; + } + } + + # Fetch the queue + my $queue = $self->load_queue( $dist ); + unless( $queue ) { + debug { "No queue for dist '$dist'" }; + next; + } + + my $name = "DistributionBugtracker"; + + # Get the existing attribute from the queue and log if it's changing + my $attr = $queue->FirstAttribute( $name ); + + if(defined($attr)) { + unless($attr->Content eq $text) { + debug { "Changing DistributionBugtracker for $dist from '" . $attr->Content . "' to '$text'\n" }; + } + } + + else { + debug { "Changing DistributionBugtracker for $dist from nothing to '$text'\n" }; + } + + # Set the queue attribute + $queue->SetAttribute( + Name => $name, + Content => $text, + ); + } +} + +=head2 _sync_bugtracker_rt2cpan + +Sync DistributionBugtracker info from RT to CPAN. +This deletes records that are no longer needed or missing in the source. + +=cut + +sub _sync_bugtracker_rt2cpan { + my $self = shift; + my $args = shift; + + my $data = $args->{"data"}; + my $name = "DistributionBugtracker"; + + my $queues = RT::Queues->new( $RT::SystemUser ); + $queues->LimitAttribute( NAME => $name ); + + # Iterate over queues from RT + while(my $queue = $queues->Next()) { + + my $dist = $queue->Name(); + + # Check that the source defines this queue as having an external tracker + unless(defined($data->{$dist})) { + # Delete the attribute, it's no longer needed. + $queue->DeleteAttribute( $name ); + } + } +} + sub sync_distributions { my $self = shift; my $force = shift;
Subject: Re: [rt.cpan.org #80020] Importing external bug tracker information into rt.cpan.org
Date: Fri, 19 Oct 2012 16:22:01 -0700
To: bug-CPAN2RT [...] rt.cpan.org
From: Thomas Sibley <trs [...] bestpractical.com>
On 10/19/2012 04:02 PM, bug-CPAN2RT@rt.cpan.org wrote: Show quoted text
> Please see the attached git diff for the addition of external bug > tracking information to the rt.cpan.org database.
Thanks! Overall this patch and your others for RT-BugTracker and RT-BugTracker-Public look good. I have a few notes below based on the code, and if you could address those, that'd be great. The commit messages are much appreciated. When applying your patches, I'll likely break them up into a few more commits, just FYI. Show quoted text
> +=head2 fetch_bugtracker > + > +Retrieve bugtracker information from the meta CPAN API. > + > +=cut > + > +sub fetch_bugtracker { > + my $self = shift; > + > + my $url = "http://api.metacpan.org/v0/"; > + require MetaCPAN::API; > + my $mcpan = MetaCPAN::API->new( > + base_url => $url,
Adding ua_args => [ agent => 'CPAN2RT' ] here would be great. Show quoted text
> + ); > + > + # Pull the details of distribution bugtrackers > + my $result = $mcpan->post( > + 'release/_search', { > + query => { match_all => {} }, > + size => 5000, > + fields => [ "distribution" , "resources.bugtracker" ], > + filter => { > + and => [{ > + or => [ > + { exists => { field => "resources.bugtracker.email" }}, > + { exists => { field => "resources.bugtracker.web" }}, > + ]}, > + { term => { "release.status" => "latest" }}, > + { term => { "release.maturity" => "released" }}, > + ], > + }, > + }, > + );
1) resources.bugtracker.email is spelled resources.bugtracker.mailto :) 2) Is it feasible to further limit returned results to those where .web or .mailto lacks "rt.cpan.org"? 3) What happens if there are more than 5000 distributions which have a bugtracker defined? There probably needs to be some pagination happening. (Scrolling in ES terms?) Show quoted text
> +=head2 _sync_bugtracker_cpan2rt > + > +Sync DistributionBugtracker info from CPAN to RT. > +This updates and adds to existing queues. > + > +=cut > + > +sub _sync_bugtracker_cpan2rt { > + my $self = shift; > + my $args = shift; > + > + my $data = $args->{"data"}; > + > + # Iterate through the ditributions. > + foreach my $dist (keys(%{$data})) { > + my $text = ""; > + > + # Build the text to set in the queue attribute. > + foreach my $method (keys(%{$data->{$dist}})) { > + my $uri = $data->{$dist}->{$method}; > + > + if( $method eq "email" ) { > + $text .= "<div>Please email the <a href=\"mailto:$uri\">alternative bug tracker</a> to report your issue.</div>"; > + } > + > + elsif( $method eq "web" ) { > + $text .= "<div>Please visit the <a href=\"$uri\">alternative bug tracker</a> to report your issue.</div>"; > + } > + }
The HTML really shouldn't be shoved into the attribute to be spit out later. RT::Attributes can contain arbitrary Perl data structures. The HTML rendering should be in /Dist/Elements/ShowBugtracker. Note that this also opens up rt.cpan.org to XSS vulnerabilities because $uri is unescaped. Show quoted text
> + > + # Fetch the queue > + my $queue = $self->load_queue( $dist ); > + unless( $queue ) { > + debug { "No queue for dist '$dist'" }; > + next; > + } > + > + my $name = "DistributionBugtracker"; > + > + # Get the existing attribute from the queue and log if it's changing > + my $attr = $queue->FirstAttribute( $name );
Use $queue->DistributionBugtracker and $queue->SetDistributionBugtracker since you have those methods available. Show quoted text
> + if(defined($attr)) { > + unless($attr->Content eq $text) { > + debug { "Changing DistributionBugtracker for $dist from '" . $attr->Content . "' to '$text'\n" }; > + } > + } > + > + else { > + debug { "Changing DistributionBugtracker for $dist from nothing to '$text'\n" }; > + }
If the attribute is already the same, we should skip the update. Show quoted text
> + > + # Set the queue attribute > + $queue->SetAttribute( > + Name => $name, > + Content => $text, > + ); > + } > +}
Show quoted text
> Thanks! Overall this patch and your others for RT-BugTracker and > RT-BugTracker-Public look good. I have a few notes below based on the > code, and if you could address those, that'd be great.
Cool, new patch attached here and on #79499. Show quoted text
> > The commit messages are much appreciated. When applying your patches, > I'll likely break them up into a few more commits, just FYI.
Sure. Show quoted text
> 1) resources.bugtracker.email is spelled resources.bugtracker.mailto > :)
Whoops! Corrected, also changed the variable name for consistency. Show quoted text
> 2) Is it feasible to further limit returned results to those where > .web > or .mailto lacks "rt.cpan.org"?
Spoke to the metacpan guys on irc and seemingly it would be expensive to do this server side. Request submitted to have the fields added as full text searchable - https://github.com/CPAN-API/cpan-api/issues/238 following a chat with clintongormley. Once that's done then we can improve this. Show quoted text
> 3) What happens if there are more than 5000 distributions which have a > bugtracker defined? There probably needs to be some pagination > happening. (Scrolling in ES terms?)
Oh, that's a fail. Switched to Elastic::Search as per the example on the meta cpan site, I think this is correct now. Show quoted text
> The HTML really shouldn't be shoved into the attribute to be spit out > later. RT::Attributes can contain arbitrary Perl data structures. > The HTML rendering should be in /Dist/Elements/ShowBugtracker.
Yup, done that. Also switched to using a hash for storage which means we get both the mailto and the web rather than just the last one it sees. Show quoted text
> Note that this also opens up rt.cpan.org to XSS vulnerabilities > because $uri is unescaped.
I've added validation into the SetDistributionBugtracker sub in order to check that both are correct using Email::Address (this seems to be used else where for similar) and URI. Is that good enough or do you have something more specific in mind? Show quoted text
> Use $queue->DistributionBugtracker and $queue-> >SetDistributionBugtracker since you have those methods available.
Fair point, change made and additional validation added to the set function. Show quoted text
> If the attribute is already the same, we should skip the update.
That's done now too. Let me know if there are any further issues. Regards, Ian.
Subject: cpan2rt-external-bugtracker.patch
diff --git a/bin/cpan2rt b/bin/cpan2rt index b334258..9cc172b 100755 --- a/bin/cpan2rt +++ b/bin/cpan2rt @@ -81,7 +81,7 @@ $commands{ $command }->(); sub cmd_update { my %opt = ( sync => 1, force => 0, debug => 0, skip => [] ); GetOptions( \%opt, 'sync!', 'force!', 'debug!', 'home=s', 'datadir=s', 'mirror=s', 'skip=s@' ); - $opt{'skip'} = { map $_ => 1, @{$opt{'skip'}} }; + $opt{'skip'} = { map { $_ => 1 } @{$opt{'skip'}} }; my $importer = CPAN2RT->new( %opt ); $importer->sync_files( $opt{'mirror'} ) if $opt{'sync'}; @@ -89,6 +89,7 @@ sub cmd_update { $importer->sync_distributions( $opt{'force'} ) unless $opt{'skip'}{'distributions'}; $importer->sync_versions( $opt{'force'} ) unless $opt{'skip'}{'versions'}; $importer->sync_maintainers( $opt{'force'} ) unless $opt{'skip'}{'maintainers'}; + $importer->sync_bugtracker( $opt{'force'} ) unless $opt{'skip'}{'bugtrackers'}; } sub usage { diff --git a/lib/CPAN2RT.pm b/lib/CPAN2RT.pm index 1f39c69..53597f2 100644 --- a/lib/CPAN2RT.pm +++ b/lib/CPAN2RT.pm @@ -150,6 +150,72 @@ sub fetch_file { return 1; } +=head2 fetch_bugtracker + +Retrieve bugtracker information from the meta CPAN API. + +=cut + +sub fetch_bugtracker { + my $self = shift; + + require ElasticSearch; + my $es = ElasticSearch->new( + servers => 'api.metacpan.org', + no_refresh => 1, + ); + + # Pull the details of distribution bugtrackers + my $scroller = $es->scrolled_search( + query => { match_all => {} }, + size => 100, + search_type => 'scan', + scroll => '5m', + index => 'v0', + type => 'release', + fields => [ "distribution" , "resources.bugtracker" ], + filter => { + and => [{ + or => [ + { exists => { field => "resources.bugtracker.mailto" }}, + { exists => { field => "resources.bugtracker.web" }}, + ]}, + { term => { "release.status" => "latest" }}, + { term => { "release.maturity" => "released" }}, + ], + }, + ); + + unless ( defined($scroller) ) { + die("Request to api.metacpan.org failed.\n"); + } + + debug { "Fetched data from api.metacpan.org\n" }; + + my $data = {}; + + # Iterate the results from MetaCPAN + while ( my $result = $scroller->next ) { + + # Record data + my $distribution = $result->{"fields"}->{"distribution"}; + my $mailto = $result->{"fields"}->{"resources.bugtracker"}->{"mailto"}; + my $web = $result->{"fields"}->{"resources.bugtracker"}->{"web"}; + + # Email based alternative - we don't care if this is rt.cpan.org + if(defined($mailto) && !($mailto =~ m/rt\.cpan\.org/)) { + $data->{$distribution}->{"mailto"} = $mailto; + } + + # Web based alternative - we don't care if this is rt.cpan.org + if(defined($web) && !($web =~ m/rt\.cpan\.org/)) { + $data->{$distribution}->{"web"} = $web; + } + } + + return $data; +} + { my $cache; sub authors { my $self = shift; @@ -293,6 +359,122 @@ sub sync_authors { return (1); } +sub sync_bugtracker { + my $self = shift; + + my $data = $self->fetch_bugtracker(); + $self->_sync_bugtracker_cpan2rt({ data => $data }); + + $self->_sync_bugtracker_rt2cpan({ data => $data }); + +} + +=head2 _sync_bugtracker_cpan2rt + +Sync DistributionBugtracker info from CPAN to RT. +This updates and adds to existing queues. + +=cut + +sub _sync_bugtracker_cpan2rt { + my $self = shift; + my $args = shift; + + my $data = $args->{"data"}; + + # Iterate through the ditributions. + foreach my $dist (keys(%{$data})) { + my $bugtracker = {}; + + # Build the text to set in the queue attribute. + foreach my $method (keys(%{$data->{$dist}})) { + my $uri = $data->{$dist}->{$method}; + + if( $method eq "mailto" || $method eq "web" ) { + $bugtracker->{$method} = $uri; + } + } + + # Fetch the queue + my $queue = $self->load_queue( $dist ); + unless( $queue ) { + debug { "No queue for dist '$dist'" }; + next; + } + + # Get the existing bugtracker from the queue and log if it's changing + my $attr = $queue->DistributionBugtracker(); + + # Set this if we need to update when we're done + my $update = 0; + + # If the attr is defined, then check it hasn't changed. + if(defined($attr)) { + + debug { "Bugtracker set for distribution '$dist'. Has it changed?\n" }; + + foreach my $method (keys(%{$bugtracker})) { + + if(ref($attr) eq "REF") { + # If this method has changed, log it + if(defined($attr->{$method}) && $attr->{$method} ne $bugtracker->{$method}) { + debug { "Changing DistributionBugtracker for $dist from '" . $attr->{$method} . "' to '" . $bugtracker->{$method} . "'\n" }; + $update = 1; + } + } + + else { + # Hmm, something odd happened. Data in the db is wrong, fix it. + $update = 1; + } + } + } + + else { + debug { "Setting DistributionBugtracker for $dist from nothing\n" }; + $update = 1; + } + + + if($update) { + # Set the queue bugtracker + $queue->SetDistributionBugtracker( $bugtracker ); + } + } + + return 1; +} + +=head2 _sync_bugtracker_rt2cpan + +Sync DistributionBugtracker info from RT to CPAN. +This deletes records that are no longer needed or missing in the source. + +=cut + +sub _sync_bugtracker_rt2cpan { + my $self = shift; + my $args = shift; + + my $data = $args->{"data"}; + my $name = "DistributionBugtracker"; + + my $queues = RT::Queues->new( $RT::SystemUser ); + $queues->LimitAttribute( NAME => $name ); + + # Iterate over queues from RT + while(my $queue = $queues->Next()) { + + my $dist = $queue->Name(); + + # Check that the source defines this queue as having an external tracker + unless(defined($data->{$dist})) { + # Delete the attribute, it's no longer needed. + $queue->DeleteAttribute( $name ); + } + } +} + sub sync_distributions { my $self = shift; my $force = shift;
Subject: Re: [rt.cpan.org #80020] Importing external bug tracker information into rt.cpan.org
Date: Tue, 30 Oct 2012 12:14:07 -0700
To: bug-CPAN2RT [...] rt.cpan.org
From: Thomas Sibley <trs [...] bestpractical.com>
On 10/29/2012 11:19 AM, bug-CPAN2RT@rt.cpan.org wrote: Show quoted text
>> > 2) Is it feasible to further limit returned results to those where >> > .web or .mailto lacks "rt.cpan.org"?
> Spoke to the metacpan guys on irc and seemingly it would be expensive to > do this server side. Request submitted to have the fields added as full > text searchable - https://github.com/CPAN-API/cpan-api/issues/238 > following a chat with clintongormley. Once that's done then we can > improve this.
Ah, thanks. That's too bad we can't limit out of the gate. Show quoted text
> Oh, that's a fail. Switched to Elastic::Search as per the example on > the meta cpan site, I think this is correct now.
Great. Show quoted text
>> > Note that this also opens up rt.cpan.org to XSS vulnerabilities >> > because $uri is unescaped.
> I've added validation into the SetDistributionBugtracker sub in order to > check that both are correct using Email::Address (this seems to be used > else where for similar) and URI. Is that good enough or do you have > something more specific in mind?
++ for the Email::Address and URI validation, but there's another underlying rendering issue: at display time you're including the "|n" flag when interpolating the web/mailto URLs. This tells Mason to not apply it's default HTML escaping, which would allow third-party HTML injection into rt.cpan.org. Removing the |n is sufficient for that. However, beyond just checking if .web parses with URI, we also need to check that it's only a http or https URI. Otherwise you can use other schemes with security problems for a website (data:, javascript:). Otherwise, this round of patches looks good! Thanks a lot for iterating on this. If you have the time, can you fix the security issues above? If not, I'll wait to apply the patches until I have time to fix them as well. Cheers, Thomas
Ian, Sorry for the long delay. I've applied your patches to CPAN2RT, RT-BugTracker, and RT-BugTracker-Public, with some minor tweaks here and there. Each repo has an alternate-bugtracker branch. The branches are on top of the rt4 branches in those extensions, which I'm using as I prepare to upgrade rt.cpan.org to RT 4 from 3.8. Eventually it'll all be merged down. In testing it out, I've noted that all of the alternate bugtracker data is loaded into memory from MetaCPAN. This makes the cpan2rt importer take upwards of 500MB of memory from the MetaCPAN data alone (skipping all other steps). In the current and soon-to-be production environments, that's a no go to be running every two hours. Hopefully down the road chewing through 500MB of RAM won't matter, but right now it does. :) I plan to rework it to be incremental and hence never load the entire dataset into memory, but I'm going to focus on the RT 4 upgrade first and then roll out the alternate bugtracker data after. This is just a status update for you. If you have the cycles to make it incremental, that'd be awesome and it would get rolled out that much sooner. Cheers, Thomas
Resolving. These branches are merged into the rt4 branch, which will be deployed on rt.cpan.org during the RT upgrade. The RT upgrade is currently in beta testing. Thanks for your work, Ian!