

This queue is for tickets about the Perl-Critic CPAN distribution.

Report information
The Basics
Id: 31624
Status: open
Priority: 0
Queue: Perl-Critic

People
Owner: Nobody in particular
Requestors: SHERLOCK [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 1.080
Fixed in: (no value)



Subject: Policy suggestion : identify similar/duplicate code for refactoring
A hallmark of bad code is duplicated, or nearly duplicated, blocks of code within a single distribution; when you inherit a huge swathe of horrible code to maintain, this can be a significant problem. A policy akin to CPD (http://pmd.sourceforge.net/cpd.html) that finds duplicated code would be hugely valuable. Even better would be one that could somehow compare PPI representations of arbitrary chunks of code, both within and across files in a distribution, ignoring the actual variable names and values but still able to detect duplicated logic. It could then suggest possible ways in which the code could be refactored.
From: ELLIOTJS [...] cpan.org
My, you don't think small, do you? Care to submit an implementation of this policy? Yes, this is a good idea, but it'll take a decent amount of effort to do it.
Subject: Re: [rt.cpan.org #31624] Policy suggestion : identify similar/duplicate code for refactoring
Date: Wed, 19 Dec 2007 10:42:22 -0800
To: bug-Perl-Critic [...] rt.cpan.org
From: Gavin Sherlock <sherlock [...] genome.stanford.edu>
I'll try to think about this a little over the next couple of weeks, at least on an algorithmic level.

My initial thought is that you could read in each Perl document, strip comments and whitespace, and then walk the PDOM tree (it is not yet clear to me how to do this), using each subtree (maybe above a certain size) as a hash key and pushing a structure recording various attributes about that subtree onto the value. When you encounter a PDOM subtree with the same structure again, you will know that you already have a key that matches, so you can push a new value onto that hash key recording the attributes of this subtree. Once you've been through all the documents in the distribution, you can iterate over the hash, doing more fine-grained comparisons of the subtrees whose structures were the same, to see if they really are similar.

I have no idea yet whether this is even remotely close to an idea that might work, but if I get time I will see if I can come up with a toy implementation.

On Dec 17, 2007, at 8:02 PM, ELLIOTJS via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=31624 >
>
> My, you don't think small, do you? Care to submit an implementation
> of this policy?
>
> Yes, this is a good idea, but it'll take a decent amount of effort
> to do it.
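The subtree-hashing scheme described above can be sketched in a language-agnostic way. The toy below uses Python with nodes as `(type, children)` tuples; a real implementation would walk PPI's PDOM instead, so the node shape and names here are illustrative assumptions, not PPI's actual API:

```python
from collections import defaultdict

# Toy AST node: (type, children). A real implementation would walk a
# PPI PDOM; this node shape is an assumption for illustration.
def signature(node):
    """Structural key for a subtree: node types only, no names or values."""
    kind, children = node
    return (kind, tuple(signature(c) for c in children))

def size(node):
    """Total number of nodes in the subtree."""
    kind, children = node
    return 1 + sum(size(c) for c in children)

def index_subtrees(node, source, index, min_size=3):
    """Record every subtree of at least min_size nodes under its
    structural signature, so matching shapes from any file collide
    on the same hash key."""
    if size(node) >= min_size:
        index[signature(node)].append((source, node))
    for child in node[1]:
        index_subtrees(child, source, index, min_size)

# Two statements with identical shape but (conceptually) different identifiers:
stmt_a = ('assign', (('var', ()), ('add', (('var', ()), ('num', ())))))
stmt_b = ('assign', (('var', ()), ('add', (('var', ()), ('num', ())))))

index = defaultdict(list)
index_subtrees(('block', (stmt_a,)), 'Foo.pm', index)
index_subtrees(('block', (stmt_b,)), 'Bar.pm', index)

# Keys with more than one entry are candidates for the finer-grained pass.
duplicates = {k: v for k, v in index.items() if len(v) > 1}
```

Keying on structure alone means identically shaped statements from different files land under the same key; the second pass then only has to compare within each colliding group.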
From: ELLIOTJS [...] cpan.org
One thing you cannot do is modify the DOM used by P::C. It is shared across all policies. It would take forever to create a separate DOM object for each. Also, at present, P::C itself only keeps one DOM tree in memory at a time. However, go have a look at the policies in P::C::StricterSubs and P::C::Dynamic for some code that might help you.
Subject: Re: [rt.cpan.org #31624] Policy suggestion : identify similar/duplicate code for refactoring
Date: Thu, 20 Dec 2007 09:03:47 -0800 (PST)
To: bug-Perl-Critic [...] rt.cpan.org
From: Jeffrey Thalhammer <jeffrey_thalhammer [...] yahoo.com>
I think it would probably be better to implement this as a stand-alone tool. We could probably wrap it with a test module, similar to Test::Perl::Critic.

I've been looking at the source code for CPD. From my initial investigation, it appears that it only looks for exact duplicates among blocks of code. So I think changing a variable name in an otherwise duplicate block would constitute "different" code. I'm not yet sure how CPD figures out the appropriate block size (e.g., once it finds a matching line, does it just keep appending lines until they don't match anymore?).

I think we could improve upon this in a couple of ways. First, we can normalize things like variable names (and possibly literal strings) so the comparison will focus on semantic duplication. Second, we could compute a score (e.g., from 1 to 100) that indicates how strong the match is.

There's a lot of literature on finding plagiarism in documents, which is a very similar problem. If you happen to find a particularly useful discussion or algorithm, please send me a link.

-Jeff
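The two improvements suggested here (normalize names and literals, then score the match) could be sketched roughly as follows. This toy uses crude regexes as a stand-in for a real tokenizer and `difflib`'s match ratio as a stand-in for a real similarity measure; both choices are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

def normalize(code):
    """Replace Perl variable names and literals with placeholders so the
    comparison focuses on semantic duplication rather than spelling.
    Regexes are a crude stand-in for a real tokenizer."""
    code = re.sub(r'[\$\@\%]\w+', 'VAR', code)         # sigil variables
    code = re.sub(r'"[^"]*"|\'[^\']*\'', 'STR', code)  # string literals
    code = re.sub(r'\b\d+\b', 'NUM', code)             # numeric literals
    return re.sub(r'\s+', ' ', code).strip()           # collapse whitespace

def similarity(a, b):
    """Score from 0 to 100 indicating how strong the match is."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())

# Identical logic, different variable names: a perfect match once normalized.
a = 'my $total = $price * $qty;'
b = 'my $sum   = $cost  * $n;'
```

With this normalization, renaming a variable in an otherwise duplicate block no longer makes the code look "different", which is exactly the weakness observed in CPD above.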
RT-Send-CC: jeffrey_thalhammer [...] yahoo.com
On 2007-12-20 18:04:13, jeffrey_thalhammer@yahoo.com wrote:
> I've been looking at the source code for CPD. From my initial
> investigation, it appears that it only looks for exact duplicates
> among blocks of code. So I think changing a variable name in an
> otherwise duplicate block would constitute "different" code.
Duplicated code that does not even change variable names is an obvious target for refactoring with low risk (at least lower risk than duplicates with moveable parts). So even detecting that would be useful.

--
Olivier Mengué - http://perlresume.org/DOLMEN
Subject: Re: [rt.cpan.org #31624] Policy suggestion : identify similar/duplicate code for refactoring
Date: Mon, 27 Feb 2012 10:20:14 -0800
To: bug-Perl-Critic [...] rt.cpan.org
From: Jeffrey Thalhammer <jeff [...] imaginative-software.com>
On Feb 27, 2012, at 3:01 AM, Olivier Mengué via RT wrote:
> Code duplicated that does not even change variable names is an obvious
> target for refactoring with low risk (at least lower risk than
> duplicates with moveable parts). So even detecting that would be useful.
Funny you mention that. I was just looking at the code I wrote to do that...

http://perlcritic.tigris.org/svn/perlcritic/trunk/distributions/Perl-Critic-Redundancy

It is only half baked. I think I gave up because I couldn't figure out what the minimum unit of comparison should be (e.g., expression, statement, block, or sub). It might be sufficient to just compute the Levenshtein distance of two subs and warn if the distance is less than N (where N is some configurable value, and 0 would be exact duplicates). Whitespace and comments need to be normalized (or just removed) first.

Searching for duplication is a lot more interesting if you look across files. But this doesn't really fit Perl::Critic's model of file-by-file analysis. So I expected that this would be a separate tool.

You are totally welcome to take a look at the stuff I wrote and try to make it work.

-Jeff
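The Levenshtein-with-threshold idea could be sketched like this. This is a toy in Python, not the actual Perl-Critic-Redundancy code; the sub names and their already-normalized bodies are hypothetical, and real input would first have whitespace and comments stripped as described above:

```python
import itertools

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def near_duplicates(subs, max_distance=5):
    """Report pairs of subs within max_distance of each other
    (0 means exact duplicates). `subs` maps a sub name to its
    normalized body; both are hypothetical here."""
    hits = []
    for (na, a), (nb, b) in itertools.combinations(sorted(subs.items()), 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            hits.append((na, nb, d))
    return hits

subs = {
    'sum_prices': 'my$t=0;$t+=$_ for @_;return $t;',
    'sum_costs':  'my$s=0;$s+=$_ for @_;return $s;',
    'greet':      'print "hello";',
}
hits = near_duplicates(subs)
```

Here `sum_prices` and `sum_costs` differ only in a variable letter, so their distance is small and they are flagged, while `greet` is not. Combining this with the name normalization discussed earlier would drive the distance between renamed-but-identical subs all the way to 0.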