ONJava.com -- The Independent Source for Enterprise Java
oreilly.comSafari Books Online.Conferences.


AddThis Social Bookmark Button

Detecting Duplicate Code with PMD's CPD

by Tom Copeland

The Problem of Duplicated Code

A program with a lot of duplicated code is headed for trouble. Duplicated code means that bugs can appear in several places. It means longer compilation times and bigger binaries. It means major pain for whoever has to go through the program and make changes. Fortunately, there's a little open source utility called CPD--the copy/paste detector--that finds duplicate Java code for you.

How It Happens

It's a rare programmer who would assert that duplicated code is good, so why does it crop up in so many places?

  1. When people get in a hurry, they copy and paste rather than refactor code into methods. It's easy to get into the mindset of duplicating code rather than taking the time to a) move it into a method, b) add a parameter to the method, c) update the original code to use the new method, and d) test the original code. It seems so much easier simply to Ctrl-C/Ctrl-V that code into your function and move ahead. This produces a short-term gain, but a long-term loss, when you consider that any bugs in that first clump of code are now in the new code. Since it's usually the complicated bits of code that are copied and pasted, they're more likely to contain bugs.

  2. When two projects merge, sometimes the programmers have written the same code. This is especially true if both use servlets, JavaMail, or another common API. Both project teams will have written very similar utilities to parse parameters out of the HttpServletRequest, to resolve EJBs, or to build SQL statements. This results in programmars being hesitant to touch another's code since "Bill uses that weird mungeQuotedSQLString method."

  3. 0
  4. In large projects, over time, people forget what's already been written. Maybe they just don't know that other people are working on similar code. This is especially true if there's high turnover. I've fallen prey to this myself; I've written nifty little EJB utilities and then poked around the com.mycompany.util package tree a bit more only to find something similar. This sort of duplication can be reduced by good communication, but sometimes we all forget what's out there.

How To Fix It

We've discussed duplicated code, why it's bad, and how it gets into a project. Now that we know the evil, how do we root it out?

  1. Manual inspection is not the best answer. This is especially true on a large source tree, where no one person can continually scan the code for duplication. An active codebase, where people make daily changes to many different classes, is even harder. Pair programming improves the situation by increasing communication between programmers, but duplicate code can still grow over time. Even keeping a list of duplicates may not work: they'll probably be forgotten before anyone gets around to cleaning them up.

  2. Let the computer do the tedious work. It would be useful to have a tool that could trawl through a source tree and find duplicated chunks of code. You could run this tool nightly on a server or a powerful workstation and have a report mailed to you each morning or placed on an internal web server.

You saw it coming--yup, CPD, the Copy/Paste Detector, is such a tool. CPD is a utility that finds duplicated code in Java programs. It's bundled with PMD, and you can also can run it via Java Web Start from here. For a sample of the sort of thing CPD can unearth, here are some duplicates it found in the JDK 1.4 source code. Yikes!

Installing CPD

Simply download the latest binary release of PMD and unzip it into any directory. If you already have PMD installed somewhere on your system, you're all set.

Running CPD

Run the pmd/etc/cpdgui.bat script and you'll see CPD's main GUI (Figure 1). Select a source directory, click "Go," and CPD will start analyzing your code.

Figure 1: The main CPD GUI

Note that you can also run CPD from a command prompt. It's usuable in scripts, and runs much faster, since it's just crunching data and not updating a GUI's JProgressBars. To do this, you need to specify a tile size--that is, the minimum size of a duplicated chunk of code--and a directory tree to analyze. Here's an example of running CPD from a command prompt:

$ java -cp pmd-1.03.jar net.sourceforge.pmd.cpd.CPD 100 \

Be warned: CPD is very, very slow. Analyzing the popular Ant project's 46,400 lines of source code took nearly two hours to process on a 1GHz Pentium III. It also pegged the CPU the entire time. So you may want to kick it off right before you leave work, or run it on a small set of files--a single package, maybe, or a small project.

How It Works

CPD follows a couple of steps to find duplicate chunks of code: it tokenizes the source code, it builds an occurrence table, and it scans the table for duplicates. It sounds simple, so let's follow a source code fragment through the duplication detection process. Here's the code snippet:


The first step is to tokenize each source code file. CPD does this by piggybacking on PMD's JavaCC-generated tokenizer. The tokenizer reads each file and converts the characters into tokens. For example, System.out.println produces five separate tokens: System, ., out, ., and println. Along the way, the tokenizer discards whitespace and some other unneeded tokens like import statements, package statements, and semicolons. This reduces the number of tokens that need to be scanned, and it gets rid of uninteresting duplicate chunks like duplicate sequences of import statements. Our source code snippet is now tokenized and looks like this:

System . out . println ( " Hello " ) System . out . println ( " World " )

CPD next builds an occurrence table; that is, a list of tokens and their locations. This is something like a frequency table, but with more detail--instead of just how many times each token occurs, CPD also needs to track where each token occurs. For example, the token println might occur in several hundred places in a set of source code files. Here's the occurrence table for our code:

Tile Occurrences
System1, 11
.2, 4, 12, 14
out3, 13
println5, 15
(6, 16
"8, 10, 18, 20
)9, 21

Note that there's only one row for tokens that occur in several places. For example, the println token appears twice in the original code sample, but only once in the occurrence table.

CPD then scans the occurrence table and expands each token into a tile--a sequence of tokens--by attaching it to the token next to it. Since System is a token, it gets expanded into a tile of System..

Tile Occurrences
System.1, 11
.println4, 14
("6, 16
")10, 20

Where did out go? We discarded it for now. When we try to expand each token, we first ensure that we're not expanding into another tile. When we expanded the out token, we tried to latch on the . token in positions 4 and 14. These . tokens were already in use by other tiles, so we discarded the tile. Let's expand the tiles again:

Tile Occurrences
System.out1, 11
.println(4, 14

Our table is even smaller, because we removed some tiles that were already covered by other tiles. We dropped the last tile, "), because it was at the end of our token list.

After several more expansions, we end up with an occurrence table containing one tile:

Tile Occurrences
System.out.println("1, 11

That's our duplicate code snippet, and we know exactly where it occurs. We're done!

I've omitted some of the nitty gritty details. For example, CPD also keeps track of line numbers so the report is more useful. CPD also accepts a lower bound on the duplicates that it finds, so you don't end up getting duplicate chunks that are only 20 tokens long. I usually scan for duplicates longer than 100 tokens and then go smaller once the big problems have been cleaned up.

Future Efforts

CPD has plenty of room for improvement; here are some of the things we're looking at:

  • I've written a distributed version of CPD that uses JavaSpaces to spread the scanning load over several machines. This distributed version works, but it could use some improvement in memory usage.
  • CPD could be adapted to work with C, C++, PHP, Ruby, Perl, or any other language for which a tokenizer exists. There could be a runtime toggle to select which language to parse.
  • There are various ways to make the token set smaller and the duplicate chunks more interesting. Not counting individual variable names as differences is one of these ways.


We've examined duplicate code as a problem, talked about how it gets into our programs, and discussed some ways to get rid of it. We've looked at the open source utility CPD, the Copy/Paste Detector, and seen how to install and run it, and we've gone through an overview of how CPD finds duplicate code. Get CPD scanning your code and see what it finds!


David Dixon-Peugh created this algorithm--he put it together on a white board in about an hour. It's been improved in various ways since then; most changes were inspired by several papers published by Michael Wise in December 1993. The CPD algorithm is decidedly inferior to Dr. Wise's algorithms both in execution time and memory usage; hopefully, someone smarter than I am will read this article and optimize the algorithm.


Tom Copeland started programming on a TRS-80 Model III, but demand for that skill has waned and he now programs mostly in Java and Ruby.

Return to ONJava.com.