Using Lucene to Search Java Source Code

by Renuka Sindhgatta
01/18/2006

Several websites allow the software development community to share information by publishing developer guidelines, white papers, FAQs, and source code. As the information content grows, with several developers contributing to the repository, websites provide search engines to search all of the information present on the site. While search engines do a very good job of retrieving text documents, they severely constrain a developer searching source code. Search engines consider source code files to be plain text documents, and are hence no different from a sophisticated grep tool capable of handling large source files.

In this article I propose the approach of using Lucene, the Java-based open source search engine, to search source code by extracting and indexing relevant source code elements. I restrict the search to Java source code only. However, extending the search to any other programming language's source code should not be very different.

The article gives you a brief overview of the important aspects of a search engine in the context of Lucene. For more detailed information, refer to the Resources section.

Conceptual Overview

Lucene is one of the most popular open source search engine libraries. Lucene consists of core APIs that allow indexing and searching of text. Given a set of text files, Lucene can create indexes and allow you to search those indexes with complex queries such as +title:Lucene -content:Search, search AND Lucene, and +search +code. Before going into the details of searching, let me introduce you to some of the functional aspects of Lucene.
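
To make this concrete, here is a minimal sketch of indexing and searching with Lucene, written against the Lucene 1.4-era API used throughout this article (the field names and text are made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        // Build an in-memory index with a single two-field document
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("title", "Searching Java source with Lucene"));
        doc.add(Field.Text("content", "how to search code with Lucene"));
        writer.addDocument(doc);
        writer.close();

        // Parse one of the queries above against the default "content" field
        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = QueryParser.parse("+search +code", "content",
                new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println("Matches: " + hits.length()); // prints: Matches: 1
        searcher.close();
    }
}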

Indexing Text in Lucene

Search engines scan all of the data that needs to be searched and store it in a structure that allows efficient retrieval. The most well-known structure is called the inverted index. For example, consider indexing a set of conference proceedings. First, each paper or document of the proceedings is considered to have distinct parts, or fields, such as title, author, email, abstract, and content. The text of each field is tokenized, and keywords or terms are extracted. The inverted index for the proceedings will be stored as shown in the following table.

Field      Term            Document Frequency   Documents
Author     Erik Hatcher    1                    Doc1
           Bill Inmon      1                    Doc6
           ...
Email      erik@abc.com    1                    Doc1
           xyz@mit.edu     1                    Doc15
           ...
Abstract   lucene          1                    Doc1
           j2ee            2                    Doc6, Doc7
           system          5                    Doc1, Doc2, Doc6, Doc15, Doc18
           ...

For each term in a field, the number of documents it occurs in (the document frequency) and the IDs of the documents in which it occurs are stored. Other details are stored for each term as well: the number of times it occurs in each document and the positions where it occurs. However, all we need to know is that indexing a text file for Lucene means storing it in a format that allows for efficient querying and retrieval.
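
As a rough illustration of the structure (a toy sketch only; Lucene's real index format is far more compact and also records term frequencies and positions), an inverted index can be modeled as a map from each field to a map from terms to posting lists:

import java.util.*;

public class ToyInvertedIndex {
    // field -> (term -> set of IDs of documents containing the term)
    private final Map index = new HashMap();

    public void add(String field, String term, String docId) {
        Map terms = (Map) index.get(field);
        if (terms == null) { terms = new HashMap(); index.put(field, terms); }
        Set postings = (Set) terms.get(term);
        if (postings == null) { postings = new TreeSet(); terms.put(term, postings); }
        postings.add(docId);
    }

    // Returns the posting list; its size is the document frequency
    public Set lookup(String field, String term) {
        Map terms = (Map) index.get(field);
        if (terms == null) return Collections.EMPTY_SET;
        Set postings = (Set) terms.get(term);
        return postings == null ? Collections.EMPTY_SET : postings;
    }
}

With the proceedings index above, lookup("abstract", "system") would return the five document IDs Doc1, Doc2, Doc6, Doc15, and Doc18.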

Analyzing Text to Be Indexed

Lucene processes the text to be indexed with analyzers. Analyzers are used to tokenize text, extract relevant words, discard common words, stem the words (reduce them to their root form, meaning that bowling, bowler, and bowls are reduced to bowl), and perform any other desired processing before storing it into the index. The common analyzers provided by Lucene are:

- WhitespaceAnalyzer: separates tokens at whitespace.
- SimpleAnalyzer: separates tokens at non-letter characters and lower-cases them.
- StopAnalyzer: in addition to what SimpleAnalyzer does, removes common English stop words.
- StandardAnalyzer: tokenizes using a grammar that recognizes constructs such as email addresses and acronyms, lower-cases tokens, and removes stop words.
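
The following sketch shows how an analyzer is exercised directly, using the Lucene 1.4-era TokenStream API. Note that StandardAnalyzer lower-cases and removes stop words but does not stem; stemming requires a filter such as the PorterStemFilter used later in this article:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("content",
                new StringReader("The Bowler was bowling"));
        // Prints "bowler" and "bowling": stop words are gone, case is folded
        for (Token token = stream.next(); token != null; token = stream.next())
            System.out.println(token.termText());
    }
}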

Searching Indexes

Once created, indexes can be searched by providing complex queries specifying the field and the term that needs to be searched. For example, the user query abstract:system AND email:abc@mit.edu will result in all documents that contain system in the abstract and have the email address abc@mit.edu. Hence, a search on the sample index of the earlier table would return Doc15. Documents that match the query are ranked based on the number of times the terms occur in the document and the number of documents that contain the terms. Lucene implements a ranking mechanism and gives us the flexibility to modify it if required.
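
A query like abstract:system AND email:abc@mit.edu can also be built programmatically rather than parsed from a string. A sketch against the Lucene 1.4-era BooleanQuery API:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BuildQuery {
    // Equivalent of the query string abstract:system AND email:abc@mit.edu
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        // add(clause, required, prohibited): AND means both clauses required
        query.add(new TermQuery(new Term("abstract", "system")), true, false);
        query.add(new TermQuery(new Term("email", "abc@mit.edu")), true, false);
        return query;
    }
}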

Source Code Search Engine

Now that we understand the basics of a search engine, let us look at the features a search engine for source code should support. In the context of searching for relevant example code in Java, a developer is typically interested in identifying Java classes that:

- extend a particular class or implement a particular interface
- use specific classes or call specific methods
- have methods with a particular name, parameter type, or return type
- import specific packages

A combination of the above helps a developer identify the relevant code he or she is looking for, so the search engine should allow querying on one or a combination of these aspects. IDEs share a related limitation: most of the available tools support searching source code on just one of the above criteria, and do not give the developer the flexibility of combining criteria in a search.

Let's get started on building a search engine for source code, one that supports these requirements!

Writing a Source Code Analyzer

The first step is to write an analyzer to extract or discard source code elements and ensure that the indexes created are optimal, containing only the relevant aspects of code. In Java, the language keywords--public, null, for, if, etc.--need to be discarded, since they appear in almost every .java file. They are similar to the common words of the English language (the, a, an, of). An analyzer needs to discard these keywords from the indexes.

A Java source code analyzer is built by extending Lucene's abstract Analyzer class. The following listing shows the JavaSourceCodeAnalyzer, which implements the tokenStream(String, Reader) method. The JavaSourceCodeAnalyzer defines the sets of stop words that can be discarded in the process of indexing, using a StopFilter provided by Lucene. The tokenStream method checks the field that is being indexed. If the field is a comment, it first tokenizes and lower-cases the input using the LowerCaseTokenizer, eliminates a limited set of English stop words using the StopFilter, and uses the PorterStemFilter to remove common morphological and inflectional endings. If the content to be indexed is not a comment, the analyzer tokenizes and lower-cases the input using the LowerCaseTokenizer and eliminates the Java keywords using the StopFilter.

package com.infosys.lucene.code;

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.*;

public class JavaSourceCodeAnalyzer extends Analyzer {
    private Set javaStopSet;
    private Set englishStopSet;
    private static final String[] JAVA_STOP_WORDS = {
        "public", "private", "protected", "interface",
        "abstract", "implements", "extends", "null", "new",
        "switch", "case", "default", "synchronized",
        "do", "if", "else", "break", "continue", "this",
        "assert", "for", "instanceof", "transient",
        "final", "static", "void", "catch", "try",
        "throws", "throw", "class", "finally", "return",
        "const", "native", "super", "while", "import",
        "package", "true", "false" };
    private static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but",
        "by", "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with" };

    public JavaSourceCodeAnalyzer() {
        super();
        javaStopSet = StopFilter.makeStopSet(JAVA_STOP_WORDS);
        englishStopSet = StopFilter.makeStopSet(ENGLISH_STOP_WORDS);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Comments get English stop-word removal plus Porter stemming;
        // all other fields get Java keyword removal.
        if (fieldName.equals("comment"))
            return new PorterStemFilter(new StopFilter(
                    new LowerCaseTokenizer(reader), englishStopSet));
        else
            return new StopFilter(
                    new LowerCaseTokenizer(reader), javaStopSet);
    }
}
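
A quick way to sanity-check the analyzer is to run the same snippet through both processing chains and print the resulting tokens. The following test harness is a hypothetical addition (not part of the indexer itself), written against the Lucene 1.4-era TokenStream API:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerCheck {
    public static void main(String[] args) throws Exception {
        JavaSourceCodeAnalyzer analyzer = new JavaSourceCodeAnalyzer();
        String text = "public void paint() // painting the cells";
        String[] fields = { "code", "comment" };
        for (int i = 0; i < fields.length; i++) {
            System.out.print(fields[i] + ": ");
            TokenStream stream = analyzer.tokenStream(fields[i],
                    new StringReader(text));
            for (Token t = stream.next(); t != null; t = stream.next())
                System.out.print(t.termText() + " ");
            System.out.println();
        }
    }
}

The code chain drops the Java keywords public and void but keeps the (it is not a Java keyword), while the comment chain drops the and stems painting and cells to paint and cell.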

Writing a JavaSourceCode Indexer

The next step is to create indexes. The important classes used to build indexes are IndexWriter, Analyzer, Document, and Field. For each source code file, a Lucene Document is created. The source code file is parsed and the relevant syntactic elements of the code are extracted: import declarations, the class name, the class it extends, the interfaces it implements, the methods implemented, the method parameters used, and the code for each of the methods. These syntactic elements are added to distinct Fields of the Document. The Document is added to the index using the IndexWriter, which stores the indexes.

The following listing shows the JavaSourceCodeIndexer. It uses the JavaParser class, built on the Eclipse 3.0 ASTParser, to parse a Java file and extract syntactic elements. I will not go into the details of the JavaParser, as any parser could be used to extract the relevant source code elements. After extracting the elements of the source code file, a Field is created and added to the Document.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import com.infosys.lucene.code.JavaParser.*;

public class JavaSourceCodeIndexer {

    private static JavaParser parser = new JavaParser();
    private static final String IMPLEMENTS = "implements";
    private static final String IMPORT = "import";
    ...  // remaining field-name constants (FILENAME, CLASS, EXTENDS, METHOD,
         // RETURN, PARAMETER, CODE) and the addComments() helper are elided

    public static void main(String[] args) throws IOException {
        File indexDir = new File("C:\\Lucene\\Java");
        File dataDir = new File("C:\\JavaSourceCode");
        IndexWriter writer = new IndexWriter(indexDir,
                new JavaSourceCodeAnalyzer(), true);
        indexDirectory(writer, dataDir);
        writer.close();
    }

    public static void indexDirectory(IndexWriter writer, File dir)
            throws IOException {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            // Create a Lucene Document per source file
            Document doc = new Document();
            // Use JavaParser to parse the file
            parser.setSource(f);
            addImportDeclarations(doc, parser);
            addComments(doc, parser);
            // Extract class elements using the parser
            JClass cls = parser.getDeclaredClass();
            addClass(doc, cls);
            // Add the file name as a stored-only field
            doc.add(Field.UnIndexed(FILENAME, f.getName()));
            writer.addDocument(doc);
        }
    }

    private static void addClass(Document doc, JClass cls) {
        // For each class, add a class name field
        doc.add(Field.Text(CLASS, cls.className));
        String superCls = cls.superClass;
        if (superCls != null)
            // Add the class it extends as the "extends" field
            doc.add(Field.Text(EXTENDS, superCls));
        // Add the interfaces it implements
        ArrayList interfaces = cls.interfaces;
        for (int i = 0; i < interfaces.size(); i++)
            doc.add(Field.Text(IMPLEMENTS, (String) interfaces.get(i)));
        // Add details on declared methods
        addMethods(cls, doc);
        // Recurse into inner classes
        ArrayList innerCls = cls.innerClasses;
        for (int i = 0; i < innerCls.size(); i++)
            addClass(doc, (JClass) innerCls.get(i));
    }

    private static void addMethods(JClass cls, Document doc) {
        ArrayList methods = cls.methodDeclarations;
        for (int i = 0; i < methods.size(); i++) {
            JMethod method = (JMethod) methods.get(i);
            // Add method name and return type fields
            doc.add(Field.Text(METHOD, method.methodName));
            doc.add(Field.Text(RETURN, method.returnType));
            // For each method, add its parameter types
            ArrayList params = method.parameters;
            for (int k = 0; k < params.size(); k++)
                doc.add(Field.Text(PARAMETER, (String) params.get(k)));
            // Add the method's code block, indexed but not stored
            String code = method.codeBlock;
            if (code != null)
                doc.add(Field.UnStored(CODE, code));
        }
    }

    private static void addImportDeclarations(Document doc, JavaParser parser) {
        ArrayList imports = parser.getImportDeclarations();
        if (imports == null) return;
        for (int i = 0; i < imports.size(); i++)
            // Add import declarations as keywords, stored as-is
            doc.add(Field.Keyword(IMPORT, (String) imports.get(i)));
    }
}

Lucene has four different types of fields, which can be specified for optimal index creation: Keyword, UnIndexed, UnStored, and Text. The following table shows the field type chosen for each source code element:

Source Code Element     Field Type
Class Name              Text
Import Declarations     Keyword
Method Name             Text
Method Block (Code)     UnStored
File Name               UnIndexed
Method Parameter Type   Text
Return Type             Text
Comments                UnStored
Extends Class           Text
Implements              Text
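
The distinction among these types shows up directly in how a Document is assembled. As a rough sketch (using the Lucene 1.4-era Field factory methods, with field names as in the listings above):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldTypeSketch {
    public static Document sketch(String imp, String fileName,
                                  String code, String className) {
        Document doc = new Document();
        doc.add(Field.Keyword("import", imp));          // stored, indexed, not tokenized
        doc.add(Field.UnIndexed("filename", fileName)); // stored only, not searchable
        doc.add(Field.UnStored("code", code));          // indexed and tokenized, not stored
        doc.add(Field.Text("class", className));        // stored, indexed, and tokenized
        return doc;
    }
}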

The indexes created by Lucene can be viewed and modified using Luke, a useful open source tool for understanding indexes. A snapshot in Luke of the indexes created by the JavaSourceCodeIndexer is shown in Figure 1.

Figure 1. Snapshot of indexes in Luke

As you can see, the import declarations are stored as is, without tokenizing or analyzing. The class names and method names are converted to lower case and stored.

Querying Java Source Code

Once the multi-field indexes have been created, they can be queried with Lucene. Lucene provides IndexSearcher and QueryParser, the key classes for searching documents. QueryParser is used to parse a query expression entered by the user, while IndexSearcher searches for the query terms in the documents. The following table shows some possible queries and the meaning of each:

Query Expression                   Matches a Java class that...
extends:ViewPart code:Table        extends ViewPart and/or uses the Table class in its code
code:Document +import:com.w3c.*    uses Document in the code and definitely has com.w3c in an import declaration
parameter:JGraph                   uses JGraph as a parameter in a method
parameter:JGraph code:Cell         uses JGraph as a parameter and/or Cell in the code
method:paint -class:Color          contains a method named paint but does not have Color as the class name
method:paint +parameter:Graphics   contains a method named paint and definitely has Graphics among its method parameters

Indexing different syntactic elements allows the user to form specific queries and search code. The sample code used for searching is shown in the following listing.

import java.io.File;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class JavaCodeSearch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        String q = args[1]; // e.g., parameter:JGraph code:insert
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // Use the source code analyzer for all fields except "import",
        // which is indexed as a keyword and must not be tokenized
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
        analyzer.addAnalyzer("import", new KeywordAnalyzer());

        Query query = QueryParser.parse(q, "code", analyzer);
        long start = System.currentTimeMillis();
        Hits hits = is.search(query);
        long end = System.currentTimeMillis();
        System.err.println("Found " + hits.length() +
                " docs in " + (end - start) + " millisec");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename")
                    + " with a score of " + hits.score(i));
        }
        is.close();
    }
}

IndexSearcher uses FSDirectory to open the directory containing the indexes. The search query string is analyzed by an Analyzer to ensure that the query is in the same form as the index (stemmed, lower-cased, filtered, etc.). Lucene has a limitation when a Field has been indexed as a keyword: QueryParser analyzes every field of the query with the single Analyzer passed to it, which would wrongly tokenize keyword fields. To overcome this issue, the PerFieldAnalyzerWrapper provided by Lucene can be used to specify the analysis required for each field in the query. Hence, the query string import:org.w3c.* AND code:Document will use the KeywordAnalyzer to parse org.w3c.* and the JavaSourceCodeAnalyzer to parse Document. QueryParser, using the default code field if the query does not refer to a field, analyzes the query string with the PerFieldAnalyzerWrapper and returns the analyzed Query. The IndexSearcher executes the Query and returns a Hits object containing the documents that satisfy it.
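
For example, parsing a query with the wrapper keeps the import clause intact as a single term while the code clause is analyzed normally. A short sketch, reusing the classes from the listing above (the printed form is indicative):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldDemo {
    public static void main(String[] args) throws Exception {
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
        analyzer.addAnalyzer("import", new KeywordAnalyzer());
        Query query = QueryParser.parse(
                "import:org.w3c.dom.Document AND code:Document",
                "code", analyzer);
        // The import term survives untokenized; the code term is lower-cased:
        // +import:org.w3c.dom.Document +document
        System.out.println(query.toString("code"));
    }
}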

Conclusion

This article shows how Lucene, a text search engine, can be used to search source code by adding an analyzer and a multi-field index. While the article introduces the basic functionality of a code search engine, building more sophisticated analyzers for source code could improve the querying capability and search results. Such search engines would allow users across the software development community to search and share source code.

Resources

Renuka Sindhgatta currently works as a senior technical architect in the Software Engineering and Technology labs of Infosys Technologies Limited, Bangalore, India.

