ONJava.com -- The Independent Source for Enterprise Java

Using Lucene to Search Java Source Code

Querying Java Source Code

Once the multi-field index has been created, Lucene can be used to query it. The two key classes for searching are IndexSearcher and QueryParser: QueryParser parses the query expression entered by the user, and IndexSearcher searches the indexed documents for the query terms. The following table shows some possible queries and what each one matches:

Query Expression                      Matches Java classes that ...
extends:ViewPart code:Table           Extend the class ViewPart and use the Table class in the code
code:Document +import:org.w3c.*       Use Document in the code and definitely have org.w3c in the import definition
parameter:JGraph                      Use JGraph as a parameter in a method
parameter:JGraph code:Cell            Use JGraph as a parameter and/or Cell in the code
method:paint -class:Color             Contain a method named paint but do not have Color as the class name
method:paint +parameter:Graphics      Contain a method named paint and definitely have Graphics in any of the method parameters
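
The prefixes in the table above decide whether a clause is optional, required (+), or prohibited (-). With only optional clauses, a document that matches either one qualifies, which is why the fourth query reads "and/or". The following is a minimal sketch of that clause structure in plain Java. It is not Lucene's QueryParser: the class QueryClauseSketch and its members are hypothetical names used only for illustration, and it ignores everything beyond whitespace-separated clauses (operators such as AND, phrases, wildcards, and grouping).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified stand-in for the clause structure of a query,
// not Lucene's QueryParser.
class QueryClauseSketch {

    /** One clause of a query: an occurrence flag, a field name, and a term. */
    static final class Clause {
        final char occur;   // '+' required, '-' prohibited, ' ' optional
        final String field;
        final String term;

        Clause(char occur, String field, String term) {
            this.occur = occur;
            this.field = field;
            this.term = term;
        }

        String describe() {
            if (occur == '+') return "required";
            if (occur == '-') return "prohibited";
            return "optional";
        }

        @Override
        public String toString() {
            return describe() + " " + field + ":" + term;
        }
    }

    /** Splits an expression on whitespace; bare terms fall back to defaultField. */
    static List<Clause> parse(String expression, String defaultField) {
        List<Clause> clauses = new ArrayList<Clause>();
        for (String token : expression.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty expression
            }
            char occur = ' ';
            if (token.charAt(0) == '+' || token.charAt(0) == '-') {
                occur = token.charAt(0);
                token = token.substring(1);
            }
            int colon = token.indexOf(':');
            String field = colon < 0 ? defaultField : token.substring(0, colon);
            String term = colon < 0 ? token : token.substring(colon + 1);
            clauses.add(new Clause(occur, field, term));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Prints:
        //   optional method:paint
        //   prohibited class:Color
        //   optional code:Graphics
        for (Clause c : parse("method:paint -class:Color Graphics", "code")) {
            System.out.println(c);
        }
    }
}
```

The last clause shows the default field at work: a term with no field prefix is searched in code, the same default that the search listing passes to QueryParser.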

Indexing different syntactic elements allows the user to form specific queries and search code. The sample code used for searching is shown in the following listing.

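import java.io.File;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// JavaSourceCodeAnalyzer is the analyzer defined earlier in this article.
public class JavaCodeSearch {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        String q = args[1]; // e.g. parameter:JGraph code:insert

        // Open the existing index (false = do not create a new one)
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);
        try {
            // Default analyzer for all fields; "import" is treated as a single keyword
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new JavaSourceCodeAnalyzer());
            analyzer.addAnalyzer("import", new KeywordAnalyzer());

            // "code" is the default field for terms without a field prefix
            Query query = QueryParser.parse(q, "code", analyzer);

            long start = System.currentTimeMillis();
            Hits hits = is.search(query);
            long end = System.currentTimeMillis();
            System.err.println("Found " + hits.length()
                + " docs in " + (end - start) + " millisec");

            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println(doc.get("filename")
                    + " with a score of " + hits.score(i));
            }
        } finally {
            is.close(); // release the index even if the search fails
        }
    }
}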

IndexSearcher uses FSDirectory to open the directory containing the index. The query string is analyzed by an Analyzer so that the query terms take the same form as the indexed terms (stemmed, lower-cased, filtered, etc.).

This raises a problem for any Field that was indexed as a keyword, that is, indexed without being tokenized: QueryParser applies the single Analyzer passed to it to every field in the query, so the query term for a keyword field would be analyzed even though the indexed value was not. To overcome this, the PerFieldAnalyzerWrapper provided by Lucene can be used to specify the analysis required for each field in the query. Hence, for the query string import:org.w3c.* AND code:Document, KeywordAnalyzer is used for org.w3c.* and JavaSourceCodeAnalyzer is used for Document.

QueryParser analyzes the query string with the PerFieldAnalyzerWrapper, using code as the default field for any term that does not name a field, and returns the resulting Query. IndexSearcher then executes the Query and returns a Hits object containing the documents that satisfy it.
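
The per-field dispatch described above can be sketched without Lucene on the classpath. The example below is only an illustration of the idea: TextAnalyzer, PerFieldAnalyzer, and newQueryAnalyzer are hypothetical names, not Lucene classes, and the default analyzer here merely lower-cases its input as a stand-in for JavaSourceCodeAnalyzer, which does more than that.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: mimics the idea behind PerFieldAnalyzerWrapper with
// hypothetical types; none of these are Lucene classes.
class PerFieldAnalysisSketch {

    /** Hypothetical minimal analyzer: maps raw text to the form used for matching. */
    interface TextAnalyzer {
        String analyze(String text);
    }

    /** Routes each field to its own analyzer and falls back to a default. */
    static final class PerFieldAnalyzer {
        private final TextAnalyzer defaultAnalyzer;
        private final Map<String, TextAnalyzer> byField = new HashMap<String, TextAnalyzer>();

        PerFieldAnalyzer(TextAnalyzer defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        void addAnalyzer(String field, TextAnalyzer analyzer) {
            byField.put(field, analyzer);
        }

        String analyze(String field, String text) {
            TextAnalyzer analyzer = byField.get(field);
            return (analyzer != null ? analyzer : defaultAnalyzer).analyze(text);
        }
    }

    /** Builds a wrapper configured like the one in the search listing. */
    static PerFieldAnalyzer newQueryAnalyzer() {
        // Assumption: lower-casing stands in for the real source code analyzer.
        PerFieldAnalyzer wrapper =
            new PerFieldAnalyzer(text -> text.toLowerCase(Locale.ROOT));
        // Keyword-style analysis: the text is passed through untouched.
        wrapper.addAnalyzer("import", text -> text);
        return wrapper;
    }

    public static void main(String[] args) {
        PerFieldAnalyzer analyzer = newQueryAnalyzer();
        // Prints: import -> org.w3c.dom.Document
        System.out.println("import -> " + analyzer.analyze("import", "org.w3c.dom.Document"));
        // Prints: code -> document
        System.out.println("code -> " + analyzer.analyze("code", "Document"));
    }
}
```

If the import term were sent through the default analyzer instead, it would be changed before lookup and could no longer match the unanalyzed value in the index, which is the mismatch the wrapper avoids.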

Conclusion

This article shows how Lucene, a text search engine, can be used to search source code by adding an analyzer and a multi-field index. While the article introduces the basic functionality of a code search engine, building more sophisticated analyzers for source code could improve the querying capability and the search results. Such search engines would allow users across the software development community to search and share source code.

Renuka Sindhgatta currently works as a senior technical architect in the Software Engineering and Technology labs of Infosys Technologies Limited, Bangalore, India.

