Introduction to Text Indexing with Apache Jakarta Lucene
In this example of a custom Analyzer, we will assume we are
indexing text in English. Our PorterStemAnalyzer will perform
Porter stemming on its input. As stated by its creator, the
Porter stemming algorithm (or "Porter stemmer") is a process for removing the
more common morphological and inflexional endings from words in English. Its main
function is to be part of a term normalization process that is usually done
when setting up Information Retrieval systems.
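To give a feel for what "removing endings" means, here is a minimal sketch of just the first rule group of the algorithm (step 1a, which handles plural endings). The class name PorterStep1aSketch is hypothetical, and this is not Lucene's implementation: the full algorithm applies several further steps, so its final output for a given word can differ from what this fragment returns.

```java
/**
 * Illustrative only: step 1a of the Porter algorithm (plural endings).
 * The complete algorithm has several more steps that are omitted here.
 */
final class PorterStep1aSketch {

    /** Applies the four step 1a rules; the first matching suffix wins. */
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                  // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        String[] samples = {"caresses", "ponies", "caress", "cats"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + step1a(sample));
        }
    }
}
```

Note that "ponies" becomes "poni" rather than "pony": a stem only has to be the same for related word forms, it does not have to be a dictionary word.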
This Analyzer will use an implementation of the Porter stemming
algorithm provided by Lucene's PorterStemFilter class.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import java.io.Reader;
import java.util.Hashtable;
/**
* PorterStemAnalyzer processes input
* text by stemming English words to their roots.
* This Analyzer also converts the input to lower case
* and removes stop words. A small set of default stop
* words is defined in the STOP_WORDS
* array, but a caller can specify an alternative set
* of stop words by calling the non-default constructor.
*/
public class PorterStemAnalyzer extends Analyzer
{
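// Per-instance, so that analyzers built with different stop
// word lists do not overwrite each other's table.
private final Hashtable _stopTable;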
/**
* An array containing some common English words
* that are usually not useful for searching.
*/
public static final String[] STOP_WORDS =
{
"0", "1", "2", "3", "4", "5", "6", "7", "8",
"9", "000", "$",
"about", "after", "all", "also", "an", "and",
"another", "any", "are", "as", "at", "be",
"because", "been", "before", "being", "between",
"both", "but", "by", "came", "can", "come",
"could", "did", "do", "does", "each", "else",
"for", "from", "get", "got", "has", "had",
"he", "have", "her", "here", "him", "himself",
"his", "how","if", "in", "into", "is", "it",
"its", "just", "like", "make", "many", "me",
"might", "more", "most", "much", "must", "my",
"never", "now", "of", "on", "only", "or",
"other", "our", "out", "over", "re", "said",
"same", "see", "should", "since", "so", "some",
"still", "such", "take", "than", "that", "the",
"their", "them", "then", "there", "these",
"they", "this", "those", "through", "to", "too",
"under", "up", "use", "very", "want", "was",
"way", "we", "well", "were", "what", "when",
"where", "which", "while", "who", "will",
"with", "would", "you", "your",
"a", "b", "c", "d", "e", "f", "g", "h", "i",
"j", "k", "l", "m", "n", "o", "p", "q", "r",
"s", "t", "u", "v", "w", "x", "y", "z"
};
/**
* Builds an analyzer.
*/
public PorterStemAnalyzer()
{
this(STOP_WORDS);
}
/**
* Builds an analyzer with the given stop words.
*
* @param stopWords a String array of stop words
*/
public PorterStemAnalyzer(String[] stopWords)
{
_stopTable = StopFilter.makeStopTable(stopWords);
}
/**
* Processes the input by first converting it to
* lower case, then by eliminating stop words, and
* finally by performing Porter stemming on it.
*
* @param reader the Reader that
* provides access to the input text
* @return an instance of TokenStream
*/
public final TokenStream tokenStream(Reader reader)
{
return new PorterStemFilter(
new StopFilter(new LowerCaseTokenizer(reader),
_stopTable));
}
}
The tokenStream(Reader) method is the core of the
PorterStemAnalyzer. It lower-cases input, eliminates stop words,
and uses the PorterStemFilter to remove common morphological and
inflexional endings. This class includes only a small set of stop words for
English. When using Lucene in a production system for indexing and searching
text in English, I suggest that you use a more complete list of stop words,
such as this one.
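The order of the three stages matters, so it may help to see them spelled out without any Lucene classes. The following is a rough sketch in plain Java, not a substitute for the analyzer: AnalysisChainSketch and its analyze method are hypothetical names, and the stemmer passed in by main is a crude stand-in that only strips a trailing "s", not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Plain-Java imitation of the tokenize, stop, stem chain (no Lucene involved). */
final class AnalysisChainSketch {

    static List<String> analyze(String text, Set<String> stopWords,
                                UnaryOperator<String> stemmer) {
        List<String> terms = new ArrayList<>();
        // Stage 1: break the text at non-letters and lower-case each token.
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty string for leading separators
            }
            String lower = token.toLowerCase(Locale.ROOT);
            // Stage 2: drop stop words before any stemming work is done.
            if (stopWords.contains(lower)) {
                continue;
            }
            // Stage 3: reduce the surviving token to its stem.
            terms.add(stemmer.apply(lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        // Stand-in stemmer (an assumption for this sketch, NOT Porter stemming).
        UnaryOperator<String> stripPluralS =
            w -> w.endsWith("s") && !w.endsWith("ss") ? w.substring(0, w.length() - 1) : w;
        System.out.println(analyze("The Cats and the Dogs", stopWords, stripPluralS));
        // prints [cat, dog]
    }
}
```

Stop words are compared after lower-casing, which is why the stop list itself only needs lower-case entries.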
To use our new PorterStemAnalyzer class, we need to modify a
single line of our LuceneIndexExample class shown above, to
instantiate PorterStemAnalyzer instead of
StandardAnalyzer:
Old line:
Analyzer analyzer = new StandardAnalyzer();
New line:
Analyzer analyzer = new PorterStemAnalyzer();
The rest of the code remains unchanged. Anything indexed after this change
will pass through the Porter stemmer. The process of text indexing with
PorterStemAnalyzer is depicted in Figure 1.

Figure 1: The indexing process with PorterStemAnalyzer.
Because different Analyzers process their text input
differently, note again that changing the Analyzer for an existing
index is dangerous. Documents indexed before the change were analyzed one way
and those indexed after it another, so later searches will return erroneous
results, just as they do when one Analyzer is used for indexing and
a different one for searching.
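The failure is easy to reproduce in miniature. The sketch below uses no Lucene classes: the "index" is just a set of terms, and STEMMING and PLAIN are hypothetical stand-ins for a stemming and a non-stemming analyzer (the stemming one merely strips a trailing "s"). It is meant only to show why the query must be analyzed the same way as the indexed text.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy demonstration of an index/search analyzer mismatch (no Lucene involved). */
final class AnalyzerMismatchSketch {

    /** Stand-in for a stemming analyzer: lower-cases and strips a trailing "s". */
    static final UnaryOperator<String> STEMMING = w -> {
        String lower = w.toLowerCase(Locale.ROOT);
        return lower.endsWith("s") ? lower.substring(0, lower.length() - 1) : lower;
    };

    /** Stand-in for a non-stemming analyzer: lower-cases only. */
    static final UnaryOperator<String> PLAIN = w -> w.toLowerCase(Locale.ROOT);

    /** "Indexes" the words by storing whatever terms the analyzer produces. */
    static Set<String> index(String[] words, UnaryOperator<String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String word : words) {
            terms.add(analyzer.apply(word));
        }
        return terms;
    }

    /** A query matches only if its analyzed form is a term in the index. */
    static boolean matches(Set<String> indexTerms, String query,
                           UnaryOperator<String> analyzer) {
        return indexTerms.contains(analyzer.apply(query));
    }

    public static void main(String[] args) {
        Set<String> indexTerms = index(new String[] {"Servers", "Indexes"}, STEMMING);
        System.out.println(matches(indexTerms, "servers", STEMMING)); // true
        System.out.println(matches(indexTerms, "servers", PLAIN));    // false
    }
}
```

The second lookup fails even though the word "Servers" was indexed, because the index holds the term "server" while the unstemmed query asks for "servers".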
Field Types
Lucene offers four different types of fields from which a developer can
choose: Keyword, UnIndexed, UnStored,
and Text. Which field type you should use depends on how you want
to use that field and its values.
Keyword fields are not tokenized, but are indexed and stored
in the index verbatim. This field type is suitable for values that
should be preserved in their entirety, such as URLs, dates, personal names,
Social Security numbers, telephone numbers, etc.
UnIndexed fields are neither tokenized nor indexed, but their
value is stored in the index word for word. This field type is suitable for
values that you need to display with search results, but that you will never
search directly. Because this type of field is not indexed, it cannot be
searched at all. Since the original value of a field of this type is stored in
the index, this type is not suitable for very large values if index size is an
issue.
UnStored fields are the opposite of UnIndexed fields.
Fields of this type are tokenized and indexed, but are not stored in the index.
This field type is suitable for indexing large amounts of text that does not need to
be retrieved in its original form, such as the bodies of Web pages, or any
other type of text document.
Text fields are tokenized, indexed, and stored in the index.
Fields of this type can therefore be both searched and retrieved in their
original form. Because the original value is stored, be cautious about how
much text you put in a Text field.
If you look back at the LuceneIndexExample class, you will see
that I used a Text field:
document.add(Field.Text("fieldname", text));
If we wanted to change the type of field "fieldname," we would call one of
the other methods of class Field:
document.add(Field.Keyword("fieldname", text));
or
document.add(Field.UnIndexed("fieldname", text));
or
document.add(Field.UnStored("fieldname", text));
Although the Field.Text, Field.Keyword, Field.UnIndexed, and
Field.UnStored calls may at first look like calls to constructors, they are
really just calls to different static factory methods of the Field
class. Table 1 summarizes the different field types.
Table 1: An overview of different field types.
| Field method/type | Tokenized | Indexed | Stored |
| Field.Keyword(String, String) | No | Yes | Yes |
| Field.UnIndexed(String, String) | No | No | Yes |
| Field.UnStored(String, String) | Yes | Yes | No |
| Field.Text(String, String) | Yes | Yes | Yes |
| Field.Text(String, Reader) | Yes | Yes | No |
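One way to read Table 1 is as three independent yes/no properties per field type. The sketch below models the four String-valued rows as a plain Java enum so that the choice can be expressed in code. FieldKindSketch and its methods are hypothetical names, not part of Lucene; the Reader variant of Field.Text is left out because, in terms of these three flags, it behaves like UnStored.

```java
/** Plain-Java model of the String-valued rows of Table 1 (not a Lucene class). */
enum FieldKindSketch {
    KEYWORD(false, true, true),
    UNINDEXED(false, false, true),
    UNSTORED(true, true, false),
    TEXT(true, true, true);

    final boolean tokenized;
    final boolean indexed;
    final boolean stored;

    FieldKindSketch(boolean tokenized, boolean indexed, boolean stored) {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** Only indexed fields can be matched by a search. */
    boolean searchable() {
        return indexed;
    }

    /** Only stored fields can be shown in their original form with results. */
    boolean retrievable() {
        return stored;
    }

    /** Finds the field type with exactly the requested combination of flags. */
    static FieldKindSketch choose(boolean tokenized, boolean indexed, boolean stored) {
        for (FieldKindSketch kind : values()) {
            if (kind.tokenized == tokenized && kind.indexed == indexed
                    && kind.stored == stored) {
                return kind;
            }
        }
        throw new IllegalArgumentException("no field type has that combination");
    }

    public static void main(String[] args) {
        for (FieldKindSketch kind : values()) {
            System.out.println(kind + ": searchable=" + kind.searchable()
                + ", retrievable=" + kind.retrievable());
        }
    }
}
```

Not every combination exists: asking for a field that is tokenized but not indexed, for example, makes choose throw, which mirrors the fact that the table has no such row.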
Conclusion
In this article, we have learned how to add basic text indexing
capabilities to our applications using IndexWriter and its associated
classes. We have also developed a custom Analyzer that can
perform Porter stemming on its input. Finally, we have looked at different
field types and learned what each of them can be used for. In the next article
of this Lucene series, we shall look at indexing in more depth, and address
issues such as performance and multi-threading.
References
- Lucene home page
- Lucene Developers
- Lucene Users
- Lucene release downloads
- Nightly build downloads
- JavaCC download
- Porter Stemmer
- Lucene FAQ: How do I write my own Analyzer?
Otis Gospodnetic is an active Apache Jakarta member, a member of Apache Jakarta Project Management Committee, a developer of Lucene and maintainer of the jGuru's Lucene FAQ.