Advanced Text Indexing with Lucene
In-Memory Indexing
In the previous section, I mentioned that new documents added to an index
are stored in memory before being written to the disk. You also saw how to
control the rate at which this is done via IndexWriter's instance
variables. The Lucene distribution contains the RAMDirectory
class, which gives even more control over this process. This class implements
the Directory interface, just like FSDirectory does,
but stores indexed documents in memory, while FSDirectory stores
them on disk.
Because RAMDirectory does not write anything to the disk, it
is faster than FSDirectory. However, since computers usually come
with less RAM than hard disk space, RAMDirectory is not suitable
for very large indices.
The MemoryVsDisk class demonstrates how to use
RAMDirectory as an in-memory buffer in order to improve the
indexing speed.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
/**
* Creates an index called 'index' in a temporary directory.
* The number of documents to add to this index, the mergeFactor and
* the maxMergeDocs must be specified on the command line
* in that order - this class expects to be called correctly.
* Additionally, if the fourth command line argument is '-r' this
* class will first index all documents in RAMDirectory before
* flushing them to the disk in the end. To make this class use the
* regular FSDirectory use '-f' as the fourth command line argument.
*
* Note: before running this for the first time, manually create the
* directory called 'index' in your temporary directory.
*/
public class MemoryVsDisk
{
    public static void main(String[] args) throws Exception
    {
        int docsInIndex = Integer.parseInt(args[0]);
        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";
        Analyzer analyzer = new StopAnalyzer();
        long startTime = System.currentTimeMillis();
        if ("-r".equalsIgnoreCase(args[3]))
        {
            // if -r argument was specified, use RAMDirectory
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
            addDocs(ramWriter, docsInIndex);
            // close the in-memory writer first, so that every buffered
            // document is flushed to ramDir before it is merged to disk
            ramWriter.close();
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.close();
        }
        else
        {
            // create an index using FSDirectory
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
            fsWriter.mergeFactor = Integer.parseInt(args[1]);
            fsWriter.maxMergeDocs = Integer.parseInt(args[2]);
            addDocs(fsWriter, docsInIndex);
            fsWriter.close();
        }
        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }
    private static void addDocs(IndexWriter writer, int docsInIndex)
        throws IOException
    {
        for (int i = 0; i < docsInIndex; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
    }
}
To create an index with 10,000 documents and only use FSDirectory, use this:
prompt> time java MemoryVsDisk 10000 10 100000 -f
Total time: 41380 ms
real 0m42.739s
user 0m36.750s
sys 0m4.180s
To create an index of the same size, but faster, using
RAMDirectory, call MemoryVsDisk as follows:
prompt> time java MemoryVsDisk 10000 10 100000 -r
Total time: 27325 ms
real 0m28.695s
user 0m27.920s
sys 0m0.610s
However, note that you can achieve the same, or even better, performance by
choosing a more suitable value for mergeFactor:
prompt> time java MemoryVsDisk 10000 1000 100000 -f
Total time: 24724 ms
real 0m26.108s
user 0m25.280s
sys 0m0.620s
Be careful, however, when tuning mergeFactor. A value that
requires more memory than your JVM can access may cause a
java.lang.OutOfMemoryError.
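To get a feel for why a larger mergeFactor can reduce the amount of disk work, here is a small simulation. It is only a simplified model and not Lucene's actual merge policy: it assumes that documents are buffered in memory until mergeFactor of them have accumulated, that they are then written as one segment, and that whenever mergeFactor segments of equal size exist they are rewritten into one larger segment. The class name MergeModel and all of its members are made up for this illustration, and maxMergeDocs is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of segment merging; NOT Lucene's actual merge policy. */
final class MergeModel {
    final int mergeFactor;
    final List<Integer> segments = new ArrayList<>(); // doc counts of on-disk segments
    int buffered = 0;      // documents still held in memory
    long docsWritten = 0;  // documents written to disk, counting every rewrite

    MergeModel(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    /** Buffers one document; writes a segment once mergeFactor documents are buffered. */
    void addDocument() {
        buffered++;
        if (buffered == mergeFactor) {
            flush();
            cascade();
        }
    }

    /** Writes whatever is still buffered as one final segment. */
    void close() {
        if (buffered > 0) {
            flush();
        }
    }

    private void flush() {
        segments.add(buffered);
        docsWritten += buffered;
        buffered = 0;
    }

    /** While the last mergeFactor segments have equal size, rewrite them as one. */
    private void cascade() {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            int size = segments.get(segments.size() - 1);
            boolean equal = true;
            for (int i = from; i < segments.size(); i++) {
                if (segments.get(i) != size) {
                    equal = false;
                    break;
                }
            }
            if (!equal) {
                break;
            }
            int merged = size * mergeFactor;
            segments.subList(from, segments.size()).clear();
            segments.add(merged);
            docsWritten += merged;
        }
    }

    /** Runs the model for the given number of documents and returns its final state. */
    static MergeModel run(int docs, int mergeFactor) {
        MergeModel model = new MergeModel(mergeFactor);
        for (int i = 0; i < docs; i++) {
            model.addDocument();
        }
        model.close();
        return model;
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] { 10, 1000 }) {
            MergeModel model = run(10000, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor
                + " segments=" + model.segments.size()
                + " docsWritten=" + model.docsWritten);
        }
    }
}
```

In this model, 10,000 documents with a mergeFactor of 10 end up in a single segment, but every document is written four times (40,000 document writes in total). With a mergeFactor of 1000, each document is written once and the index is left with 10 segments. The model does not account for memory, which is the price of the larger value: more documents sit in RAM before anything reaches the disk.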
Finally, do not forget that you can greatly influence the performance of any Java application by giving the JVM more memory to work with:
prompt> time java -Xmx300m -Xms200m MemoryVsDisk 10000 10 100000 -r
Total time: 15166 ms
real 0m17.311s
user 0m15.400s
sys 0m1.590s
Merging Indices
If you want to improve indexing performance with Lucene, and manipulating
IndexWriter's mergeFactor and
maxMergeDocs proves insufficient, you can use
RAMDirectory to create in-memory indices. You could create a
multi-threaded indexing application that uses multiple
RAMDirectory-based indices in parallel, one in each thread, and
merges them into a single index on the disk using IndexWriter's
addIndexes(Directory[]) method. Taking this idea further, a
sophisticated indexing application could even create in-memory indices on
multiple computers in parallel. To make full use of this approach, one needs
to ensure that the thread that performs the actual indexing on the disk is
never idle, as that translates to wasted time.
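The shape of such an application can be sketched without Lucene. In the toy model below, each worker thread fills its own in-memory buffer (standing in for a RAMDirectory-based index), and a single thread then merges the buffers into one list (standing in for the on-disk index and the call to addIndexes(Directory[])). The class ParallelIndexSketch and its methods are hypothetical names for this illustration only, and plain strings take the place of documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model: workers fill private in-memory buffers; one thread merges them. */
final class ParallelIndexSketch {

    /** Stand-in for building a RAMDirectory-based index in one worker thread. */
    static List<String> buildInMemory(int workerId, int docs) {
        List<String> buffer = new ArrayList<>(docs);
        for (int i = 0; i < docs; i++) {
            buffer.add("worker" + workerId + "-doc" + i);
        }
        return buffer;
    }

    /** Builds the buffers in parallel, then merges them on the calling thread. */
    static List<String> indexInParallel(int workers, int docsPerWorker) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int id = w;
                Callable<List<String>> task = () -> buildInMemory(id, docsPerWorker);
                pending.add(pool.submit(task));
            }
            // Only this thread touches the merged index, mirroring the single
            // writer that would merge the in-memory indices onto the disk.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> result : pending) {
                try {
                    merged.addAll(result.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(indexInParallel(4, 2500).size() + " documents merged");
    }
}
```

This sketch waits for the buffers in the order in which they were submitted, so the merging thread can sit idle while a slow worker finishes. Collecting results in completion order, for example with java.util.concurrent.ExecutorCompletionService, is one way to keep that thread busier.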
Indexing in Multi-Threaded Environments
While multiple threads or processes can search (i.e. read) a single Lucene
index simultaneously, only a single thread or process is allowed to modify
(write) an index at a time. If your indexing application uses multiple
indexing threads that are adding documents to the same index, you must
serialize their calls to the IndexWriter.addDocument(Document)
method. Leaving these calls unserialized may cause threads to get in each
other's way and modify the index in unwanted ways, causing Lucene to throw
exceptions. In addition, to prevent misuse, Lucene uses file-based locks in
order to stop multiple threads or processes from creating
IndexWriters with the same index directory at the same time.
For instance, this code:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
/**
* Demonstrates how Lucene uses locks to prevent multiple processes from
* writing to the same index at the same time.
* Note: before running this for the first time, manually create the
* directory called 'index' in your temporary directory.
*/
public class DoubleTrouble
{
    public static void main(String[] args) throws Exception
    {
        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";
        Analyzer analyzer = new StopAnalyzer();
        IndexWriter firstWriter = new IndexWriter(indexDir, analyzer, true);
        // the following line will cause an exception
        IndexWriter secondWriter = new IndexWriter(indexDir, analyzer, false);
        // the following two lines will never even be reached
        firstWriter.close();
        secondWriter.close();
    }
}
will cause the following exception:
Exception in thread "main" java.io.IOException: \
Index locked for write: Lock@/tmp/index/write.lock
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:145)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:122)
at DoubleTrouble.main(DoubleTrouble.java:23)
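The lock shown above only stops a second IndexWriter from being created. The other requirement, serializing calls to addDocument(Document) when several threads share one writer, is left to your application. A minimal sketch of one way to do that follows. It does not use Lucene: ToyWriter is a hypothetical, deliberately non-thread-safe stand-in for IndexWriter, and SerializedWriterSketch is an invented wrapper that funnels every write through a single lock.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: funnel all writes from several threads through one lock. */
final class SerializedWriterSketch {

    /** Hypothetical stand-in for IndexWriter; deliberately not thread-safe. */
    static final class ToyWriter {
        private final List<String> docs = new ArrayList<>();

        void addDocument(String doc) {
            docs.add(doc);
        }

        int docCount() {
            return docs.size();
        }
    }

    private final ToyWriter writer = new ToyWriter();
    private final Object writeLock = new Object();

    /** Indexing threads call this instead of touching the writer directly. */
    void addDocument(String doc) {
        synchronized (writeLock) {
            writer.addDocument(doc);
        }
    }

    int docCount() {
        synchronized (writeLock) {
            return writer.docCount();
        }
    }

    /** Starts the given number of threads, each adding docsPerThread documents. */
    static int indexConcurrently(int threads, int docsPerThread) {
        SerializedWriterSketch shared = new SerializedWriterSketch();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    shared.addDocument("thread" + id + "-doc" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.docCount();
    }

    public static void main(String[] args) {
        System.out.println(indexConcurrently(4, 1000) + " documents added");
    }
}
```

Because every call passes through the same lock, no two threads are ever inside the writer at once, and no document is lost. The cost is that the threads take turns at the writer, so this arrangement pays off only when preparing a document is more expensive than adding it.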
Optimizing Indices
I have mentioned index optimization a few times in this article, but I have
not yet explained it. To optimize an index, one has to call
optimize() on an IndexWriter instance. When this
happens, all in-memory documents are flushed to the disk and all index segments
are merged into a single segment, reducing the number of files that make up the
index. However, optimizing an index does not help improve indexing
performance. As a matter of fact, optimizing an index during the indexing
process will only slow things down. Despite this, optimizing may sometimes be
necessary in order to keep the number of open files under control. For
instance, optimizing an index during the indexing process may be needed in
situations where searching and indexing happen concurrently, since both
processes keep their own set of open files. A good rule of thumb is that if
more documents will be added to the index soon, you should avoid calling
optimize(). If, on the other hand, you know that the index will
not be modified for a while, and the index will only be searched, you should
optimize it. That will reduce the number of segments (files on the disk), and
consequently improve search performance--the fewer files Lucene has to open
while searching, the faster the search.
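The link between the number of segments and search cost can be pictured with a toy model. The sketch below is not Lucene code: it assumes that a segment is simply a map from a term to the ids of the documents that contain it, that a search must consult every segment, and that optimizing means merging all of those maps into one. OptimizeSketch and its methods are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model: a segment maps each term to the ids of documents containing it. */
final class OptimizeSketch {

    /** Builds a one-term segment; a helper for the demonstration only. */
    static Map<String, List<Integer>> segmentOf(String term, Integer... docIds) {
        Map<String, List<Integer>> segment = new TreeMap<>();
        segment.put(term, new ArrayList<>(Arrays.asList(docIds)));
        return segment;
    }

    /** Looks the term up in every segment; the work grows with the segment count. */
    static List<Integer> search(List<Map<String, List<Integer>>> segments, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            List<Integer> postings = segment.get(term);
            if (postings != null) {
                hits.addAll(postings);
            }
        }
        return hits;
    }

    /** Merges all segments into a single segment, leaving the input untouched. */
    static List<Map<String, List<Integer>>> optimize(List<Map<String, List<Integer>>> segments) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), key -> new ArrayList<>())
                      .addAll(entry.getValue());
            }
        }
        List<Map<String, List<Integer>>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> index = new ArrayList<>();
        index.add(segmentOf("bibamus", 0, 1, 2));
        index.add(segmentOf("bibamus", 3, 4));
        List<Map<String, List<Integer>>> optimized = optimize(index);
        System.out.println(index.size() + " segments: " + search(index, "bibamus"));
        System.out.println(optimized.size() + " segment: " + search(optimized, "bibamus"));
    }
}
```

Both searches return the same hits, but the optimized index answers with a single lookup. The merge itself has to read and rewrite every segment, which is why it is best left until indexing has finished.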
To illustrate the effect of optimizing an index, we can use the
IndexOptimizeDemo class:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
/**
* Creates an index called 'index' in a temporary directory.
* If you want this class to optimize the index at the end, use the '-o'
* command line argument. If you do not want to optimize the index
* at the end use any other value for the command line argument.
* This class expects to be called correctly.
*
* Note: before running this for the first time, manually create the
* directory called 'index' in your temporary directory.
*/
public class IndexOptimizeDemo
{
    public static void main(String[] args) throws Exception
    {
        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";
        Analyzer analyzer = new StopAnalyzer();
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
        for (int i = 0; i < 15; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
        if ("-o".equalsIgnoreCase(args[0]))
        {
            System.out.println("Optimizing the index...");
            writer.optimize();
        }
        writer.close();
    }
}
As you can see from the class Javadoc and code, the created index will be
optimized only if the -o command-line argument is used. To create an
unoptimized index with this class, use this:
prompt> java IndexOptimizeDemo -n
-rw-rw-r-- 1 otis otis 10 Feb 18 23:50 _a.f1
-rw-rw-r-- 1 otis otis 260 Feb 18 23:50 _a.fdt
-rw-rw-r-- 1 otis otis 80 Feb 18 23:50 _a.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _a.fnm
-rw-rw-r-- 1 otis otis 30 Feb 18 23:50 _a.frq
-rw-rw-r-- 1 otis otis 30 Feb 18 23:50 _a.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _a.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _a.tis
-rw-rw-r-- 1 otis otis 4 Feb 18 23:50 deletable
-rw-rw-r-- 1 otis otis 5 Feb 18 23:50 _g.f1
-rw-rw-r-- 1 otis otis 130 Feb 18 23:50 _g.fdt
-rw-rw-r-- 1 otis otis 40 Feb 18 23:50 _g.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _g.fnm
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _g.frq
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _g.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _g.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _g.tis
-rw-rw-r-- 1 otis otis 22 Feb 18 23:50 segments
Example 2: An unoptimized index usually contains more than one segment.
This index contains two segments. To create a fully-optimized index, call
this class with the -o command-line argument:
prompt> java IndexOptimizeDemo -o
-rw-rw-r-- 1 otis otis 4 Feb 18 23:50 deletable
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _h.f1
-rw-rw-r-- 1 otis otis 390 Feb 18 23:50 _h.fdt
-rw-rw-r-- 1 otis otis 120 Feb 18 23:50 _h.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _h.fnm
-rw-rw-r-- 1 otis otis 45 Feb 18 23:50 _h.frq
-rw-rw-r-- 1 otis otis 45 Feb 18 23:50 _h.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _h.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _h.tis
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 segments
Example 3: A fully-optimized index contains only a single segment.
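If you would rather not count segments by eye, the file names in the two listings suggest a simple check. The sketch below assumes the layout shown above, in which every per-segment file is named with an underscore, the segment name, a dot, and an extension (for example _a.fdt), while files such as segments and deletable belong to the index as a whole. Other Lucene versions may lay out their files differently. SegmentCounter is a hypothetical helper, not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Counts segments by grouping index file names on their "_name" prefix. */
final class SegmentCounter {

    /** Returns the distinct segment names found among the given file names. */
    static Set<String> segmentNames(List<String> fileNames) {
        Set<String> names = new TreeSet<>();
        for (String file : fileNames) {
            int dot = file.indexOf('.');
            // Per-segment files look like "_a.fdt"; index-wide files such as
            // "segments" and "deletable" have no underscore and are skipped.
            if (file.startsWith("_") && dot > 1) {
                names.add(file.substring(0, dot));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> unoptimized = Arrays.asList(
            "_a.f1", "_a.fdt", "_a.fdx", "_a.fnm", "_a.frq", "_a.prx",
            "_a.tii", "_a.tis", "deletable", "_g.f1", "_g.fdt", "_g.fdx",
            "_g.fnm", "_g.frq", "_g.prx", "_g.tii", "_g.tis", "segments");
        List<String> optimized = Arrays.asList(
            "deletable", "_h.f1", "_h.fdt", "_h.fdx", "_h.fnm", "_h.frq",
            "_h.prx", "_h.tii", "_h.tis", "segments");
        System.out.println("unoptimized: " + segmentNames(unoptimized));
        System.out.println("optimized: " + segmentNames(optimized));
    }
}
```

To run the same check against a real index directory, pass it the names returned by java.io.File's list() method, after checking that the result is not null.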
Conclusion
This article has discussed the basic structure of a Lucene index and has demonstrated a few techniques for improving indexing performance. You also learned about potential problems with indexing in multi-threaded environments, about what it means to optimize an index, and how this affects indexing. This knowledge should allow you to gain more control over Lucene's indexing process to improve its performance. The next article will examine Lucene's text-searching capabilities.
Otis Gospodnetic is an active Apache Jakarta member, a member of the Apache Jakarta Project Management Committee, a developer of Lucene, and the maintainer of jGuru's Lucene FAQ.