SearchAssist: A Portable Search Engine in Java
by Ashwin Jayaprakash10/01/2003
2257! That's how many bookmarks I have in my Mozilla browser right now, all of which I upload to my web site every month. Uploading them is not a problem, but displaying them nicely on a web site without server-side support is. That simply rules out SQL, any JSPs, and other fancy stuff. Sure, DHTML is good, but it has its limitations. So, the only solution that remained was to download everything to the client's machine and run it there — an applet, that's it! But that presents its own challenges: the applet has to be small, the index must be pre-generated, and the applet should be able to read the index, retrieve the results, and then display them — all without having to make any calls back to the server. This looks like an ideal candidate for a three-tier web application.
After four weeks of meddling with XML, XPath, Swing, and LiveConnect, I came up with SearchAssist, a portable search engine written in Java. It's a Java applet of 30KB, using a ternary search tree. The UI components are a Swing interface and HTML, as shown in Figure 1, and the input format is XML. Other cool features are:
- Uses LiveConnect to communicate with the browser.
- Also runs as a Mozilla Sidebar.
- Auto-completes the word closest to what the user is typing.
- Configurable rendering component.
- Index file can be specified externally using applet parameters.

Figure 1. SearchAssist
Let's go through the entire process of design and development of SearchAssist.
SearchAssist Design
Input Data Format
When you consider the functionality that is expected, the potential applications of this project become obvious. For example, using XML (instead of the HTML format of Mozilla bookmarks) opens up new usage scenarios: a lightweight help system for another application, perhaps, or a photo album that can be searched using the person's name. The possibilities are endless. The only trick is designing a suitable data format able to support multiple nested levels of bookmarks and topics, such as Mozilla allows (see Figure 2).

Figure 2. The Mozilla bookmark manager
The XML will have a Category tag as the root. See Figure 3 for
an example. A Category can have nested categories or
Items as leaves. Both Category and Item
have Title tags. Each Item has an additional
Link tag that, as the name suggests, contains the link to the
actual text, web site, or photo, as the case may be. Each Category
or Item is uniquely addressable. Both Category and
Item tags have an attribute called uniqueID that
contains a string that does not repeat in the document. Such a format is also
ideal for storing nested topics for a help system.
Figure 3. An XML input file generated from Mozilla bookmarks
Indexing
The Index, like a book's index, is a map of keywords to locations (much like a book's index points to page
numbers where the index entries appear in the main text). In our XML file, the keywords
are the text in the Title tags of Category and
Item tags. Since each Category and Item
tag carries a uniqueID, they are similar to the page numbers in our
book index example.
For a seasoned Java programmer, a HashMap should come to mind,
as it can be serialized to a file. Although this is the right way of thinking,
remember that the index must be compact and easily searchable in as few steps
as possible. Although widely used (and sometimes overused), the
HashMap does not completely satisfy the requirements at hand. It
does not differentiate between an Integer, String,
or Array, as long as the key is an Object. If there
are three keywords words in the XML input file (such as "book," "bookmark," and
"booklet"), then a HashMap will treat all three of them as just
Objects and hence they will be hashed to completely different
locations. Although all three words have a common prefix — "book" —
the HashMap cannot use this pattern effectively, being unaware of
the nature and content of its keys.
A ternary search tree, on the other hand, is designed so that the nodes with
the same prefix share the same parent in the search tree. Instead of treating a
keyword as a single, unbreakable Object, it is split into an array
of Characters. Keys that share common prefixes reuse the same
characters. The remaining, unmatched part of the keyword branches off. This way of storing
saves a lot of space, as the tree is aware of the keyword's content. Such a data
structure automatically lends itself to providing another very useful UI
feature — auto-completion.
Auto-completion
Since the index is aware of the keywords, the search program navigates the
tree so that the string of characters that have been visited in the tree match
the text that the user has typed. If the user has typed "boo", then the tree
has seen b, o, and o. The tree can then
guess that the user is typing book, bookmark,
or booklet, if the tree contains only these three words starting with
boo. The tree will supply the closest match that appears first in
alphabetical order. If there is no match, there will be no auto-completion, so
the user can stop without having to type the entire search word.
Displaying Search Results
The search results should be displayed if the user's typed text matches a keyword in the index. Because the actual information can be as varied as simple text, web sites, or photos, the display of search results must be customizable. To achieve this, the actual implementation must be kept separate from the interface — a simple programming idiom.
UI Components and Layout
Now that we've hammered out most of the challenges, it's time to design the
UI layout. The layout is very minimal, as seen in Figure 4. The most crucial
Swing components are a custom JTextField to do auto-completion, a
JTextArea to display the list of nearest matches and a
JPanel containing a custom JComponent to display the
search results.

Figure 4. The layout of the UI
Generating the Ternary Search Tree Index File
|
Source Code Download the example source code for this article. |
The Mozilla bookmarks file must be converted to XML format. A careful
observation of the HTML bookmarks file reveals a simple structure that can be
easily transformed to the XML structure using SAX. The HTML file contains
several attributes that are not required for the transformation, and so can be
safely ignored. See MozillaBookmark2XMLConverter.java.
The resulting XML file, containing the Category,
Title, Item, and Link tags, is fed to a
small Java program that just echoes the XML tags and text content into another
file. It is a simple SAX Handler. It also generates a uniqueID for each
Category and Item, adding it as an attribute and
value pair. The IDs are generated such that the parent's ID is also preserved
in the child's ID. If the parent XML node has an ID of 1_2, its
children will have IDs of 1_2_1, 1_2_2,
1_2_3, and so on, as shown in Figure 5. This makes it possible to
provide a "Go to Parent" link beside each subtree in the search results. Since
every node carries within itself the parent's ID as a prefix, it is very easy
to navigate to the parent node. I haven't yet implemented the navigation,
though. See XMLIDGenerator.java.

Figure 5. The index-generation algorithm, illustrated
The code uses a combination of Stacks and HashMaps
to accomplish this. The Stack is used because the parent's ID is a
prefix for the child's ID. So, it helps keep track of the depth of the tree.
The HashMap keeps track of which sibling the current node is, for
that level. A node could be 1_1_3 if it's the second sibling, or
1_1_3, if it's the third.
A list of keywords and their corresponding uniqueIDs is required for the
index. Since the uniqueIDs have been generated in the previous step, the next
step is to generate the keywords that will point to the uniqueIDs, as shown in
Figure 6.

Figure 6. The XML input file with keywords
Again, a simple SAX handler retrieves the text of the Title
tags and tokenizes them. This list of tokens tag will contain non-keywords or
stop words such as articles, pronouns, and prepositions that can be weeded out.
The stop words are stored in a file that can be referred to by the SAX Handler.
The filtered list of keywords is written back as a comma-separated list under a
Keywords tag of the parent Category or
Title tag. See KeywordTagAdder.java, excerpted
here:
public void startElement(
String namespaceURI,
String lName, // local name
String qName, // qualified name
Attributes attrs) throws SAXException
{
indentLevel++;
nl();
// element name
String eName = lName;
if ("".equals(eName))
{
// namespaceAware = false
eName = qName;
}
currTagName = eName;
emit("<" + eName);
if (attrs != null)
{
for (int i = 0; i < attrs.getLength(); i++)
{
String aName = attrs.getLocalName(i);
// Attr name
if ("".equals(aName))
{
aName = attrs.getQName(i);
}
emit("\t");
emit(aName);
emit("=\"");
emit(attrs.getValue(i));
emit("\"");
}
}
emit(">");
}
public void characters(char buf[], int offset, int len)
throws SAXException
{
String s = new String(buf, offset, len).trim();
if (!s.equals(""))
{
// nl();
emit(s);
keywords = null;
if (currTagName
.equals(getTagNameContainingKeywords()))
{
keywords = s;
}
}
}
public void endElement(
String namespaceURI,
String sName, // simple name
String qName // qualified name
) throws SAXException
{
// element name
String eName = sName;
if ("".equals(eName))
{
eName = qName;
// namespaceAware = false
}
nl();
emit("</" + eName + ">");
//if (currTagName.equals(getTagNameContainingKeywords()))
//causes problems
if (eName.equals(getTagNameContainingKeywords()))
{
nl();
emit("<" + getTagNameToPutKeywordsIn() + ">");
nl();
emit(convertToKeywords(keywords));
nl();
emit("</" + getTagNameToPutKeywordsIn() + ">");
}
indentLevel--;
}
With the XML file containing both the Keywords and the
uniqueIDs, the ternary search tree can be generated. As a keyword
can occur in more than one Category or Item, the
ternary search tree stores a List of uniqueIDs
instead of just one. This List can provide another useful UI
feature. The size() method on the List can determine
the number of hits for the corresponding keyword, displayable in square
brackets in the JTextArea. The ternary search tree thus populated
is serialized into a file and stored on the disk. See
TSTGenerator.java.
The steps preceding the actual index generation are required only if the Mozilla bookmarks file is used as the source of information. The index can be generated from any source — it is just a mapping of keywords to lists of unique IDs.
The Mozilla bookmarks file may contain some characters that cause the XML
parser to fail. Because HTML allows unclosed tags such as <P>,
<BR>, <HR>, and <DT>, these unclosed tags must
be replaced by their XHTML equivalents — <p />,
<br />, <hr />, and <dt />. As
URLs may contain ampersands (&) as part of HTTP GET requests,
they too must be converted to an XML entity equivalent, such as
&. A small Java class finds and replaces such
troublemaking strings in the XML file before it is fed to the XML parser.
The steps described above constitute a simple information transformation
pipeline. See SpecialCharactersFilter.java.
Pages: 1, 2 |