ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)


Regular Expressions in J2SE

by Hetal C. Shah
11/26/2003

In Java applications that do text searching and manipulation, the StringTokenizer and String classes are used heavily. This can often result in complex code and lead to a maintenance nightmare.

JDK 1.4 supports regular expressions in the java.util.regex package. Using this package and its supporting classes makes string searching and manipulation much easier. It reduces development effort and, at the same time, makes the code significantly easier to maintain. Since the classes in this package are a standard part of core Java, they don't have to be distributed separately and can be assumed to be present.

Applications that search and manipulate text often look for an occurrence of a particular character or token in a String, then try to find the string surrounding it and validate the extracted String. A simple example is validation of a web site URL or an email address. To validate an email address, we could check for an occurrence of '@', followed later by at least one '.'. We will see at the end of the article how regular expressions simplify this validation; without them, the logic might be implemented in Java as shown below.

String str="administrator@admin.com";
int indexOfAtChar=str.indexOf("@");

if(indexOfAtChar > 0)
{
    int indexOfDotChar =
        str.indexOf(".",indexOfAtChar);
    if(indexOfDotChar > 0)
    {
      System.out.println ("Valid Email Address.");
    }
    else
    {
      System.out.println
      ("Invalid Email Address- " +
       "Missing character '.' after '@'.");
    }
}
else{
    System.out.println("Invalid Email Address- " +
                       "Missing character '@'.");
}

This produces the output:

Valid Email Address.

Regular expressions have attracted interest in the software industry for a number of years, and many programming languages and operating system tools support them.

This article explains the benefits of writing regular expressions using the java.util.regex package, and how to use its key components.

What Is a Regular Expression?

First of all, let's define a regular expression simply: A regular expression is a pattern, a template, to be matched against a string.

Users of a command-line operating system like DOS or Unix often use a directory listing command to find a list of files in a directory. On DOS, this would be:

dir *.txt

And on Unix, it would be:

ls *.txt

Here "*.txt" is a command parameter to display the list of files with file extension 'txt', irrespective of file name.

Now, say we want to see a list of files where the filename begins with 'a'; then the DOS command will be:

dir a*.*

and the Unix command will be

ls a*.*

Here "a*.*" means a filename starting with 'a', followed by any number of characters, followed by the character '.', followed by any file extension.

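These wildcard patterns illustrate the idea behind regular expressions, although their syntax is not the same: in a regular expression, '*' is a quantifier applied to the preceding element rather than a stand-in for any text.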

Regular Expression Grammar Rules

Before we jump into how to write regular expression code using the java.util.regex package, let's first have a brief look at regular expression syntax in general.

In its simplest form, a regular expression is just a word or phrase for which to search. For example, the regular expression 'John' would be found in any string that contains 'John'. Strings like 'John', 'AJohn', and 'Decker John' would all match.

In regular expressions, some characters are reserved for special purposes. These are called metacharacters. For instance, '.' matches any single character except a newline, and the quantifier '*' matches zero or more occurrences of the element that precedes it, so '.*' matches any sequence of characters. Hence, the regular expression '.ine' matches any four-character string that ends with 'ine', including 'line' and 'nine'.

But what if you want to search for a string containing a period, say, a reference to pi? The following regular expression would not work:

3.141592

This would indeed match "3.141592", but it would also match "3x141592" and "38141592". To get around this, we can use another metacharacter, the backslash (\). The backslash indicates that the character immediately to its right is to be taken literally. Thus, to search for the string "3.141592", we would use:

3\.141592
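As a minimal sketch of the difference, the two patterns can be compared with Pattern.matches() from the java.util.regex package described in the next section. The class name EscapeDemo and its helper methods are hypothetical, chosen only for illustration. Note that inside a Java string literal the backslash itself has to be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class EscapeDemo {
    // Unescaped: '.' matches any single character.
    static boolean matchesUnescaped(String input) {
        return Pattern.matches("3.141592", input);
    }

    // Escaped: "\\." in Java source is the regex \. (a literal period).
    static boolean matchesEscaped(String input) {
        return Pattern.matches("3\\.141592", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesUnescaped("3x141592")); // true
        System.out.println(matchesEscaped("3x141592"));   // false
        System.out.println(matchesEscaped("3.141592"));   // true
    }
}
```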

Regular Expressions in JDK 1.4

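The entire regular expression support is contained in the package java.util.regex and is built around two main classes: Pattern, which represents a compiled regular expression, and Matcher, which applies a Pattern to a character sequence.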

A typical implementation of text searching and/or manipulation using the java.util.regex package is divided into three steps.

  1. Compile the regular expression into an instance of Pattern
  2. Use the Pattern object to create a Matcher object.
  3. Use the Matcher object to search and/or manipulate the character sequence

A typical invocation sequence might be like the example to follow, which uses a regular expression to match 'cats', followed by any number of characters, followed by 'dogs':

Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
boolean flag=matcher.matches();

We will look at each of the above steps in detail in the next few sections.

Creating Patterns

The Pattern class provides an overloaded static factory method, compile(), to create Pattern instances: compile(String regex) and compile(String regex, int flags).

Flags

In the java.util.regex package, text matching is case-sensitive by default, and case-insensitive matching, when enabled, assumes that only US-ASCII characters are being compared. To modify this default behavior, you can pass flags to the compile() method. All flags are static int constants of Pattern. To combine behaviors, you can bitwise OR flags together with the "|" operator.

Flag              Purpose
CANON_EQ          Enables canonical equivalence in the search.
CASE_INSENSITIVE  Enables case-insensitive matching.
COMMENTS          Permits white space and comments in the pattern. When this flag is set, white space and embedded comments starting with # are ignored.
DOTALL            By default, the metacharacter '.' does not match a line terminator; with this flag, it matches any character, including a line terminator.
MULTILINE         Enables multiline mode. The metacharacters '^' and '$' then match just after and just before a line terminator, respectively, as well as at the beginning and end of the input sequence.
UNICODE_CASE      When specified together with CASE_INSENSITIVE, makes case-insensitive matching consistent with the Unicode Standard.
UNIX_LINES        Enables Unix lines mode, in which only '\n' is recognized as a line terminator by '.', '^', and '$'.
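Combining flags might look like the following sketch. The class FlagsDemo and its method names are assumptions made for this illustration; only Pattern.compile() and the flag constants come from the package itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class FlagsDemo {
    // CASE_INSENSITIVE lets "CATS" match "cats";
    // DOTALL lets '.' match line terminators as well.
    static final Pattern WITH_FLAGS =
        Pattern.compile("cats.*dogs",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matchesWithFlags(String input) {
        return WITH_FLAGS.matcher(input).matches();
    }

    static boolean matchesWithoutFlags(String input) {
        return Pattern.compile("cats.*dogs").matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "CATS and\nDOGS";
        System.out.println(matchesWithFlags(input));    // true
        System.out.println(matchesWithoutFlags(input)); // false
    }
}
```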

Creating Matchers

Once we have a compiled Pattern, we call matcher(CharSequence) on it to create a Matcher.

java.lang.CharSequence is an interface to represent a readable sequence of characters. The String, StringBuffer, and CharBuffer classes implement this interface. Typically, we pass Strings to the matcher method:

Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
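Because matcher() accepts any CharSequence, text held in a StringBuffer can be matched without first converting it to a String. The following is a small sketch of that idea; the class CharSequenceDemo and the method matchesBuffer are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class CharSequenceDemo {
    // Accepts any CharSequence: String, StringBuffer, CharBuffer, ...
    static boolean matchesBuffer(CharSequence input) {
        Pattern pat = Pattern.compile("cats.*dogs");
        Matcher matcher = pat.matcher(input);
        return matcher.matches();
    }

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer("cats");
        buf.append(" and dogs");
        System.out.println(matchesBuffer(buf)); // true
    }
}
```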

Text Matching and Manipulation

In text matching, we want to check either whether the entire string matches the regular expression or whether only a part of the string matches it.

Exact Match

The Matcher class provides the matches() method to test for an exact match, that is, whether the entire input sequence matches the pattern.

Here's an example of using matches():

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExactMatch
{
    public static void main(String[] args)
    {
        // This regular expression means 'cats' followed
        // by any number of characters (except line
        // terminators), followed by 'dogs'
        Pattern pat = Pattern.compile("cats.*dogs");
        Matcher matcher =
            pat.matcher("cats and dogs");
        boolean flag = matcher.matches(); // true

        // This regular expression means 'house' followed
        // by one or more characters (except line
        // terminators), followed by 'family'
        Pattern pat2 =
            Pattern.compile("house.+family");
        Matcher matcher2 =
            pat2.matcher("housefamily");
        boolean flag2 = matcher2.matches(); // false
    }
}

The above example could be implemented using only the Pattern class, which has its own matches() method:

Example:

// 'cats' followed by any number of characters
// (except new line Character), followed by 'dogs'
boolean flag =
    Pattern.matches("cats.*dogs",
                    "cats and dogs"); //true

Partial Match

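Matcher provides two methods for partial string matching, find() and lookingAt(), each used for a slightly different purpose. lookingAt() tests whether the pattern matches at the beginning of the input sequence, without requiring the whole sequence to match.

find() scans the input for the next subsequence that matches the pattern. It is particularly useful when you are interested in all of the subsequences of the input character sequence that match the pattern. Both methods appear in the following example: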

Pattern pat=Pattern.compile("john");

Matcher matcher =
    pat.matcher ("Hello, I am john parker");
boolean flag=matcher.find(); //true

// matches() returns false since the
// entire input sequence doesn't match
// the regular expression.
flag=matcher.matches(); //false

Pattern pat2=Pattern.compile("john");
// lookingAt() returns true because the
// regular expression matches the beginning of
// the string
Matcher matcher2 =
    pat2.matcher("john parker is my name");
flag = matcher2.lookingAt(); //true
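Each call to find() resumes after the previous match, so calling it in a loop visits every matching subsequence. The sketch below collects each match together with its start index; the class FindAllDemo and the method findAll are hypothetical names, and the use of generics assumes Java 5 or later rather than JDK 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class FindAllDemo {
    // Collects every non-overlapping match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group() + "@" + matcher.start());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [john@0, john@16]
        System.out.println(
            findAll("john", "john parker and john smith"));
    }
}
```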

Say we have a requirement to implement a program that checks that an input string is a valid web site URL in the form of www.domain-name.top-level-domain, and if it is valid, prints its domain name and its top-level domain. The regular expressions covered so far in this article will help in validating the input string, but not in extracting the domain name and its top-level domain. Capturing groups in a regular expression make it possible to extract meaningful information from the matched string.

Capturing Groups

In a regular expression, parentheses are used for grouping subexpressions, but they also capture the characters matched by each subexpression. Capturing groups are numbered by counting their opening parentheses from left to right. Group zero always stands for the entire expression.

In the expression ((x)(y(z))), for example, there are four such groups, in addition to group zero:

Group #  Subexpression
0        ((x)(y(z)))  (the entire expression)
1        ((x)(y(z)))
2        (x)
3        (y(z))
4        (z)
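Group numbering can be checked empirically. The following sketch matches the expression ((x)(y(z))) against the input "xyz" and lists what each group captured; the class GroupNumberingDemo and the method groupsOf are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class GroupNumberingDemo {
    // Returns the text captured by groups 0..groupCount(),
    // or an empty array if the input does not match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("((x)(y(z)))", "xyz");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + i + ": " + groups[i]);
        }
        // group 0: xyz
        // group 1: xyz
        // group 2: x
        // group 3: yz
        // group 4: z
    }
}
```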

During a match, each subsequence of the input sequence that matches such a group is saved. The captured input associated with a group is always the subsequence that the group most recently matched. As a convenience, the following methods are provided in Matcher for working with capturing groups.

Method                   Description
int end()                Returns the index of the last character matched, plus one.
int end(int group)       Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation. If the match was successful but the group itself did not match anything, then it returns -1.
String group()           Returns the input subsequence matched by the previous match.
String group(int group)  Returns the input subsequence captured by the given group during the previous match operation.
int groupCount()         Returns the number of capturing groups in this matcher's pattern.
int start()              Returns the start index of the previous match.
int start(int group)     Returns the start index of the subsequence captured by the given group during the previous match operation. If the match was successful but the group itself did not match anything, then it returns -1.

If Matcher evaluates a group a second time because of quantifiers, then the group's previously captured value will be retained if the second evaluation fails. But all captured input is discarded at the beginning of each match.
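Two of these rules can be illustrated with a short sketch: a group that takes no part in a successful match reports a start index of -1, and a group evaluated repeatedly under a quantifier keeps its earlier capture when a later evaluation fails. The class GroupStateDemo, its method names, and the two patterns used are assumptions chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class GroupStateDemo {
    // Matches "(a(b)?)+" and returns what group 2 holds afterwards.
    static String lastInnerGroup(String input) {
        Matcher m = Pattern.compile("(a(b)?)+").matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(2);
    }

    // Matches "(a)|(b)" and returns the start index of group 1.
    static int startOfFirstAlternative(String input) {
        Matcher m = Pattern.compile("(a)|(b)").matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.start(1);
    }

    public static void main(String[] args) {
        // The second pass over "a" finds no "b", yet group 2 keeps "b".
        System.out.println(lastInnerGroup("aba"));        // b
        // Group 1 did not participate in matching "b".
        System.out.println(startOfFirstAlternative("b")); // -1
    }
}
```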

Here's an example of validating a web site URL, and extracting its domain name, and top-level domain using Capturing groups:

String str="www.onjava.com";
String regExpr = "www\\.(.+)\\.(.+)";
Pattern pat;

// Pattern Matching will be case insensitive.
pat =
  Pattern.compile(regExpr,Pattern.CASE_INSENSITIVE);

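Matcher matcher=pat.matcher(str);

// matches() requires the entire input to fit the
// pattern, which is what validation calls for.
if(matcher.matches())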
{
 System.out.println("Input is valid.\n");
 System.out.println("Domain:"+matcher.group(1));
 System.out.println("TLD is:" + matcher.group(2));
} else {
 System.out.println("Input is not valid.");
}

This produces the output:

Input is valid.

Domain:onjava
TLD is:com

'\' is the escape character in regular expressions. Because '\' is also the escape character in Java string literals, a single regular-expression backslash is written as \\ in Java source code, as in the above web site validation example.

A Matcher may be reset explicitly by invoking its reset() method, or if a new input sequence is desired, by calling its reset(CharSequence) method. Resetting a matcher discards its explicit state information. An example of resetting is shown below:

Pattern pat=Pattern.compile("John");

Matcher matcher =
    pat.matcher("Hello World");
boolean flag = matcher.find(); //false

//Reset matcher with new input sequence.
matcher.reset("Hello John");
flag=matcher.find(); //true

String Replacement

Matcher provides the replaceAll(String) and replaceFirst(String) methods for string replacement.

The following code snippet shows an example of replaceAll():

String oldString =
      "Telephone is a new technology. " +
      "People can carry Telephones with them " +
      "in office, college, trains etc.";
Pattern pat=Pattern.compile("Telephone");
Matcher matcher=pat.matcher(oldString);

// Replace all occurrences of 'Telephone' in
// oldString with 'Cellular Phone'
String newString =
    matcher.replaceAll("Cellular Phone");

System.out.println(newString);

This produces the output:

Cellular Phone is a new technology. People can
carry Cellular Phones with them in office,
college, trains etc.
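replaceFirst() works the same way but stops after the first occurrence, and the replacement string of either method may refer to capturing groups as $1, $2, and so on. A brief sketch of both follows; the class ReplaceDemo, its method names, and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class ReplaceDemo {
    // Replaces only the first occurrence of 'Telephone'.
    static String replaceFirstPhone(String text) {
        return Pattern.compile("Telephone")
                      .matcher(text)
                      .replaceFirst("Cellular Phone");
    }

    // $1 and $2 in the replacement refer to the captured groups.
    static String swapNames(String name) {
        return Pattern.compile("(\\w+) (\\w+)")
                      .matcher(name)
                      .replaceAll("$2, $1");
    }

    public static void main(String[] args) {
        // prints: Cellular Phone one, Telephone two
        System.out.println(
            replaceFirstPhone("Telephone one, Telephone two"));
        // prints: Parker, John
        System.out.println(swapNames("John Parker"));
    }
}
```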

String Tokenizing

A record stored in a flat file is typically formatted using a separator character to separate the individual fields in the record. If the separator is a single character like '|', ',', or a tab, then the StringTokenizer class can be used to split the line into fields. But if the separator is more complex (say, '~__'), then parsing with regular expressions is helpful. Pattern provides a useful method for this, split(CharSequence), which splits the input around matches of the pattern.

Here's a simple example of this technique:

/* Variable str represents a record with
   field values first name, last name, and middle
   initial separated by ~__.
*/
String str="Cathy~__Paul~__C";

Pattern pat=Pattern.compile("~__");

// Split the record by ~__.
String[] flds=pat.split(str);

System.out.println("Fields are:\n");

for(int i=0;i<flds.length;i++)
{
    System.out.println(flds[i]);
}

This produces the output:

Fields are:

Cathy
Paul
C
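Since the separator is itself a regular expression, it does not have to be a fixed string. The sketch below splits on a comma or semicolon surrounded by optional white space, and also shows the split(CharSequence, int) overload, which limits the number of fields returned. The class SplitDemo, its method names, and the sample record are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class SplitDemo {
    // Separator: ',' or ';' with optional surrounding white space.
    private static final Pattern SEP =
        Pattern.compile("\\s*[,;]\\s*");

    static String[] splitFields(String record) {
        return SEP.split(record);
    }

    // At most 'limit' fields; the last one holds the remainder.
    static String[] splitFields(String record, int limit) {
        return SEP.split(record, limit);
    }

    public static void main(String[] args) {
        String record = "Cathy , Paul;C";
        // prints [Cathy, Paul, C]
        System.out.println(Arrays.asList(splitFields(record)));
        // prints [Cathy, Paul;C]
        System.out.println(Arrays.asList(splitFields(record, 2)));
    }
}
```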

Email Validation Example

This article began by showing the difficulty of validating a String as an email address. Let's see how easy it is to implement the same thing using regular expressions.

// Variable str represents an email address
// to be validated.

String str="administrator@admin.com";

Pattern pat=Pattern.compile(".+@.+\\.[a-z]+");

Matcher matcher=pat.matcher(str);
boolean flag=matcher.find(); // true;

matcher.reset("administrator@admincom");
flag=matcher.find(); // false;

matcher.reset("administrator@a.");
flag=matcher.find(); // false;

matcher.reset("administrator@at.admin.com");
flag=matcher.find(); // true;

In the above example, [a-z] is a character class that defines a range of characters from 'a' to 'z'. A character class matches any single input character that is one of the characters it defines.
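Because the example above uses find(), it accepts any input that merely contains something shaped like an email address. One possible refinement, sketched here under the assumption that the whole input should be an address, is to compile the pattern once and test it with matches(). The class EmailCheck and its methods are hypothetical names, and the pattern is the article's deliberately loose one, not a complete email grammar.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class EmailCheck {
    // Same pattern as above, compiled once and reused.
    private static final Pattern EMAIL =
        Pattern.compile(".+@.+\\.[a-z]+");

    // True only if the entire input fits the pattern.
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    // True if any part of the input fits the pattern.
    static boolean containsEmail(String input) {
        return EMAIL.matcher(input).find();
    }

    public static void main(String[] args) {
        String noisy = "administrator@admin.com!!!";
        System.out.println(containsEmail(noisy));  // true
        System.out.println(looksLikeEmail(noisy)); // false
        System.out.println(
            looksLikeEmail("administrator@admin.com")); // true
    }
}
```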

Regular Expressions and Multithreading

Instances of the Pattern class are immutable and are safe for use by multiple concurrent threads. However, instances of the Matcher class are not safe for use by multiple concurrent threads, because a matcher carries explicit state, which includes the start and end indices of the most recent successful match. That state also includes the start and end indices of the input subsequence captured by each capturing group in the pattern, as well as a total count of such subsequences.

The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
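A common way to respect these rules, sketched below, is to share one compiled Pattern and let every call create its own Matcher. The second helper shows the IllegalStateException raised when a matcher is queried before any match has been attempted. The class SharedPatternDemo and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class SharedPatternDemo {
    // Immutable, so one instance can serve all threads.
    private static final Pattern WORD = Pattern.compile("john");

    static int countMatches(String input) {
        // A fresh Matcher per call; it is never shared.
        Matcher m = WORD.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    static boolean queryBeforeMatchThrows() {
        Matcher m = WORD.matcher("john");
        try {
            m.start(); // no match attempted yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args)
            throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                System.out.println(countMatches("john met john"));
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(queryBeforeMatchThrows()); // true
    }
}
```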

Conclusion

The java.util.regex package in JDK 1.4 is quite handy and useful to the developers of search, extract, and replace systems such as search engines, rule-based data formation and transformation engines, EAI, and so on. Regular expressions are also used in extracting meaningful information from large chunks of text data, hence the java.util.regex package will significantly help Java developers with efficient development and maintenance of such applications. Although the examples in this article were simplistic, they lay the groundwork for examining the usefulness of the package.

The java.util.regex package has several other features for appending, text replacement, and greedy/non-greedy pattern matching. Space won't allow us to discuss them in more detail here, so see the JDK 1.4 Documentation on java.util.regex to learn more about using regular expressions in Java.

Hetal C. Shah is an IT consultant specializing in Internet-related technologies.



Copyright © 2009 O'Reilly Media, Inc.