ONJava.com -- The Independent Source for Enterprise Java

Regular Expressions in J2SE

Text Matching and Manipulation

In text matching, we want to check either whether the entire string matches the regular expression or whether only a part of the string matches it.



Exact Match

The Matcher class provides the matches() method to test for an exact match.

  • boolean matches()
    Attempts to match the entire input sequence against the pattern. This method succeeds only if the whole input character sequence is matched.

Here's an example of using matches():

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExactMatch
{
    public static void main(String[] args)
    {
        // 'cats', followed by any number of characters
        // (except line terminators), followed by 'dogs'.
        Pattern pat = Pattern.compile("cats.*dogs");
        Matcher matcher = pat.matcher("cats and dogs");
        boolean flag = matcher.matches(); // true

        // 'house', followed by one or more characters
        // (except line terminators), followed by 'family'.
        Pattern pat2 = Pattern.compile("house.+family");
        Matcher matcher2 = pat2.matcher("housefamily");
        // false: nothing lies between 'house' and 'family'
        boolean flag2 = matcher2.matches();
    }
}

The above example could be implemented using only the Pattern class, which has its own matches() method:

  • static boolean matches(String regex, CharSequence input)
    Compiles the given regular expression and attempts to match the given input sequence against it. If a regular expression is to be used multiple times, compiling it once and reusing it will be more efficient than invoking this method each time.
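
The efficiency note above can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named isCatsThenDogs(), that compiles the pattern once and reuses it for several inputs instead of recompiling it on every call:

```java
import java.util.regex.Pattern;

class ReusePattern {
    // Compiled once; Pattern instances are immutable and can be reused.
    private static final Pattern CATS_DOGS = Pattern.compile("cats.*dogs");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean isCatsThenDogs(String input) {
        return CATS_DOGS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "cats and dogs", "cats", "catsdogs" };
        for (String input : inputs) {
            System.out.println(input + " -> " + isCatsThenDogs(input));
        }
        // cats and dogs -> true
        // cats -> false
        // catsdogs -> true
    }
}
```

When a pattern is needed only once, the static convenience method remains the shorter choice.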

Example:

// 'cats', followed by any number of characters
// (except line terminators), followed by 'dogs'
boolean flag =
    Pattern.matches("cats.*dogs",
                    "cats and dogs"); //true

Partial Match

Matcher provides two methods for partial string matching, each used for a slightly different purpose.

  • boolean lookingAt()
    Attempts to match the input sequence, starting at the beginning, against the pattern. Like the matches() method, this method always starts at the beginning of the input sequence; unlike that method, it does not require that the entire input sequence be matched.
  • boolean find()
    Attempts to find the next subsequence of the input sequence that matches the pattern. This method starts at the beginning of the input sequence or, if a previous invocation succeeded and the matcher has not since been reset, at the first character not matched by the previous match.

find() is particularly useful when you are interested in all the subsequences of the given input character sequence that match the given pattern, as in the following example:

Pattern pat=Pattern.compile("john");

Matcher matcher =
    pat.matcher ("Hello, I am john parker");
boolean flag=matcher.find(); //true

// matches() returns false since the
// entire input sequence doesn't match
// the regular expression.
flag=matcher.matches(); //false

Pattern pat2=Pattern.compile("john");
// lookingAt() returns true because the
// regular expression matches the beginning of
// the string
Matcher matcher2 =
    pat2.matcher("john parker is my name");
flag = matcher2.lookingAt(); //true
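
To collect every matching subsequence, find() is usually called in a loop until it returns false. The following is a minimal sketch of that idiom; the helper name findAll() and the sample sentence are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAll {
    // Hypothetical helper: returns "text@start-end" for every match.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each successful find() resumes after the previous match.
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("john", "john met john parker and johnny"));
        // [john@0-4, john@9-13, john@25-29]
    }
}
```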

Suppose we need a program that checks whether an input string is a valid web site address of the form www.domain-name.top-level-domain and, if it is valid, prints its domain name and its top-level domain. The regular expressions covered so far in this article can validate the input string, but they cannot extract the domain name and the top-level domain. Capturing groups make it possible to extract such meaningful pieces from the matched string.

Capturing groups

In a regular expression, parentheses group sub-expressions, and they also capture the characters matched by each sub-expression. Capturing groups are numbered by counting their opening parentheses from left to right. Group zero always stands for the entire expression.

In the expression ((x)(y(z))), for example, there are four such groups, in addition to group zero:

Group #   Regular expression
0         ((x)(y(z)))  (the entire expression)
1         ((x)(y(z)))
2         (x)
3         (y(z))
4         (z)

During a match, each subsequence of the input sequence that matches such a group is saved. The captured input associated with a group is always the subsequence that the group most recently matched. As a convenience, the following methods are provided in Matcher for working with capturing groups.

  • int end()
    Returns the index of the last character matched, plus one.
  • int end(int group)
    Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation. If the match was successful but the group itself did not match anything, it returns -1.
  • String group()
    Returns the input subsequence matched by the previous match.
  • String group(int group)
    Returns the input subsequence captured by the given group during the previous match operation.
  • int groupCount()
    Returns the number of capturing groups in this matcher's pattern. Group zero is not included in this count.
  • int start()
    Returns the start index of the previous match.
  • int start(int group)
    Returns the start index of the subsequence captured by the given group during the previous match operation. If the match was successful but the group itself did not match anything, it returns -1.
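
As a rough illustration of these methods, the sketch below matches the input "xyz" against ((x)(y(z))) and reports what each group captured. The helper name describeGroups() and the output format are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: one line per group, as "number: text [start,end)".
    static List<String> describeGroups(String regex, String input) {
        List<String> lines = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.matches()) {
            // groupCount() does not include group zero, hence "<=".
            for (int g = 0; g <= matcher.groupCount(); g++) {
                lines.add(g + ": " + matcher.group(g)
                        + " [" + matcher.start(g) + "," + matcher.end(g) + ")");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeGroups("((x)(y(z)))", "xyz")) {
            System.out.println(line);
        }
        // 0: xyz [0,3)
        // 1: xyz [0,3)
        // 2: x [0,1)
        // 3: yz [1,3)
        // 4: z [2,3)
    }
}
```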

If a group is evaluated a second time because of a quantifier, its previously captured value is retained if the second evaluation fails. All captured input is discarded, however, at the beginning of each match operation.
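
A small sketch may make the quantifier rule concrete. With the pattern (\w\d)+ and the input "a1b2c", the group matches "a1", then "b2", and then fails on the lone "c", so the group keeps "b2". The helper name firstMatch() is an assumption for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifiedGroup {
    // Hypothetical helper: returns { whole match, group 1 } for the
    // first match in the input, or null if nothing matches.
    static String[] firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] result = firstMatch("(\\w\\d)+", "a1b2c");
        System.out.println("match:   " + result[0]); // a1b2
        System.out.println("group 1: " + result[1]); // b2
    }
}
```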

Here's an example of validating a web site URL and extracting its domain name and top-level domain using capturing groups:

String str = "www.onjava.com";
String regExpr = "www\\.(.+)\\.(.+)";

// Pattern matching will be case insensitive.
Pattern pat =
  Pattern.compile(regExpr, Pattern.CASE_INSENSITIVE);

Matcher matcher = pat.matcher(str);

// matches() requires the entire input to have the
// expected form; find() would also accept input that
// merely contains such a sequence.
if (matcher.matches())
{
 System.out.println("Input is valid.\n");
 System.out.println("Domain:" + matcher.group(1));
 System.out.println("TLD is:" + matcher.group(2));
} else {
 System.out.println("Input is not valid.");
}

This produces the output:

Input is valid.

Domain:onjava
TLD is:com

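The backslash (\) is the escape character in regular expressions, so \. matches a literal dot. Because the backslash is also the escape character in Java string literals, each backslash in the regular expression must be written as \\ in the source code, as in the above web site validation example.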
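
The effect of the escape can be seen by comparing an escaped and an unescaped dot. This is a minimal sketch; the helper names escaped() and unescaped() and the sample inputs are assumptions for illustration:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // "\\." in Java source is the regex \. and matches only a literal dot.
    static boolean escaped(String input) {
        return Pattern.matches("www\\.onjava\\.com", input);
    }

    // An unescaped dot matches any single character except a line terminator.
    static boolean unescaped(String input) {
        return Pattern.matches("www.onjava.com", input);
    }

    public static void main(String[] args) {
        System.out.println(escaped("www.onjava.com"));   // true
        System.out.println(escaped("wwwXonjavaYcom"));   // false
        System.out.println(unescaped("wwwXonjavaYcom")); // true
    }
}
```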

A Matcher may be reset explicitly by invoking its reset() method, or if a new input sequence is desired, by calling its reset(CharSequence) method. Resetting a matcher discards its explicit state information. An example of resetting is shown below:

Pattern pat=Pattern.compile("John");

Matcher matcher =
    pat.matcher("Hello World");
boolean flag = matcher.find(); //false

//Reset matcher with new input sequence.
matcher.reset("Hello John");
flag=matcher.find(); //true

String Replacement

Matcher provides the replaceAll(String) and replaceFirst(String) methods for string replacement.

  • String replaceAll(String replacementStr)
    Replaces every subsequence of the input sequence that matches the pattern with the given replacementStr string.
  • String replaceFirst(String replacementStr)
    Replaces only the first subsequence of the input sequence that matches the pattern with the given replacementStr string.

The following code snippet shows an example of replaceAll():

String oldString =
      "Telephone is a new technology. " +
      "People can carry Telephones with them " +
      "in office, college, trains etc.";
Pattern pat = Pattern.compile("Telephone");
Matcher matcher = pat.matcher(oldString);

// Replace all occurrences of 'Telephone' in
// oldString with 'Cellular Phone'
String newString =
    matcher.replaceAll("Cellular Phone");

System.out.println(newString);

This produces the output:

Cellular Phone is a new technology. People can
carry Cellular Phones with them in office,
college, trains etc.
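
For comparison, here is a minimal sketch of replaceFirst() next to replaceAll(). The class name, helper names, and shortened sample sentence are assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    private static final Pattern TELEPHONE = Pattern.compile("Telephone");

    // Replaces only the first occurrence of 'Telephone'.
    static String replaceFirstOnly(String text) {
        Matcher matcher = TELEPHONE.matcher(text);
        return matcher.replaceFirst("Cellular Phone");
    }

    // Replaces every occurrence of 'Telephone'.
    static String replaceEvery(String text) {
        Matcher matcher = TELEPHONE.matcher(text);
        return matcher.replaceAll("Cellular Phone");
    }

    public static void main(String[] args) {
        String text = "Telephone is new. People carry Telephones.";
        System.out.println(replaceFirstOnly(text));
        // Cellular Phone is new. People carry Telephones.
        System.out.println(replaceEvery(text));
        // Cellular Phone is new. People carry Cellular Phones.
    }
}
```

Note that a dollar sign or a backslash in the replacement string has a special meaning (a dollar sign followed by a digit refers to a captured group), so such characters need care when the replacement text is not a fixed literal.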

String Tokenizing

A record stored in a flat file is typically formatted with a separator that delimits the individual fields of the record. If the separator is a single character such as '|', ',', or a tab, then the StringTokenizer class can be used to split the line into fields. But if the separator is more complex (say, '~__'), then parsing with regular expressions is helpful. Pattern provides a useful method for this:

  • String[] split(CharSequence input)
    Splits the given input sequence around matches of this pattern.

Here's a simple example of this technique:

/* Variable str represents a record with the
   field values first name, last name, and middle
   initial, separated by ~__.
*/
String str = "Cathy~__Paul~__C";

Pattern pat = Pattern.compile("~__");

// Split the record around each ~__ separator.
String[] flds = pat.split(str);

System.out.println("Fields are:\n");

for (int i = 0; i < flds.length; i++)
{
        System.out.println(flds[i]);
}

This produces the output:

Fields are:

Cathy
Paul
C
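
Because the separator is itself a regular expression, it can also absorb optional whitespace around the delimiter, which a single-character tokenizer cannot do. The following is a minimal sketch under that assumption; the helper name splitRecord() and the sample record are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitDemo {
    // The separator ~__ together with any whitespace on either side of it.
    private static final Pattern SEPARATOR = Pattern.compile("\\s*~__\\s*");

    // Hypothetical helper: splits one record into its fields.
    static List<String> splitRecord(String record) {
        return Arrays.asList(SEPARATOR.split(record));
    }

    public static void main(String[] args) {
        System.out.println(splitRecord("Cathy ~__ Paul~__C"));
        // [Cathy, Paul, C]
    }
}
```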

Email Validation Example

This article began by showing the difficulty of validating a String as an email address. Let's see how easy it is to implement the same thing using regular expressions.

// Variable str represents an email address
// to be validated.

String str = "administrator@admin.com";

Pattern pat = Pattern.compile(".+@.+\\.[a-z]+");

// matches() is used so that the entire input,
// not just a part of it, must fit the pattern.
Matcher matcher = pat.matcher(str);
boolean flag = matcher.matches(); // true

matcher.reset("administrator@admincom");
flag = matcher.matches(); // false

matcher.reset("administrator@a.");
flag = matcher.matches(); // false

matcher.reset("administrator@at.admin.com");
flag = matcher.matches(); // true

In the above example, [a-z] is a character class that defines the range of characters from 'a' to 'z'. It matches any single input character that falls within that range.
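
When a pattern is used for validation, the choice between find() and matches() matters: find() succeeds if any part of the input fits the pattern, while matches() requires the whole input to fit. The sketch below contrasts the two with the same email pattern; the helper names containsEmail() and isEmail() are assumptions for illustration, and the pattern remains a deliberately loose check rather than a complete email validator:

```java
import java.util.regex.Pattern;

class ValidateDemo {
    private static final Pattern EMAIL = Pattern.compile(".+@.+\\.[a-z]+");

    // True if some part of the input looks like an email address.
    static boolean containsEmail(String input) {
        return EMAIL.matcher(input).find();
    }

    // True only if the entire input looks like an email address.
    static boolean isEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "see administrator@admin.com!";
        System.out.println(containsEmail(input)); // true
        System.out.println(isEmail(input));       // false: trailing '!'
        System.out.println(isEmail("administrator@admin.com")); // true
    }
}
```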

Regular Expressions and Multithreading

Instances of the Pattern class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class, however, are not safe for concurrent use because a matcher carries explicit state, which includes the start and end indices of the most recent successful match. That state also includes the start and end indices of the input subsequence captured by each capturing group in the pattern, as well as a total count of such subsequences.
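
One common way to follow this rule is to share a single compiled Pattern and create a fresh Matcher inside each thread. The following is a minimal sketch of that arrangement; the class name, helper names, and sample inputs are assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedPattern {
    // Safe to share: Pattern is immutable.
    private static final Pattern NAME = Pattern.compile("john");

    // Each call creates its own Matcher, so no matcher state is shared.
    static int countMatches(String input) {
        Matcher matcher = NAME.matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: counts matches for each input on its own thread.
    static int[] countInParallel(final String[] inputs) {
        final int[] counts = new int[inputs.length];
        Thread[] workers = new Thread[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> counts[slot] = countMatches(inputs[slot]));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait so that the results are visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countInParallel(
            new String[] { "john and john", "no match here", "johnjohnjohn" });
        for (int count : counts) {
            System.out.println(count);
        }
        // 2
        // 0
        // 3
    }
}
```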

The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
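
The behavior can be observed with a short sketch. The helper name startThrows() is an assumption for illustration; it reports whether querying start() raises an IllegalStateException, either before any match attempt or after a call to find():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    // Hypothetical helper: optionally calls find(), then reports whether
    // querying the matcher's state throws IllegalStateException.
    static boolean startThrows(String regex, String input, boolean findFirst) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (findFirst) {
            matcher.find();
        }
        try {
            matcher.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // No match attempted yet: the state is undefined.
        System.out.println(startThrows("john", "Hello john", false)); // true
        // After a successful find(), the state can be queried.
        System.out.println(startThrows("john", "Hello john", true)); // false
        // After a failed find(), there is still no match to query.
        System.out.println(startThrows("john", "Hello World", true)); // true
    }
}
```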

Conclusion

The java.util.regex package in JDK 1.4 is useful to developers of search, extract, and replace systems such as search engines, rule-based data formation and transformation engines, EAI, and so on. Regular expressions are also used to extract meaningful information from large chunks of text data, so the package helps Java developers build and maintain such applications efficiently. Although the examples in this article are simple, they lay the groundwork for exploring the package further.

The java.util.regex package has several other features for appending, text replacement, and greedy/non-greedy pattern matching. Space won't allow us to discuss them in more detail here, so see the JDK 1.4 Documentation on java.util.regex to learn more about using regular expressions in Java.

Hetal C. Shah is an IT consultant, specializing in Internet-related technologies.

