ONJava.com -- The Independent Source for Enterprise Java

Building Wireless Web Clients, Part 1: Pitfalls of MIDP HTTP

How Much Data Will I Get?

Having connected to the server and successfully requested the HTML page for a book, we now need to read the response. The most obvious question to ask at this point is how much data are we going to receive or, more to the point, how can we tell when we have read all of the data that we're going to get? Answering this question is not as simple as you might think.



Back in the early days of the Web, each exchange between a browser and a Web server required a separate TCP/IP connection, which the browser would use to send the HTTP request headers and any accompanying data, and then wait for the server to reply. Having sent all of its reply, the server would close the connection. With this simple arrangement, the browser could simply read data until it detected that the connection had been closed, so there was no need for the server to explicitly indicate how much data it was going to send.

Unfortunately, this mode of operation turns out to be very expensive in most cases, because the time required to create a TCP/IP connection is often very significant when compared to the time spent subsequently transferring the data. For this reason, version 1.1 of the HTTP protocol allows the browser and the server to establish a TCP/IP connection and then use it more than once to exchange messages. In this mode, usually called "keepalive mode," the browser will typically send a request, read the server's response, send another request, and so on. Since TCP/IP is a stream-based rather than a message-based protocol, and since the connection is no longer being closed, there is no way for either the server or the browser to know when it has received all of the data from its peer, unless the data itself contains an embedded length. To solve this problem, HTTP 1.1 includes a Content-Length header, which can be used by the browser when sending a request and by the server in its response, to indicate how many bytes of data follow the HTTP headers.
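
To see why an explicit length is needed on a reused connection, consider the minimal sketch below. It is not part of the bookstore client: the class name LengthFraming and its methods are invented for illustration, and an in-memory stream stands in for the TCP/IP connection. Two responses arrive back to back on the same stream, and only the Content-Length value tells the reader where the first body ends and the next status line begins.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: LengthFraming and its methods are hypothetical names.
final class LengthFraming {

    // Reads one line of ASCII text, dropping the \r\n terminator.
    static String readLine(InputStream in) throws IOException {
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Consumes the status line and headers up to the blank line and
    // returns the Content-Length value, or -1 if the header is absent.
    static int readContentLength(InputStream in) throws IOException {
        int length = -1;
        String line;
        while ((line = readLine(in)).length() > 0) {
            int colon = line.indexOf(':');
            if (colon > 0
                    && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                length = Integer.parseInt(line.substring(colon + 1).trim());
            }
        }
        return length;
    }

    // Reads exactly "length" bytes of body and returns them as text.
    static String readBody(InputStream in, int length) throws IOException {
        byte[] body = new byte[length];
        int count = 0;
        while (count < length) {
            int n = in.read(body, count, length - count);
            if (n == -1) {
                throw new IOException("Unexpected end of stream");
            }
            count += n;
        }
        return new String(body, "US-ASCII");
    }

    // Reads one complete response (headers plus body) from the stream.
    static String readResponse(InputStream in) throws IOException {
        int length = readContentLength(in);
        if (length < 0) {
            throw new IOException("No Content-Length header");
        }
        return readBody(in, length);
    }

    public static void main(String[] args) throws IOException {
        String wire = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nHELLO"
                    + "HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nBYE";
        InputStream in = new ByteArrayInputStream(wire.getBytes("US-ASCII"));
        System.out.println(readResponse(in));
        System.out.println(readResponse(in));
    }
}
```

Without the length, the reader of the first response would have no way to stop before consuming the headers of the second one.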

The MIDP specification requires that support for HTTP 1.1 be provided and, therefore, keepalive mode is available (and, in fact, it is the default). A MIDP client application, therefore, appears to have two choices:

  1. Use the HTTP 1.0 method with no re-use of the connection.
  2. Use the HTTP 1.1 method, which might be more efficient for some applications.

To use the HTTP 1.0 method, you add an HTTP header called Connection with the value Close. This asks the server to close the connection after all of its data has been sent. To retrieve the data, you simply read until an end-of-file condition is reached.

To use the HTTP 1.1 method, you omit the Connection header and, if you are sending data along with the request, you also need to add the Content-Length header to specify the length of the data. In practice, though, the MIDP implementation includes this header for you.

In most cases, the server will return a response that includes a Content-Length header of its own. You can retrieve its value using the HttpConnection getLength() method and then loop until you have read that many bytes from the input stream. However, even though you may ask the server to keep the connection open, it is not obliged to do so -- it may still close the connection after sending its reply and, if it decides to do this (or if it is, in fact, an HTTP 1.0 server that doesn't understand keepalives), it is not obliged to send you a Content-Length header. This means that even if you allow keepalives, you still have to be prepared to handle the alternative.

In order to keep the code for this article as simple as possible, our bookstore client opts for the HTTP 1.0 mode of operation by including the Connection header with value Close. However, if you expect to exchange more than one pair of messages with the same server in a reasonably short space of time, you should allow the server to keep the connection open, in which case you have to be prepared either to use the length in the Content-Length header, if the server returns it, or read until the connection is closed, if it does not, using code that looks something like this (where only the essential details are shown):

int length = conn.getLength();	// Value from Content-Length header
if (length != -1) {
	// HTTP 1.1 mode - content length supplied
	// Read exactly "length" bytes from the connection
	int count = 0;
	while (count < length) {
		int bytes = inputStream.read(buffer, 0, Math.min(BUFFER_SIZE, length - count));
		if (bytes == -1) {
			// Unexpected end of file
			break;
		}

		count += bytes;

		// Do something with the bytes in "buffer" - not shown.
	}
} else {
	// Read until -1 is returned
	int count;
	while ((count = inputStream.read(buffer, 0, BUFFER_SIZE)) != -1) {
		// Do something with the bytes in "buffer" - not shown
	}
	// End of file
}

When the client and server operate in HTTP 1.1 keepalive mode, it is still possible that the getLength() method will return -1, even if the TCP/IP connection is not going to be closed to mark the end of the server's response. This can happen when the server chooses to use chunked mode when returning its response. Setting up a Content-Length header requires that the software (such as a servlet) that is responsible for generating the reply knows in advance how much data it is going to send. In some cases, this is not practical. Chunked mode allows the server to send out a response in chunks, where each individual chunk is preceded by its length and the end of the reply is marked by a chunk with length 0. The format of the body of a chunked HTTP reply message looks something like this:

19\r\n
This is a chunk of data. \r\n
16\r\n
This is another chunk.\r\n
0\r\n
\r\n

The first line gives the size of the following chunk in hexadecimal, followed by the characters \r\n. The chunk data comes next, again followed by \r\n. The end of the reply is marked by a chunk of zero length and a final \r\n, as shown. A server may also place trailer headers between the zero-length chunk and that final \r\n.

The MIDP HTTP implementation takes care of managing chunks for you, so your code does not need to detect chunk boundaries and extract the length parts, etc. -- you just see a single stream of bytes which, in the example above would be:

This is a chunk of data. This is another chunk.

Since the complete length is not known until the last chunk has been read (and the HTTP 1.1 specification explicitly requires that a Content-Length header must not be included when using chunked encoding), calling getLength() when the server elects to supply a chunked response always results in -1 being returned, and you have no option but to read until the read() method returns -1.
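
Although MIDP decodes chunks for you, the format is easy to follow in code. The sketch below is an illustration only, not the MIDP implementation: the class ChunkedDecoder and its methods are invented names, it reads from an in-memory stream rather than a connection, and it simply discards any chunk extensions and trailer headers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: ChunkedDecoder and its methods are hypothetical names.
final class ChunkedDecoder {

    // Reads one line of ASCII text, dropping the \r\n terminator.
    static String readLine(InputStream in) throws IOException {
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Decodes a chunked message body into the bytes it carries.
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (;;) {
            // The size line is hexadecimal; ignore any ";extension" part.
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) {
                break;      // Zero-length chunk marks the end of the body
            }

            // Read exactly "size" bytes of chunk data.
            byte[] chunk = new byte[size];
            int count = 0;
            while (count < size) {
                int n = in.read(chunk, count, size - count);
                if (n == -1) {
                    throw new IOException("Unexpected end of stream inside a chunk");
                }
                count += n;
            }
            out.write(chunk, 0, size);

            readLine(in);   // Consume the \r\n that follows the chunk data
        }

        // Skip trailer headers, if any, up to the final blank line.
        while (readLine(in).length() > 0) {
            // Nothing to do: trailers are ignored in this sketch
        }
        return out.toByteArray();
    }

    // Convenience wrapper for ASCII test data.
    static String decodeToString(String wire) throws IOException {
        InputStream in = new ByteArrayInputStream(wire.getBytes("US-ASCII"));
        return new String(decode(in), "US-ASCII");
    }

    public static void main(String[] args) throws IOException {
        String wire = "19\r\nThis is a chunk of data. \r\n"
                    + "16\r\nThis is another chunk.\r\n"
                    + "0\r\n\r\n";
        System.out.println(decodeToString(wire));
    }
}
```

Fed the example body shown earlier, decodeToString() returns the same single run of text that a MIDlet would see when reading from the connection's input stream.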

Incidentally, the MIDP HTTP implementation may choose to write the body of your MIDlet's HTTP request to the server in chunked form. This is invisible to the code, but not to the server. Strictly speaking, this should not be a problem, because MIDP only supports interworking with HTTP 1.1 servers, which are required to accept chunked data. In the real world, however, it is sometimes necessary to communicate with an HTTP 1.0 server or application, which would not understand chunked encoding. You can avoid having your request sent in chunked mode by following both of the following rules:

  • Make sure that the message body is no more than 256 bytes long.
  • Do not call the flush() method on the output stream. Instead, write your message body to the output stream before opening the input stream or trying to access reply headers. This way, your message body will be sent without requiring a flush() call.

Obviously, this advice depends on the current MIDP HTTP implementation. Nevertheless, if you absolutely must communicate with a server that does not understand chunking, you have no choice but to rely on it.

Reading and Interpreting the Data

Now that we know how to detect when we have received all of the server's reply, the last thing we have to do is extract the information that we need from it, which is simply a matter of looking for certain strings in the HTML page. The book title, for example, follows the fixed string "buying info:", making up the text between the end of that string and the next end of line characters. Since the String class has convenient methods (such as indexOf) that make searching for substrings easy, the most obvious approach would be to read all of the server's reply into a single String, using code like this:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
int length = conn.getLength();	// Value from Content-Length header
if (length != -1) {
	// HTTP 1.1 mode with keepalive
	// Read exactly "length" bytes from the connection
	int count = 0;
	while (count < length) {
		int bytes = inputStream.read(buffer, 0, Math.min(BUFFER_SIZE, length - count));
		if (bytes == -1) {
			// Unexpected end of file
			break;
		}

		// Append data to the page
		baos.write(buffer, 0, bytes);

		count += bytes;
	}
} else {
	// Read until the connection is closed
	int count;
	while ((count = inputStream.read(buffer, 0, BUFFER_SIZE)) != -1) {
		// Append data to the page
		baos.write(buffer, 0, count);
	}
	// End of file
}

// Get the whole page
String page = baos.toString();
baos.close();	// Discard content

Notice that we use a ByteArrayOutputStream to gather the bytes that we receive and then call its toString() method to get the result in String form. This is convenient, not only because it manages the buffering of the data internally for the case where we don't know in advance exactly how much data we will receive, but also because the toString() method performs the necessary conversion from bytes to Unicode characters. Strictly speaking, this code should extract the character encoding of the received bytes from the HTTP headers of the server reply, but for simplicity we have assumed that the bytes are encoded using the client platform's default encoding. Since MIDP platforms are not guaranteed to support more than one encoding, this is an assumption you will normally have to make, in any case.
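
For completeness, here is a rough sketch of what honoring the server's declared encoding could look like. It is not taken from the bookstore client. The class CharsetSketch and its methods are hypothetical, the Content-Type value is passed in as a plain string (on MIDP it would typically come from the connection's getType() method), and the ISO-8859-1 fallback is an assumption based on the default that the HTTP 1.1 specification gives for text content. If the platform does not support the named encoding, the sketch falls back to the platform default, which is the same assumption the code above makes.

```java
import java.io.UnsupportedEncodingException;
import java.util.Locale;

// Illustrative only: CharsetSketch and its methods are hypothetical names.
final class CharsetSketch {

    // Returns the charset parameter of a Content-Type header value,
    // or "fallback" if the value is null or has no charset parameter.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        int start = lower.indexOf("charset=");
        if (start < 0) {
            return fallback;
        }
        start += "charset=".length();
        int end = contentType.indexOf(';', start);
        if (end < 0) {
            end = contentType.length();
        }
        String name = contentType.substring(start, end).trim();

        // The value may be quoted, as in charset="utf-8".
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            name = name.substring(1, name.length() - 1);
        }
        return name.length() == 0 ? fallback : name;
    }

    // Converts the received bytes to a String using the declared charset,
    // falling back to the platform default if it is not supported.
    static String decode(byte[] data, String contentType) {
        String charset = charsetOf(contentType, "ISO-8859-1");
        try {
            return new String(data, charset);
        } catch (UnsupportedEncodingException e) {
            return new String(data);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xC3, (byte) 0xA9 };   // UTF-8 bytes for e-acute
        System.out.println(charsetOf("text/html; charset=UTF-8", "ISO-8859-1"));
        System.out.println(decode(data, "text/html; charset=UTF-8").length());
    }
}
```

With this in place, the last step of the earlier listing would pass baos.toByteArray() to decode() instead of calling baos.toString().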

The problem with this simple approach is that while it would work on a J2SE platform, it requires too much memory for many of today's MIDP devices. Amazon's Web pages contain a lot of information -- in fact, a typical page would be at least 40KB long. Gathering this into a single String means that we have to be able to hold at least one copy of the whole page in memory (and the code shown above actually involves more than one copy). Unfortunately, this is not always possible. On a typical PalmOS device with 8MB of memory, the maximum memory available for the Java heap, which has to hold all of the objects and dynamic data for the Java VM, is only 64KB! The result of using this approach in such a resource-constrained environment is likely to be an OutOfMemoryError! The memory limitation can be overcome by reading the page content in relatively small chunks and scanning for the strings we are looking for in each section as we read it. The details of this are a little messy and not particularly interesting, so we'll just take a brief look at some of the code to demonstrate how it works. If you want to see the complete implementation, you'll find it in the source files BookInfo.java and InputHelper.java.

To get the information that we require, we need to do the following:

  1. Find the string "buying info:". This is followed by the book title, which needs to be extracted and stored.
  2. Find the string "Based on" and then get the number of reviews, which follows it.
  3. Find the string "Sales Rank:", which is followed by the book's sales ranking.

Because we are going to process the page content as we receive it and without storing much of it, it isn't possible to move backwards, so we have to search for these strings in the order shown. The major problem that we have to deal with is that we won't necessarily ever have any of these strings completely in memory at the same time -- we might find, for example, "buying" in one bufferful of data and "info:" in the next, or we might receive each character individually. To keep the code readable, we create a class called InputHelper that hides the details of working with the characters as they arrive from the input stream and offers higher-level methods that provide the necessary searching capability. The following code shows how this class is used to get the book title, its sales rank, and the number of reviews from the input stream, where the variable helper is an instance of the InputHelper class:

boolean found = helper.moveAfterString("buying info: ");
if (!found) {
    return;
}

// Gather the title from the rest of this line
StringBuffer titleBuffer = helper.getRestOfLine();

// Look for the number of reviews
found = helper.moveAfterString("Based on ");
if (!found) {
    return;
}

// Gather the number of reviews from the current location
String reviewString = helper.gatherNumber();

// Look for the sales rank
found = helper.moveAfterString("Sales Rank: ");
if (!found) {
    return;
}

// Gather the number from the current location
String rankingString = helper.gatherNumber();

This code is, I think you'll agree, perfectly readable. The messy part is hidden in the InputHelper class. Here, for example, is how the gatherNumber() method, which scans forward until it finds a sequence of characters representing a number and collects them all into a string, is implemented:

// Gets the characters for a number, ignoring
// the grouping separator. The number starts at the
// current input position, but any leading non-numerics
// are skipped.
public String gatherNumber() throws IOException {
    StringBuffer sb = new StringBuffer();
    boolean gotNumeric = false;
    for (;;) {
        char c = getNext();
        if (c == (char)0) {
            // End of input: stop with whatever digits we have
            break;
        }

        // Skip until we find a digit.
        boolean isDigit = Character.isDigit(c);
        if (!gotNumeric && !isDigit) {
            continue;
        }
        gotNumeric = true;
        if (!isDigit) {
            if (c == '.' || c == ',') {
                continue;
            }
            break;
        }
        sb.append(c);
    }
    return sb.toString();
}

    
// Gets the next character from the stream,
// returning (char)0 when all input has been read.
private char getNext() throws IOException {
    if (charsLeft == 0) {
        charsLeft = reader.read(buffer, 0, BUFFER_SIZE);
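        if (charsLeft < 0) {
            // End of input: reset the count so that a later call
            // tries to read again instead of indexing stale data
            charsLeft = 0;
            return (char)0;
        }
        nextchar = 0;
    }
    charsLeft--;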
    return buffer[nextchar++];
}

The key to this method, and most of the others in this class, is the helper method getNext(), which returns a single character from the input stream. For efficiency, instead of reading from the stream obtained from the HttpConnection each time getNext() is invoked, the data is held in a relatively small buffer, which is refilled whenever it is empty, and characters are returned from there. In this implementation, a 1024-byte buffer is used. The code shown here is written for the case in which the server will close the connection after sending the reply (or will use chunked encoding). To ensure that this will happen, the client includes a Connection: Close header with its request. Modifying the code to allow reuse of the connection if the server allows it is, however, fairly simple and might be undertaken as an exercise.

The gatherNumber() method calls getNext() whenever it needs a character, skipping until it finds a digit and then collecting digits until it finds something that is not numeric. The InputHelper class has several other methods used in processing the Web page returned by the server that work in exactly the same way. Writing the code in this way is slightly more complex than it would be if we could work with the entire reply in the form of a String and is almost certainly a little slower, but the environment in which the code executes leaves us with no real choice.
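
As an illustration of how such a method can cope with a search string that is split across buffers, here is a minimal sketch of a moveAfterString()-style search. It is not the code from InputHelper.java: the class name StreamSearch, the restOfLine() method, and the use of a Knuth-Morris-Pratt failure table are all assumptions made for this example. The point to notice is that the only state carried from one bufferful of data to the next is the number of characters of the target matched so far, so nothing has to be re-read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative only: StreamSearch is a hypothetical stand-in for InputHelper.
final class StreamSearch {
    private final Reader reader;
    private final char[] buffer;
    private int charsLeft = 0;
    private int nextChar = 0;

    // bufferSize must be greater than zero.
    StreamSearch(Reader reader, int bufferSize) {
        this.reader = reader;
        this.buffer = new char[bufferSize];
    }

    // Returns the next character, or -1 when all input has been read.
    private int next() throws IOException {
        if (charsLeft == 0) {
            int n = reader.read(buffer, 0, buffer.length);
            if (n <= 0) {
                return -1;
            }
            charsLeft = n;
            nextChar = 0;
        }
        charsLeft--;
        return buffer[nextChar++];
    }

    // Consumes input up to and including the first occurrence of "target".
    // Returns false if the input ends before the target is found.
    boolean moveAfterString(String target) throws IOException {
        int len = target.length();
        if (len == 0) {
            return true;
        }

        // fail[i] is the length of the longest proper prefix of the target
        // that is also a suffix of its first i + 1 characters.
        int[] fail = new int[len];
        for (int i = 1, k = 0; i < len; i++) {
            while (k > 0 && target.charAt(i) != target.charAt(k)) {
                k = fail[k - 1];
            }
            if (target.charAt(i) == target.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }

        int matched = 0;    // Characters of the target matched so far
        int c;
        while ((c = next()) != -1) {
            while (matched > 0 && c != target.charAt(matched)) {
                matched = fail[matched - 1];
            }
            if (c == target.charAt(matched)) {
                matched++;
            }
            if (matched == len) {
                return true;
            }
        }
        return false;
    }

    // Collects characters up to, but not including, the next line terminator.
    String restOfLine() throws IOException {
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = next()) != -1 && c != '\n' && c != '\r') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A 4-character buffer forces the target to span several refills.
        StreamSearch search = new StreamSearch(
            new StringReader("xx buying info: Java in a Nutshell\nmore"), 4);
        if (search.moveAfterString("buying info: ")) {
            System.out.println(search.restOfLine());
        }
    }
}
```

A simpler version that resets the match count to zero on every mismatch would miss some occurrences (for example, "aab" in the input "aaab"), which is why the sketch uses the failure table even though it adds a few lines.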

Summary

Although on the surface it appears that a MIDP HTTP client would be very much like one written for J2SE, this article has shown that, in the general case, you can't simply port existing J2SE code and expect it to work unchanged on a cell phone or PDA. The MIDP HTTP support currently presents a lower-level interface than its J2SE counterpart, requiring you to handle details, such as server redirection, that are taken care of automatically for desktop devices. Furthermore, the example shown in this article emphasized the ever-present need to be aware of the resource limitations imposed by MIDP devices and to restructure your code accordingly. In the second part of this article, we'll look at how to store the book's title, ISBN, and other information on a mobile device so that the user can keep track of changes without needing to enter the ISBN each time.

Kim Topley has more than 25 years' experience as a software developer and was one of the first people in the world to obtain the Sun Certified Java Developer qualification.

