Quick quiz: how do you rewrite the statement below, which alternates between two constants, without a conditional?
if (x == a) x= b;
else x= a;
Answer:
x= a ^ b ^ x;
//where x is equal to either a or b
The ^ character is the bitwise XOR operator. How does the code work? With XOR, applying the same operand twice always gives you back the original value.
Case 1: x equals a        Case 2: x equals b

x= a ^ b ^ x              x= a ^ b ^ x          Use formula
x= a ^ b ^ a              x= a ^ b ^ b          Substitute
x= b ^ a ^ a              x= a ^ b ^ b          Reorder variables
x= b                      x= a                  The last two variables cancel out
While not a popular technique, delving into the black art of bitwise manipulation can boost performance in Java. The rewrite above benchmarked at 5.2 seconds, versus 5.6 seconds for the conditional statement on my setup. (See Resources below for the sample code.) Avoiding conditionals can often increase performance on modern processors with multiple pipelines.
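As a quick sanity check, here is the toggle as a tiny, self-contained program (the class and variable values are ours, not from the benchmark code mentioned above):

```java
public class XorToggle {
    public static void main(String[] args) {
        final int a = 10, b = 99;
        int x = a;
        x = a ^ b ^ x;   // x was a, so the two a's cancel: x becomes b
        System.out.println(x); // 99
        x = a ^ b ^ x;   // x was b, so the two b's cancel: x becomes a
        System.out.println(x); // 10
    }
}
```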
Of particular interest to Java programmers are certain tricks related to what are called bitsets. In Java, the int and long integer primitives can double as a kind of set of bits with 32 or 64 elements. These simple data structures can be combined to represent larger structures, including arrays. Additionally, certain special operations (called bit-parallel) effectively condense a number of separate operations into one. When using finer-grained data (bits rather than integers), we will find that operations that do only one thing at a time when applied to integers can sometimes do many things at once when applied to bits. We can think of bit-parallel operations as rediscovered software pipelines in Java systems. (Of course, assembly programmers may not see much new about this!)
Advanced chess-playing programs like Crafty (written in C) use a special chess data structure called a bitboard to represent chess positions faster than arrays can. Java programmers should be even more interested in avoiding arrays than C programmers. While C has a fast implementation of arrays, Java arrays (which have more features, like bounds checking and garbage collection) are on the slow side. A simple test I performed on my system (Windows XP, AMD Athlon, HotSpot JVM) comparing integer access to array access shows that array access takes 160 percent longer than integer access. This doesn't even touch on the garbage collection issue. Bitsets, which are implemented as integers, can replace arrays in some situations.
Let's take a look at dusting off some of those old assembly-age tricks, with a special emphasis on bitsets. Java actually has good bitwise support for a portable language. Additionally, Tiger adds a number of useful bit-manipulation methods to the API.
Java programs are pretty far removed from the bit-crunching machines and operating systems on which they run. While modern CPUs often have special instructions to manipulate bitsets, you certainly can't execute these instructions in Java. The JVM supports only signed and unsigned shift, and bitwise XOR, OR, AND, and NOT. Ironically, Java programmers find themselves in the same boat as many assembly programmers who happened to be targeting CPUs lacking extra bitwise instructions. The assembly programmers had to emulate these instructions, as quickly as possible, in software. Many of the new Tiger methods do just this in pure Java.
C-language programmers who are micro-tuning may review their actual assembly code, consult the optimizer's manual for their target CPU, and actually count how many instruction cycles their code will run in. In contrast, a practical problem with doing low-level optimization in Java is determining the results. The Sun JVM spec says nothing about the relative speed at which the given opcodes are executed. You may spoon-feed the JVM the code you think will run fastest, only to be shocked by a lack of improvement or even detrimental effects. The JVM may either mangle your optimizations somehow, or it may be optimizing the normal code internally. In any case, any JVM optimizations, even if they are mainstays in compiled languages, should be benchmarked.
The standard tool to look at bytecodes, javap, is included in the JDK. However, users of Eclipse can use the Bytecode Outline plugin from Andrei Loskutov. A reference book about the JVM's design and instruction set is available on Sun's online Java book page. Note that there are two kinds of memory in every JVM. There is a stack, where local primitives and expressions are stored, and a heap, where objects and arrays are stored. The heap is subject to garbage collection, while the stack is a fixed chunk of memory. While most programs only use a small amount of stack, the JVM specification states that you can expect the stack to be at least 64K in size. Take special note that the JVM is really a 32/64-bit machine, so the byte and short primitives are unlikely to run any faster than the int.
In Java, all integer primitives are in a signed number format known as "two's complement." To manipulate the primitives, we need to understand them.
There are two types of two's complement numbers: nonnegative and negative. The highest bit, the sign bit, usually shown on the left, is set to zero for nonnegative numbers. These numbers are "normal": you simply read them left to right and convert to the base you want. However, if the sign bit is on, then the number is negative and the remaining bits represent a negative number.
There are two ways to look at the negative numbers. In the first way, negatives count up starting at the smallest possible number and end at -1. So, for a byte, they start at 10000000 (or -128 decimal), then 10000001 (or -127), all the way up to 11111111 (or -1). The second way to think about them is a little odd. When the sign bit is on, instead of having leading zeros followed by one bits, there are leading ones followed by zero bits. However, you must also subtract one from the result. For example, 11111111 is just the sign bit padded by seven ones, which is "negative zero" (-0). We then add (or subtract, depending on how you look at it) one to get -1. 11111110 is -2, 11111101 is -3, and so on.
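You can check both readings in a few lines of Java. The cast to byte and the & 0xFF mask below are only there to display eight bits at a time; the specific values are our own examples:

```java
public class TwosComplementDemo {
    public static void main(String[] args) {
        byte smallest = (byte) 0b10000000;  // sign bit only
        byte negOne   = (byte) 0b11111111;  // sign bit plus seven ones
        byte negTwo   = (byte) 0b11111110;
        System.out.println(smallest); // -128
        System.out.println(negOne);   // -1
        System.out.println(negTwo);   // -2
        // Show the bit pattern of -128 as an 8-bit string:
        System.out.println(Integer.toBinaryString(smallest & 0xFF)); // 10000000
    }
}
```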
It may seem strange, but we can do a lot of operations by mixing the bitwise operators together with the arithmetic operators. For example, to change between x and -x, negate and add one: (~x)+1. This can be seen in the following table.
x          ~x          (~x)+1 or -x

0111 (7)   1000 (-8)   1001 (-7)
0110 (6)   1001 (-7)   1010 (-6)
0101 (5)   1010 (-6)   1011 (-5)
0100 (4)   1011 (-5)   1100 (-4)
0011 (3)   1100 (-4)   1101 (-3)
0010 (2)   1101 (-3)   1110 (-2)
0001 (1)   1110 (-2)   1111 (-1)
0000 (0)   1111 (-1)   0000 (0)
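The identity behind the table is easy to verify mechanically; this short check (our own harness, not from the article) runs it over a range of values:

```java
public class NegationDemo {
    public static void main(String[] args) {
        // Check the identity -x == (~x) + 1 over a range of values.
        for (int x = -1000; x <= 1000; x++) {
            if ((~x) + 1 != -x) {
                throw new AssertionError("identity failed for x=" + x);
            }
        }
        System.out.println("(~x)+1 equals -x for every value tested");
    }
}
```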
The bit flag pattern is common knowledge and widely used in the public APIs of GUIs. Perhaps we are writing a Java GUI for a constrained device like a cell phone or a PDA. We have widgets like buttons and drop-down lists that each have a list of Boolean options. With bit flags, we can stuff a large number of options into a single word.
//The constants to use with our GUI widgets:
final int visible = 1 << 0; // 1
final int enabled = 1 << 1; // 2
final int focusable = 1 << 2; // 4
final int borderOn = 1 << 3; // 8
final int moveable = 1 << 4; // 16
final int editable = 1 << 5; // 32
final int borderStyleA = 1 << 6; // 64
final int borderStyleB = 1 << 7; //128
final int borderStyleC = 1 << 8; //256
final int borderStyleD = 1 << 9; //512
//etc.
myButton.setOptions( 1+2+4 );
//set a single widget.
int myDefault= 1+2+4; //A list of options.
int myExtras = 32+128; //Another list.
myButtonA.setOptions( myDefault );
myButtonB.setOptions( myDefault | myExtras );
Your program can pass around a whole group of Boolean variables in no time with a simple assignment expression. Perhaps the API states that the user can have only one of borderStyleA, borderStyleB, borderStyleC, or borderStyleD at the same time. To check, first select those four bits with a mask and second, check to see that the result has, at most, one bit. The code below uses a little trick we will explain soon.
int illegalOptionCombo=
    2+ 64+ 128+ 512;    // 10 11000010
int borderOptionMask=
    64+ 128+ 256+ 512;  // 11 11000000
int temp= illegalOptionCombo &
    borderOptionMask;   // 10 11000000
int rightmostBit=
    temp & (-temp);     // 00 01000000
If temp is not equal to rightmostBit, that means temp must have more than one bit, because rightmostBit will contain zero if temp is zero; otherwise it contains only one bit.
if (temp != rightmostBit)
throw new IllegalArgumentException();
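The same check can be packaged as a reusable predicate. The method name here is our own, not part of any API; it uses the x & (x - 1) trick (turn off the rightmost 1-bit) from the table of operations: if the result is zero, there was at most one bit to begin with.

```java
public class BitFlagCheck {
    // True if v has at most one bit set (v is zero or a power of two).
    static boolean atMostOneBit(int v) {
        return (v & (v - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(atMostOneBit(0));        // true
        System.out.println(atMostOneBit(64));       // true: one border style
        System.out.println(atMostOneBit(64 + 128)); // false: two border styles
    }
}
```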
The example above is a toy example. In the real world, AWT and Swing do use the bit flag pattern, but inconsistently. java.awt.geom.AffineTransform uses it extensively. java.awt.Font uses it, as does java.awt.InputEvent.
To get very far with bitsets, you need to know the standard "tricks," or operations you can perform. There are new bitset API methods as of the J2SE 5.0 (Tiger) release. If you're using an older release, you can just cut and paste the new methods into your code. A recent book with a lot of material on bitwise algorithms is Hacker's Delight by Henry S. Warren, Jr. (See www.hackersdelight.org or read the book on Safari.)
The following table shows some operations that can be done either with a line of code or one of the API methods:
y= all 0 bits                                 y= 0;
y= all 1 bits                                 y= -1;
y= all zeros except for the rightmost,
   or least significant, bit                  y= 1;
y= all zeros except for the leftmost,
   or sign, bit                               y= Integer.MIN_VALUE;
y= the rightmost 1-bit of x                   y= x & (-x);
y= the leftmost 1-bit of x                    y= Integer.highestOneBit(x);
y= the rightmost 0-bit of x                   y= ~x & (x + 1);
y= x with the rightmost 1-bit turned off      y= x & (x - 1);
y= x with the rightmost 0-bit turned on       y= x | (x + 1);
y= the number of leading zeros in x           y= Integer.numberOfLeadingZeros(x);
y= the number of trailing zeros in x          y= Integer.numberOfTrailingZeros(x);
y= the number of 1 bits in x                  y= Integer.bitCount(x);
y= x with the bits reversed                   y= Integer.reverse(x);
y= x after a rotated shift left by c units    y= Integer.rotateLeft(x, c);
y= x after a rotated shift right by c units   y= Integer.rotateRight(x, c);
y= x with the bytes reversed                  y= Integer.reverseBytes(x);
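A few of the table's entries, exercised on a sample value (the value 0b01011000, decimal 88, is an arbitrary choice of ours):

```java
public class BitTricks {
    public static void main(String[] args) {
        int x = 0b01011000; // 88
        System.out.println(x & (-x));                 // 8: the rightmost 1-bit
        System.out.println(x & (x - 1));              // 80: rightmost 1-bit off
        System.out.println(x | (x + 1));              // 89: rightmost 0-bit on
        System.out.println(~x & (x + 1));             // 1: the rightmost 0-bit
        System.out.println(Integer.bitCount(x));      // 3
        System.out.println(Integer.highestOneBit(x)); // 64
    }
}
```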
To get an idea of how long these methods take, one can step through the source. Some methods are more cryptic than others. They are all explained in Hacker's Delight. They are either one-liners or a few lines long, like highestOneBit(int) below.
public static int highestOneBit(int i)
{
    i |= (i >> 1);
    i |= (i >> 2);
    i |= (i >> 4);
    i |= (i >> 8);
    i |= (i >> 16);
    return i - (i >>> 1);
}

At this point, you can grab a bottle of vodka and a shot glass, because things are getting serious.
Chess used to be a hot research area in computer science during the height of the Cold War. Apparently independently of one another, both Soviet and American teams came up with a new data structure called the bitboard. The American team, Slate and Atkin, seems to have been first into print, with a chapter on Chess 4.x in Chess Skill in Man and Machine. The Soviet team, which included Mikhail Donskoy among others, wrote a bitboard-enabled program called Kaissa. Both programs competed successfully internationally.
Before looking at bitboards, let's first look at the standard way to represent a chess board in Java (and many other languages).
//declare an integer enum for the possible
//state of the 64 individual squares.
final int EMPTY = 0;
final int WHITE_PAWN = 1;
final int WHITE_KNIGHT = 2;
final int WHITE_BISHOP = 3;
final int WHITE_ROOK = 4;
final int WHITE_QUEEN = 5;
final int WHITE_KING = 6;
final int BLACK_PAWN = 7;
final int BLACK_KNIGHT = 8;
final int BLACK_BISHOP = 9;
final int BLACK_ROOK = 10;
final int BLACK_QUEEN = 11;
final int BLACK_KING = 12;
//And we have an array of 64 squares.
int[] board= new int[64];
The array method is extremely straightforward. In contrast, the bitboard structure is made up of twelve 64-bit bitsets, one for each type of piece. They are visualized as being stacked on top of each other.
//Declare twelve 64-bit integers for one board:
long WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK;
Figure 1. The bitboard data structure
Why is there no layer for empty squares? Since EMPTY is a function of the other twelve layers, including it would create redundant data. To calculate EMPTY, we just join the 12 layers and negate:
long NOT_EMPTY=
    WP | WN | WB | WR | WQ | WK |
    BP | BN | BB | BR | BQ | BK;
long EMPTY= ~NOT_EMPTY;
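Here is that calculation on the standard starting position. The hex constants encode the usual opening setup; the bit layout (bit 0 = a1, bit 63 = h8) is our assumption, not something the article fixes:

```java
public class BitboardDemo {
    public static void main(String[] args) {
        // Standard starting position, one 64-bit layer per piece type.
        long WP = 0x000000000000FF00L, WN = 0x0000000000000042L,
             WB = 0x0000000000000024L, WR = 0x0000000000000081L,
             WQ = 0x0000000000000008L, WK = 0x0000000000000010L;
        long BP = 0x00FF000000000000L, BN = 0x4200000000000000L,
             BB = 0x2400000000000000L, BR = 0x8100000000000000L,
             BQ = 0x0800000000000000L, BK = 0x1000000000000000L;
        long NOT_EMPTY = WP | WN | WB | WR | WQ | WK
                       | BP | BN | BB | BR | BQ | BK;
        long EMPTY = ~NOT_EMPTY;
        System.out.println(Long.bitCount(NOT_EMPTY)); // 32: pieces on the board
        System.out.println(Long.bitCount(EMPTY));     // 32: empty squares
    }
}
```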
Chess-playing programs work by generating lots of legal moves and then choosing the best one. A number of calculations need to be done. The squares being attacked have to be determined, so that they can be avoided and so that check and checkmate can be determined. Each chess piece has a different way of attacking. Looking at the code that computes these different attacks will demonstrate some pros and cons of the bitboard. Some of the pieces fit in our bitboard scheme elegantly, while others don't.
Let's look at the kings first. They simply attack the squares adjacent to them. Depending on where they are, that's from three to eight squares. The king can be in the center, on the edge, or in the corner of the board; all need to be considered while coding.
To generate the king's 64 attacks programmatically at runtime, we can first start with the base case, eight attacks, and transform it into the special cases. First, create a mask for one of the squares near the center, say B2 (the 10th square). Figure 2 shows some longs that will be used to create this mask.
Figure 2. Finding the king's attacks
long KING_ON_B2=
    1L | 1L << 1 | 1L << 2 |
    1L << 8 | 1L << 10 |
    1L << 16 | 1L << 17 | 1L << 18;
We want to shift that box of ones around, visually speaking, left, right, up, and down; but when shifting left and right, watch out for the wraparound effect:
SHIFTED_LEFT= KING_ON_B2 >>> 1;
Oops. We wrapped the box around (see Figure 2). In chess,
a vertical row is called a file. If we shift a layer
visually left by one, we know that the rightmost file,
file H
, should always be zeros. Let's code that:
final long FILE_H=
    1L | (1L<<8) | (1L<<16) | (1L<<24) |
    (1L<<32) | (1L<<40) | (1L<<48) | (1L<<56);
KING_ON_A2= (KING_ON_B2 >>> 1) & ~FILE_H;
If we want to shift visually right, that's a different case:
KING_ON_C2= (KING_ON_B2 << 1) & ~FILE_A;
To shift up and down visually, just shift by eight:
KING_ON_B1= KING_ON_B2 << 8;
KING_ON_B3= KING_ON_B2 >>> 8;
In the real world, we can avoid having to write code to generate the king attacks by just hardcoding the 64 possibilities. Also, we are trying to avoid arrays, so let's not just stuff our attacks into a 64-element array. An alternative is a 64-way switch statement; it may not be pretty, but it works well.
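As a sketch, the 64 cases can also be produced by one small function that shifts a king bit in all eight directions and masks off the wraparound files. The layout here (bit 0 = a1, << 1 moves toward the h-file) is our assumption and is chosen for clarity; it is not necessarily the article's orientation:

```java
public class KingAttacks {
    static final long FILE_A = 0x0101010101010101L; // a-file, bit 0 = a1
    static final long FILE_H = 0x8080808080808080L; // h-file

    // Attack set for a king given as a single set bit.
    static long kingAttacks(long king) {
        long l = (king >>> 1) & ~FILE_H;  // one square toward the a-file
        long r = (king << 1)  & ~FILE_A;  // one square toward the h-file
        long row = l | r | king;
        return (l | r) | (row << 8) | (row >>> 8); // add the ranks above/below
    }

    public static void main(String[] args) {
        long b2 = 1L << 9; // square b2, the 10th square
        System.out.println(Long.bitCount(kingAttacks(b2))); // 8: center-ish king
        long a1 = 1L;      // corner square
        System.out.println(Long.bitCount(kingAttacks(a1))); // 3: corner king
    }
}
```

Feeding each of the 64 one-bit inputs through this function yields exactly the table a hardcoded switch would contain.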
Let's look at the pawns next. While there is only one king per color, there are up to eight pawns. We can find all eight attacks just as easily as we determined a single king's attack! Note that pawns attack diagonally. To shift visually up and down, we shift by eight, and for left or right, we shift by one. Shift by seven or nine to go diagonally (that's 8-1 and 8+1). Figure 3 shows how this works.
PAWN_ATTACKS=
    ((WP << 7) & ~FILE_A) | ((WP << 9) & ~FILE_H);
Figure 3. Pawn attacks for white
This code fragment works regardless of how many pawns are in play or where they are. Robert Hyatt, the author of Crafty, calls this a bit-parallel operation because it works on a number of pieces simultaneously. Bit-parallel expressions can be very powerful and are the key thing to look for in your own projects. If you see a lot of bit-parallel operations, you might have a good candidate for bitwise optimization.
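Here is the pawn expression as a runnable sketch. As with the king example, the bit layout (bit 0 = a1) and the resulting choice of which file mask goes with which shift are our assumptions:

```java
public class PawnAttacks {
    static final long FILE_A = 0x0101010101010101L; // assumed: bit 0 = a1
    static final long FILE_H = 0x8080808080808080L;

    // All squares attacked by any white pawn, in one bit-parallel expression.
    static long whitePawnAttacks(long wp) {
        return ((wp << 7) & ~FILE_H)   // captures toward the a-file
             | ((wp << 9) & ~FILE_A);  // captures toward the h-file
    }

    public static void main(String[] args) {
        long pawns = 0x000000000000FF00L; // all eight pawns on rank 2
        long attacks = whitePawnAttacks(pawns);
        // Eight pawns attack the entire third rank, in one expression:
        System.out.println(attacks == 0x0000000000FF0000L); // true
    }
}
```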
For a counterexample, let's look at how we have to do the pawns with arrays:
for (int i=0; i<56; i++)
{
    if (board[i] == WHITE_PAWN)
    {
        if ((i+1) % 8 != 0) pawnAttacks[i+9]= true;
        if ((i+1) % 8 != 1) pawnAttacks[i+7]= true;
    }
}
Above, we loop through almost the entire array, and not very quickly. We could rewrite it so that we keep track of the pawns and iterate through the (up to eight) pawns rather than the squares, but we would still be looking at many more bytecodes than need to be run with the bitboard method.
Knights are similar to kings and pawns. We can determine knight attacks with a precomputed table, just like we did with the king. Since there can be more than one knight per color, the attacks are technically bit-parallel. However, since there are rarely more than two knights per color, that isn't going to help much in a practical sense (there can be more than two if the player promotes his pawns to knights, but that isn't likely).
Bishops, rooks, and queens all slide around the board. While their potential attacks are always the same, the actual attacks depend on what pieces lie along their attack lines. To determine legal moves, you must look at the squares along the attack lines individually. This is our worst-case scenario. There are no bit-parallel operations possible. We are down to accessing individual squares as we would with the array. In addition, atomic access is more cumbersome (i.e., bug-prone) with the bitboard than it is with the array.
An advantage of bitboards is that we can use masks to answer a lot of typical chess questions. For example, if you have a bishop, you might want to know how many of your opponent's pawns are on your bishop's color. Figure 4 illustrates this "attack mask" problem.
Figure 4. How many white pawns on white squares?
Chess has fairly complex rules, and this means there are pros and cons for using the bitboard. For some pieces bitboards are faster and for some they're slower. We went over the slower pieces to show that we aren't dealing with magic pixie dust. We can imagine a game that is very similar to chess, perhaps with different pieces, where bitsets might actually backfire or just not be worth the trouble. Bitset optimization must be considered carefully.
In general, bitsets have these pros and cons:
Pros: bit-parallel operations condense many separate operations into one; bitsets avoid slow array access and the garbage collection that comes with arrays; and they pack a lot of state into very little memory.
Cons: the code is cryptic and bug-prone; accessing an individual element is more cumbersome than indexing an array; and bitsets only pay off when the data is tightly related and the operations fit the bit-parallel model.
To generalize the chess example, think of the horizontal and vertical positions on the board as two independent variables or attributes. There are 8x8=64 bits required to represent this. In addition, we required 12 layers, one for each type of piece. There are two ways the bitboard can expand: 1) we can add more bits per layer, or 2) we can add more layers. In the real world, we are limited to 64 bits per layer maximum, but let's consider a fantasy 128-bit JVM with a double-long 128-bit primitive. With 128 bits, we would have enough space to put all 16 pawns, black and white, on the same layer (8x8x2=128). That would reduce the number of layers needed and might simplify some obscure operations, but doing so would complicate (and also slow down) functions that work on all pawns of a certain color. All of Java's bit operations work on all of the bits in a primitive. Data that requires its own operations or functions benefits from being in its own layer. We can do bit-parallel operations on all of the bits in a layer faster than we can do bit-parallel operations on some of the bits. We might find some clever use for the extra 64 bits, but we don't want to mix the 12 types of pieces. If we do use more than two variables per layer, we can still change all of the variables for all of the elements in a layer at the same time. Consider 3D tic-tac-toe, represented in Figure 5.
There are three possible positions along each of the three axes, or 3x3x3=27 possible cells. We need a 32-bit bitset for each player.
Figure 5. A format for 3D tic-tac-toe
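A minimal sketch of that format, assuming cell (x, y, z) maps to bit x + 3*y + 9*z (the mapping and all names are our own; a winning line is then just another 27-bit mask):

```java
public class TicTacToe3D {
    // Assumed encoding: cell (x,y,z), each coordinate 0..2, is bit x + 3*y + 9*z.
    static int bit(int x, int y, int z) {
        return x + 3 * y + 9 * z;
    }

    public static void main(String[] args) {
        int playerX = 0, playerO = 0;
        playerX |= 1 << bit(0, 0, 0);
        playerX |= 1 << bit(1, 1, 1);
        playerX |= 1 << bit(2, 2, 2);   // the main space diagonal

        // A winning line is itself just a 27-bit mask:
        int diagonal = (1 << bit(0, 0, 0))
                     | (1 << bit(1, 1, 1))
                     | (1 << bit(2, 2, 2));
        System.out.println((playerX & diagonal) == diagonal); // true: X wins
        System.out.println((playerO & diagonal) == diagonal); // false
    }
}
```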
We can even chain together 64-bit bitsets (Figure 6) to implement Conway's Game of Life, a simulation game played on a large grid. The game goes through generations where the death or birth of a new "cell" is determined by the number of adjacent cells. A dead cell with exactly three live neighbors is (re)born. A live cell with zero or one neighbor dies (of loneliness). A live cell with more than three neighbors dies (of overcrowding). There are many variants where a different number of neighbors results in birth, life, or death. Figure 6 shows a life construct that will actually survive and move across the grid. To generate the next turn of the simulation, we shift the grid in each of the eight neighbor directions to get eight layers, L1 through L8, and then count, for every cell at once, how many of those layers are set. The counts live in nine bitsets, SUM_IS_0 up to SUM_IS_8. We can calculate these by using an algorithm that first considers two layers, then three layers, then four, and so on up to the eighth. To sum up layers:
//S0 is short for "sum equals zero" etc.
//L1 is short for "layer one" etc.
//Look at just the first two layers:
S0= ~(L1 | L2);
S1= L1 ^ L2;
S2= L1 & L2;
//Now look at the third layer:
S3= L3 & S2;
S2= (S2 & ~L3) | (S1 & L3);
S1= (S1 & ~L3) | (S0 & L3);
S0= S0 & ~L3;
//The fourth layer:
S4= S3 & L4;
S3= (S3 & ~L4) | (S2 & L4);
S2= (S2 & ~L4) | (S1 & L4);
S1= (S1 & ~L4) | (S0 & L4);
S0= S0 & ~L4;
//Repeat this pattern up to the 8th layer.
The algorithm takes 42 lines of code. It could be tuned a little more if needed, but it already gives a number of advantages. It uses no conditionals, which can stall processors. It's a nice, simple block of code that a JIT or HotSpot compiler is sure to compile. And, most importantly, it calculates all 64 cells in parallel.
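The layer-summing logic is easy to verify mechanically. This sketch (our own test harness, not from the article) folds in three layers and then checks every bit position against a plain per-cell count:

```java
public class NeighborSums {
    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        long L1 = rnd.nextLong(), L2 = rnd.nextLong(), L3 = rnd.nextLong();

        // Two layers:
        long S0 = ~(L1 | L2);
        long S1 = L1 ^ L2;
        long S2 = L1 & L2;
        // Fold in the third layer:
        long S3 = L3 & S2;
        S2 = (S2 & ~L3) | (S1 & L3);
        S1 = (S1 & ~L3) | (S0 & L3);
        S0 = S0 & ~L3;

        // Check every bit position against a per-cell count.
        long[] s = { S0, S1, S2, S3 };
        for (int i = 0; i < 64; i++) {
            int count = (int) ((L1 >>> i) & 1)
                      + (int) ((L2 >>> i) & 1)
                      + (int) ((L3 >>> i) & 1);
            for (int c = 0; c <= 3; c++) {
                boolean bitSet = ((s[c] >>> i) & 1) == 1;
                if (bitSet != (count == c))
                    throw new AssertionError("mismatch at bit " + i);
            }
        }
        System.out.println("all 64 cells counted correctly");
    }
}
```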
Figure 6. A format for The Game of Life
Game-related applications of bitsets are easier to find than non-game examples. The reason is that games like chess have rich relationships between each bit. For chess, there are 12 layers of 64 bits, or 768 bits. Each of the 768 bits is essentially related to the rest somehow. In business apps, information is usually less tightly related.
Keeping an open mind may lead to uses of bitsets in any problem domain. However, it is up to the developer's good judgment whether a particular situation is right for bitsets. Frankly, you may never need to use these techniques. However, the methods above may be just what is needed to speed up certain Java applications. If not, you can still mystify your friends with your superior hacking. Enjoy.
Glen Pepicelli is a software guru living with his dog in upstate New York.
Copyright © 2009 O'Reilly Media, Inc.