Linux DevCenter    
 Published on Linux DevCenter (http://www.linuxdevcenter.com/)
 See this if you're having trouble printing code examples


Range-Keyed Queries

by Dan Tow, author of SQL Tuning
01/06/2005

In a recent dialogue on the oracle-l@freelists.org mail group, I ran into a good prototype for a recurring class of SQL tuning problems. The coder had two versions of SQL getting dramatically different performance levels for a very frequent single-row query, and wanted to know the reason for the difference, citing logical and physical I/O figures for each alternative. To make a long story short, neither version was near optimal, and the optimal solution came from a strategy I've applied to a recurring class of problems, which I describe below. First, I present the original problem, summarized with his permission from the note by Mark Bobak (thanks, Mark!):

There are two tables involved: ADDS_USERS, which has AUSR_ID as the primary key, and AUTHORIZED_IP_ADDRESSES.

The original query was:

SELECT A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A,
AUTHORIZED_IP_ADDRESSES B
WHERE A.AUSR_ID = B.AUSR_ID AND :B1 BETWEEN 
B.AIA_IP_ADDRESS_START AND
B.AIA_IP_ADDRESS_END;

Autotrace apparently misrepresented the execution plan, which was the original source of confusion, and the reason for the original note. The trace statistics for the query were correct, however, and are edited here just to show the important statistics:

Statistics
----------------------------------------------------------
       1321  consistent gets
        864  physical reads
          1  rows processed

1321 buffer gets were too many, not to mention the physical I/O, considering the frequency with which this was to be executed.

Therefore, Mark tried re-creating ADDS_USERS ordered by AUSR_ID, to improve AUSR_PK index clustering factor. This didn't help. Creating AUTHORIZED_IP_ADDRESSES as an index-organized table was also useless.

So Mark went back to SQL hacking, and finally came up with the following:

SELECT (select A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A 
where
a.ausr_id = b.ausr_id) from AUTHORIZED_IP_ADDRESSES B
WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND B.AIA_IP_ADDRESS_END;

which produced the (accurate) plan:

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=110 Bytes=3080)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'ADDS_USERS' (Cost=2 Card=1 Bytes=14)
   2    1     INDEX (UNIQUE SCAN) OF 'AUSR_PK' (UNIQUE) (Cost=1 Card=1)
   3    0   INDEX (RANGE SCAN) OF 'AIA_INDX_PR01' (NON-UNIQUE) 
              (Cost=4 Card=3D110 Bytes=3D3080)

Statistics were much improved:

Statistics
----------------------------------------------------------
         67  consistent gets
          0  physical reads
          1  rows processed

Consistent gets were down to 67 from 1321, but I noticed that neither plan delivered a path that maximized efficiency for a single-row query. Here is the approach I recommend for this class of problems.

A class of tables I'll call range-keyed tables are laid out with one row per range of values, where the ranges of values are non-overlapping and (usually) inclusive, meaning they leave no "holes" in the midst of the ranges if the ranges are laid out end-to-end, so to speak. (These ranges must not overlap even at the endpoints, unless one end of the range is defined as non-inclusive.) The most common examples of this involve date ranges, although this particular example came from a case of IP-address ranges. Thus, to point to a row, you say:

:bindvariable between lowcol and highcol

or (where one range's highcol equals the next range's lowcol)

(:bindvariable > lowcol and : bindvariable <= highcol).

I'll show this another way. lowcol and highcol must reflect values belonging to some order-able set, some set capable of being mapped to points on a line, such as a number line or a timeline. Taking the case where one range's highcol value is the next range's lowcol value, the ranges would look something like Figure 1.

Figure 1
Figure 1. Ranges represented by rows 1 through 4

Most commonly, the designer chooses to create an index on (lowcol, highcol), but sometimes the order is reversed, or just one of these is indexed.

The problem is that any such index strategy means that the index range scan to find the single range that meets the query condition is likely to have to read about half of the index, if you ask for a range in the middle of the whole set of ranges. (With date ranges, you can often escape most of the dilemma because applications tend to ask for the most-recent range, or one of the most-recent ranges, anyway, so an index leading with highcol will see a very short range scan, usually. If, by contrast, you index (Low_Date, High_Date), in that order, you will guarantee that most index range scans will read the entire range, since every range will have Low_Date lower than the recent date inside the most-recent date range.)

E. F. Codd discussed (in The Relational Model for Database Management, Version 2, Addison Wesley) a useful set of comparators (greatest-less-than, least-greater-than, greatest-less-than-or-equal-to, and least-greater-than-or-equal-to) that turn out to be ideal for this problem. With these available, you'd just ask for an index range scan with:

highcol least-greater-than-or-equal-to :bindvariable

and the database would find the very first (least) value of highcol >= :bindvariable in an index range scan on a conventional index of highcol, and would then quit, because that's the row you want. The cost, if you look at the I/O, would be no different than an index unique scan! The database design for inclusive, non-overlapping ranges would no longer even require lowcol because lowcol would be implied by the previous range's highcol, and the database would automatically avoid potentially thorny corrupt-data issues with accidentally overlapping ranges!

Sadly, I've never found a database that has implemented this lovely concept. Instead, the best I've found is the following admittedly hacky solution I recommend for use on Oracle:

(Forewarning: the following solution may offend the relational purists among you. Yes, I know it's a hack.) Let's say the index on highcol is called highcolind. If you want to quickly find the row you're after, the query you need is:

SELECT /*+ INDEX(t highcolind) */ <whatever>
  FROM rangetable t
 WHERE highcol >= :bindvariable
   AND lowcol <= :bindvariable
   AND ROWNUM = 1

Here, the database won't keep searching for more ranges that might enclose :bindvariable once it finds the first one, and the range scan on highcolind finds the first one in the first index entry it sees, as long as the database follows the hint! (If it fails to follow the hint, performance will suffer, clearly, but the database will still return the right row, since it will keep looking until it satisfies both conditions on :bindvariable. This will not find multiple ranges that include :bindvariable, though, if the table contains overlapping ranges--I have no easy answer to find multiple ranges if your ranges overlap. If your ranges have holes, it will also have to do the entire range scan from :bindvariable to the end of the range before it will conclude that no range contains the particular value of :bindvariable you're looking with--that's why having no holes in the ranges helps.)

To build this into a more-complex query, you'd do something like:

SELECT /*+ ORDERED <maybe other hints> */ <whatever>
  FROM (SELECT /*+ INDEX(t highcolind) */ *
          FROM rangetable t
         WHERE highcol >= :bindvariable
           AND lowcol <= :bindvariable
           AND ROWNUM = 1) t, taba a, tabb b, ...
WHERE t.fkey = a.pkey
...

If you use this strategy, be sure to watch the endpoint cases. Mark's database design had endpoints that were inclusive, so he used BETWEEN to find the single range desired, and (presumably) the application guaranteed that the next range's value of AIA_IP_ADDRESS_START was the next legal address after the previous range's AIA_IP_ADDRESS_END. If these were Oracle DATE-type columns, the Low_Date for the next range would be one second after the High_Date for the previous range. If, on the other hand, the lowcol value for one range is equal to the highcol value of the previous range (as shown in Figure 1), then you must not use BETWEEN, and must use < or > for the comparison with the endpoint that is non-inclusive. For example, if High_Date is the last included time for one range, and Low_Date for the next range equals High_Date for the previous range, then the range condition (which is non-inclusive on the low end) will look like this:

WHERE High_Date >= :bindvariable
           AND Low_Date < :bindvariable

In Mark's particular case, a new version of the query would be:

SELECT /*+ ORDERED USE_NL(A) */ A.AUSR_LOGIN_SCREEN_NAME 
FROM 
(SELECT /*+ INDEX(B AIA_IP_ADDRESS_END_INDEX) */ * 
   FROM AUTHORIZED_IP_ADDRESSES B
  WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND
                    B.AIA_IP_ADDRESS_END
    AND ROWNUM = 1) IV,
ADDS_USERS A
WHERE A.AUSR_ID = IV.AUSR_ID;

which joins a single row from the range-keyed table to the users table. When Mark implemented this strategy, the fixed version of the query required just five logical I/Os, while the better version of the earlier two versions required 67 logical I/Os. More importantly, the fixed version read just a single row within each of its five logical I/Os, while most of the 67 logical I/Os of the better of the earlier queries were expensive range reads in index leaf blocks, which covered every row entry in the leaf block, burning far more CPU than a single-row read in a logical I/O.

Although I created this trick for use on Oracle, it should work on MySQL; use the clause “LIMIT 1” at the end of the query (or at the subquery inside the FROM clause), in place of the condition “ROWNUM=1”, and use the hint USE INDEX(<Keylist>) (or FORCE INDEX(<Keylist>) ) at the end of the table-reference, and the hint /*! STRAIGHT JOIN */ to force the join order (in place of ORDERED), if necessary. Thus, the generic case (for MySQL versions 4.1+) above becomes:

SELECT /*! STRAIGHT JOIN <maybe other hints> */ <whatever>
  FROM (SELECT *
          FROM rangetable t FORCE INDEX(highcol)
         WHERE highcol >= <bindvariable>
           AND lowcol <= <bindvariable>
           LIMIT 1) t INNER JOIN taba a ON t.fkey = a.pkey    
                      INNER JOIN tabb b ...
WHERE 
...

Similar tricks are likely possible in other open databases, as long as there is a way to force a join order and to stop a query, especially a subquery in a FROM clause, at the first row. I’d like to invite the readers to submit the equivalent versions for PostgreSQL, DB2, Firebird, and others.

SQL Tuning

Related Reading

SQL Tuning
By Dan Tow

Dan Tow is an independent consultant, operating under the banner SingingSQL (www.singingsql.com). His experience solving Oracle-related performance problems goes all the way back to his 1989 hire by Oracle Corporation. He has a Ph.D. in chemical engineering from the University of Wisconsin at Madison.


Return to the Linux DevCenter.

Copyright © 2009 O'Reilly Media, Inc.