ONJava.com -- The Independent Source for Enterprise Java
oreilly.comSafari Books Online.Conferences.

advertisement

AddThis Social Bookmark Button

The Effective Use of Joins in Select Statements

by Satya Komatineni
01/07/2004

There is a school of thought that one programming language could be used for all purposes in developing a solution. For example, if I am a Java programmer, I must be able to do not only my business logic and presentation in Java, but also my persistence in Java, as well. A good goal, certainly, to have. But practicality has a wicked sense of humor and shows up at the table as SQL.

With sleeves drawn up, we, the Java programmers, start to tackle the first citizen of the SQL, called select. Very quickly, we meet these quirky things called joins. Our original suspicion that we should be sticking to Java is quickly reinforced as the syntax and oddities of joins put us on the defense and suddenly seem to show Java in a shining light. But the task at hand is unwavering and your boss is quite certain that it can't be all that hard to write a few SQL statements.

The story of one such journey from a simple select to a practical, everyday select is presented here in a set of 11 principles. I am hoping this article will be a good companion to non-relational programming languages such as Java, C#, and Perl. The provided examples use SQL Server syntax, but should be applicable to most databases.

The Summary of the 11 Principles

  • Principle 1. You can join tables.
  • Principle 2. You can join tables with different weights using left outer joins.
  • Principle 3. You can outer join a composite table which itself is an equi-join.
  • Principle 4. An outer join of two equi-joins is two outer joins combined.
  • Principle 5. Once you outer join a table, any joins that include the fields of that table needs to be outer joined as well.
  • Principle 6. It is usually not necessary to have a sub-select in the from clause.
  • Principle 7. You use a select to retrieve primary "entities" and not a general conglomeration of columns.
  • Principle 8. You use joins to add additional properties to a primary entity.
  • Principle 9. You can use functions to add additional properties to a primary entity.
  • Principle 10. You can conditionally add properties to a primary entity.
  • Principle 11. You can add properties to a primary entity using a sub-select.

Related Reading

SQL Pocket Guide
By Jonathan Gennick

Principles 1 through 6 deal with how to use the join construct to solve practical issues in retrieving data from relational databases. A join is a general-purpose mechanism (almost like a mathematical tool) for data retrieval. It does not prescribe specific patterns of usage. These principles outlined here put some boundaries on the join and show how to use the join in a restrictive patternized sense. This is similar to saying that a letter called "Q" exists in the English language, but is almost always followed by a "u" in usage. Similarly, it is important to recognize these join patterns aside from their syntax. These six principles show you when to use outer joins and the implication of using outer joins on the where clause. In the process, you will get to know some surprises and patterns.

The second set of principles, 8 through 10, deal with the general question of patternizing a select statement. From an OO perspective, the select statement is seen as a vehicle for retrieving objects that are called "primary entities."

Together, these principles will form a good foundation for retrieving data from relational databases.

Principle 1. You can join tables.

Database tables are quite simple in their intent. They are no different from their paper cousins, where we use them a lot. In fact, we learn about tabulated data in the fourth and fifth grades. When they are viewed as such, databases are simple to understand. This is an important insight in learning. When you see things for what they really are (as the saying goes, "keeping it real"), the element of fear is taken out of the problem and we begin to learn.

I digress, but I want to tell you an interesting insight about this idea of knowing things for what they are. During the last passage rites of a person in my part of the world, the dear ones of the departed carry with them to the site a pair of mud pots in a wooden sling to carry offerings. Once the rites are complete, these things get abandoned at the site at times. There is a large sense of fear attached to these two objects any context, auspicious or not, as death is equated to the pair. Discounting this, a famous balladeer pronounces "Listen, the sling a piece of wood, and the pots of mud, if you were to perceive the truth."

When information is kept in more than one table, and when those two tables express an idea about a common item, it is a common practice to join the two tables to retrieve information about that item. The join is necessary because we need information about that item, but the information is kept in two tables.

Following that idea, consider six tables: t1,t2,t3,t4,t5, and t6. Let use see how we are told to retrieve information from all of these tables.

Select *
>From 	t1
	,t2
	,t3
	,t4
	,t5
	,t6
Where (any clause involving t1 through t6)
	ex: t1.id=10 and t2.id=10 and t3.id=10 and t4.id=10 and t5.id=10 and t6.id=10

This is called an equi-join. There is equal weight for every table. If the joined tables don't have a matching row, those rows from both tables will be eliminated. When the rows match, the rows from both tables are combined and returned. This is the simplest of joins. There is no complex ANSI syntax, where you indicate if this is an outer join, left outer join, right outer join, etc.

The above statement is saying that there is an item called 10 in all of the tables. It retrieves all of the information pertaining to the item 10 that exists in all of the tables. The behavior of this default join is somewhat unintuitive. If t6 doesn't have an item called 10, then the join won't give out any information at all from the other tables. You have to resort to something called an outer join, where you tell the join that, after all, t6 is not that important.

Principle 2. You can join tables with different weights using left outer joins.

The six tables: t1,t2,t3,t4,t5,t6
Assume: 
T1 
    t2
    t3
T4
T5
T6

Consider the above relationship, where t2 and t3 are independent of each other but both are dependent on t1 and will give t1 more weight. Meaning that if there is a row in t1, give it out, irrespective of t2 and t3. This is represented in SQL Server as:

Select *
>From 	t1 
	left outer join t2 on t2.f1=t1.f1 and t2.f2 = t1.f2
	left outer join t3 on t3.f3=t1.f3
	,t4
	,t5
	,t6
Where (any clause involving t1 through t6)

This code is demonstrative of a couple of things. You can get the basics of the syntax for left outer joins. In the example, t1 is left outer joined with t2 and t3. You can also see how you can combine some left outer joins and some regular joins in the same select statement. In the example, t1 (having already outer joined) is equi-joined with t4, t5, and t6.

Principle 3. You can outer join a composite table which itself is an equi-join.

The previous example treated t2 and t3 at par and individually outer joined to t1. What if you want to narrow the rows in t2 first based on t3, and then outer join the result to t1? The following snippet illustrates this relationship.

The six tables: t1,t2,t3,t4,t5,t6
Assume: 
T1 
    t2
    	t3
T4
T5
T6

That above requirement necessitates that t2 and t3 are equi-joined first. This would have eliminated any rows in t2 based on t3. The result is then outer joined to t1. This means we are treating t2 as an optional data source for additional needs and not a mandatory data source. Subsequently t1 is equi-joined to t4, t5, and t6. This can be done in SQL Server as follows:

Select *
>From 	t1 
	left outer join 
		(select * 
		from t2,t3
		where (any where clause involving t1, t2, t3)
		) t2_t3_composite_table on t1_t2_composite_table.f1 = t1.f1
	,t4
	,t5
	,t6
Where (any clause involving t1, t2_te_composite_table, t4, t5, t6)

Additionally, this code demonstrates the following:

  1. Left outer join syntax.
  2. Using an inner select as the target of a left outer join.

Principle 4. An outer join of two equi joins is two outer joins combined.

Let us think about the above example for a second and see what necessitated the inner select. If t1 and t2 are to be outer joined, we can say that rows from t1 will be displayed irrespective of t2. If we are to equi-join t3 (as t3 is known to depend on t2), then t1 will not display any rows, even though we mentioned that t2 is not important. This is because in the next line we said that t3 is important. So the crux is: how can t3 be important when t3 depends on t2 and t2 is not important? This situation led us to the inner select where we hashed out t2 and t3 dependencies first, and then joined the results to t1.

The issue with the above is that the syntax is a bit clumsy, as you now have to invent an intermediate table and an intermediate select, not to mention the side effects on the optimizers (which may be positive or negative). But after some thought, we can avoid the inner select by telling t1 that both t2 and t3 are unimportant. This is because if t2 is unimportant, then t3 is automatically unimportant, as it is dependent on t2.

This is similar to an algebraic operation such as -(a+b) equals -a and -b. So we can rewrite the above SQL as follows:

Select *
>From 	t1 
	left outer join t2 on t2.f1=t1.f1 and t2.f2 = t1.f2
	left outer join t3 on t3.f3=t1.f3
	,t4
	,t5
	,t6
Where (any clause involving t1 through t6)

Notice that this is similar to the solution of Principle 2. This code also demonstrates the following:

  1. Shows the syntax for two left outer joins.
  2. Shows how an inner select could be avoided with two outer joins.

Principle 5. Once you outer join a table, any joins that include the fields of that table needs to be outer joined, as well.

The above principle of "propagating unimportance" will lead us to this principle as well. This principle guards against some obvious errors in select statements that involve outer joins. Once you outer join a table, then the field comparisons on these fields with other tables need to be outer joined as well.

Otherwise, your intention of outer joining (or less emphasizing) the table will be lost. Because when that table returns a null, the subsequent equi-joins on that field will fail and will result in removing the rows from your primary table.

Principle 6. It is usually not necessary to have a sub-select in the from clause.

Following Principle 4, these inner select joins can be accomplished by listing those tables directly outside. As mentioned, as -(a+b) becomes -a-b, you can flatten the join structure. This is based on a quick, intuitive conclusion. I might be wrong, but my suspicion is that I may not be that far off. What I am not so sure about is what the implications to optimization are; i.e., whether an inner select or an outer select is better, from a performance perspective.

Principle 7. You use a select to retrieve entities and not a general conglomeration of columns.

For example, you can use a select to retrieve orders. You can also use a select to retrieve parts. But you rarely go to retrieve orders, parts, and invoices at the same time. Even when you do, there is always a primary entity you are focusing on. When you retrieve orders, you may retrieve parts that belong to each order, but still, the focus is an order. Similarly, when you retrieve parts, you have an order ID associated with each part, but the focus is still a part in this case. Such an entity of focus in a select statement is called the primary table.

Principle 8. You use joins to add additional properties to a primary entity.

You can do joins with other tables to provide additional columns. Typically, these other tables will have a one-to-one relationship with the primary table. When they have a one-to-many relationship with the primary entity, you may decide to use "derived one-to-one" relationships. For example, an order has many parts; that is a one-to-many relationship. When you are retrieving a list of orders, it doesn't make sense to join with a parts table. But when you do, you may want to know some aggregate properties of parts belonging to an order. For example, the following are "derived one-to-one" relationships:

1. The total number of parts in a given order
2. The total cost of parts in a given order
3. The 1 and 2 are examples of aggregate functions
4. The first part in an order
5. The last part in an order

Principle 9. You can use functions to add additional properties to a primary entity.

In the following example, a function is used to add a derived additional property to the primary entity.

Select
	col1 as col1,
	my_function(col1,col2,col3) as my_derived_column
From
	table1

Inside of the function, you can use complex logic (including many selects, joins, etc.) to arrive at a value that gets returned. Functions are quite useful when the total number of rows returned are in the tens, as opposed to the hundreds. This is because a function gets called for every row returned; when your result set is large, you are better of doing joins where you can. When your result set is small, you may be able to minimize joins using functions.

Principle 10. You can conditionally add properties to a primary entity.

Select
	pt.column1,
	case when pt.conditional_column is null 
		then st1.column1
		else st2.column1 
	end as column2,
	pt.column2
From
	primary_table pt,
	secondary_table1 st1,
	secondary_table2 st2
Where 1=1
	and (additional where clause)

The example demonstrates the following:

  1. The syntax for the case statement.
  2. How to derive a column from two tables, based on a condition.

This pattern could be useful when you manage information in two tables, based on a certain attribute. This may not be a good data model example, but you will see databases where you may have to attempt this.

Principle 11. You can add properties to a primary entity using a sub-select.

Select
	pt.column1,
	(select col1 from secondary_table where col2=pt.column2) as derived_column2
	pt.column3
From
	primary_table pt
Where 1=1
	and (additional where clause)

Following the functional approach for columns (where applicable), you can also use a simple sub-select to retrieve a column. The drawback is that you can only return one column for this inner select. Technically, there is no reason why you can't imagine such a thing. For example, one can do this in an app server quite easily. Where possible, such an aggregation will save multiple inner select calls for each outer row.

Instead, you can also join the additional tables to get more columns; this option is always there. But the whole intent of using functions or sub-selects is to minimize the pressures on joins when the rows retrieved are small.

Summary

Programming and effectively using relational databases is an important ingredient in successfully implementing web-based systems. The malleable and transparent nature of databases provide a powerful paradigm for rapid development of front-end systems. The key to harnessing this power depends on effectively using select statements to mine relational databases. A join construct is an important element in this arsenal. Recognizing well known patterns involving outer joins will form a ground-level vocabulary for your further explorations of your data. This article examined syntax, surprises, and rules of thumb with respect to the usage of joins. The article also explored the primary motive behind a select, which, in my mind, is the retrieval of malleable and extensible objects. Knowing these elements should improve productivity as now, one can understand the systems from their front ends to their database back ends.

Further References

Satya Komatineni is the CTO at Indent, Inc. and the author of Aspire, an open source web development RAD tool for J2EE/XML.


Return to ONJava.com.