The State of the UNION

March 17, 2010 Skip Marchesani

DB2 for i supports the concept of union, which creates or derives a single result set table by combining two other result set tables, each of which was derived from a SELECT statement or another UNION. (See the discussion on cascading unions at the end of this article). A union is very useful when an SQL query must operate on two or more tables where JOIN cannot be used to produce the desired result set table.

UNION is actually a clause for what is referred to in SQL terminology as a FULLSELECT. The combination of the first SELECT statement, followed by the UNION clause, followed by the second SELECT statement comprises the FULLSELECT.

The simplified SQL syntax for a FULLSELECT is:

First SELECT statement
	UNION clause
Second SELECT statement
	ORDER BY clause (optional)

DB2 for i supports four types of union clauses:

UNION (or UNION DISTINCT)
UNION ALL
EXCEPT DISTINCT
INTERSECT DISTINCT

UNION creates or derives a final single result set table by combining the result set table from the first SELECT statement with the result set table from the second SELECT statement, eliminating any duplicate rows in the final result set.

UNION ALL is the same as UNION except that all rows, including duplicates, from the first and second SELECT statements are included in the final result set table.

EXCEPT DISTINCT differs from UNION in that the final result set table consists of rows that only exist in the result set table for the first SELECT statement, eliminating any duplicate rows in the final result set.

INTERSECT DISTINCT differs from UNION in that the final result set table consists only of those rows that exist in the result set tables for both the first and second SELECT statements, eliminating any duplicate rows in the final result set.

To better understand how the four types of UNION work, let’s look as some examples.

We will use two tables: the employee master file for 2010 called EMP, and the employee history file for 2009 called EMPHIST. The rows and columns for these two tables are shown below.

EMP--Employee Master Table for 2010
Nbr	Nam		Dpt	 Sal
10	Ed        	911	7000
20	Heikki    	901	6000
30	John      	977	3200
40	Mike      	977	6500
50	Marcela   	911	7500
60	Frank     	990	6500

EMPHIST--Employee History Table for 2009
Nbr	Nam		Dpt	 Sal
10	Ed        	911	6800
30	John      	977	3000
60	Frank     	990	6300
70	Sam       	901	3200

Note that there are six employees in the table called EMP, four employees in the table called EMPHIST, with employee numbers 10, 30, and 60 appearing in both tables.

UNION

With UNION, the final result set table consists of the rows from the result set table from the first SELECT statement combined with rows from the result set table from the second SELECT statement, with any duplicate rows eliminated in the final result set.

Following is the syntax for UNION.

SELECT nbr, nam, dpt FROM emp
     UNION
SELECT nbr, nam, dpt FROM emphist
     ORDER BY nbr

And, the final result set table that is created by UNION contains the following rows.

Nbr	Nam		Dpt
10	Ed        	911
20	Heikki    	901
30	John      	977
40	Mike      	977
50	Marcela   	911
60	Frank     	990
70	Sam       	901

Note that there are a total of 10 rows in both tables (six in EMP and four in EMPHIST), but only seven rows in the final result set table. Since this is a UNION, and employee numbers 10, 30, and 60, exist in both the result set table for the first SELECT statement (SELECT . . . FROM emp) and second SELECT statement (SELECT . . . FROM emphist), the duplicate rows are eliminated and the rows for employee numbers 10, 30, and 60, only appear once in the final result set table.

Suppose we change the syntax for the UNION to include the SAL column. What happens?

SELECT nbr, nam, dpt, sal FROM emp
     UNION
SELECT nbr, nam, dpt, sal FROM emphist
     ORDER BY nbr

The final result set table has all 10 rows as shown below. Why?

Nbr	Nam		Dpt	 Sal
10	Ed        	911	7000
10	Ed        	911	6800
20	Heikki    	901	6000
30	John      	977	3200
30	John      	977	3000
40	Mike      	977	6500
50	Marcela   	911	7500
60	Frank     	990	6500
60	Frank     	990	6300
70	Sam       	901	3200

From a DB2 for i perspective, one row in the final result set is considered a duplicate if the value for each column of that row matches the value of the corresponding column of another row in the final result set.

Even though NBR, NAM, and DPT, are the same for three pair of rows, SAL is different in each row and therefore there are no rows with duplicate data.

UNION ALL

With UNION ALL, all rows from the result set tables from the first and second SELECT statements, including duplicates, are included in the final result set table.

Below is the syntax for UNION ALL. Note that SAL was included in the syntax for UNION ALL to aid in an example later in this discussion.

SELECT nbr, nam, dpt, sal FROM emp
     UNION ALL
SELECT nbr, nam, dpt, sal FROM emphist
     ORDER BY nbr

And, the final result set table that is created by UNION ALL contains the following rows.

Nbr	Nam		Dpt	 Sal
10	Ed        	911	7000
10	Ed        	911	6800
20	Heikki    	901	6000
30	John      	977	3200
30	John      	977	3000
40	Mike      	977	6500
50	Marcela   	911	7500
60	Frank     	990	6500
60	Frank     	990	6300
70	Sam       	901	3200

The final result set contains 10 rows in the final result set table. The table called EMP has six rows, the table called EMPHIST has four rows, and since UNION ALL includes all rows from both the first and second SELECT statements, 10 rows in the final result set table is what we expect to see. And yes, this is the same result set from the previous UNION example where SAL was included in both the first and second SELECT statements.

Note that without going back and looking at the salary in the respective tables, it is not easy to tell where each row in the final result set table came from and what year the salary references.

However, a simple modification to the UNION ALL syntax will allow us to identify the source of each row. We will add a literal named YEAR to identify the year at the beginning of the select list for both the first and second SELECT statements as shown below (2009 for EMPHIST and 2010 for EMP). Note that YEAR was also added as the secondary ordering criteria.

SELECT '2010' AS year, nbr, nam, dpt, sal FROM emp
     UNION ALL
SELECT '2009' AS year, nbr, nam, dpt, sal FROM emphist
     ORDER BY nbr, year

The UNION ALL final result set table then looks like the following, and it is then no problem to determine the table where each row originated.

Year    Nbr	Nam		Dpt	 Sal
2009	10	Ed        	911	6800
2010	10	Ed        	911	7000
2010	20	Heikki    	901	6000
2009	30	John      	977	3000
2010	30	John      	977	3200
2010	40	Mike      	977	6500
2010	50	Marcela   	911	7500
2009	60	Frank     	990	6300
2010	60	Frank     	990	6500
2009	70	Sam       	901	3200

EXCEPT DISTINCT

With EXCEPT DISTINCT, the final result set table consists of rows that only exist in the result set table for the first SELECT statement, and any duplicate rows in the final result set are eliminated.

Following below is the syntax for EXCEPT DISTINCT. Note that SAL is not included this time.

SELECT nbr, nam, dpt FROM emp
     EXCEPT DISTINCT
SELECT nbr, nam, dpt FROM emphist
     ORDER BY nbr

And, the final result set table that is created by EXCEPT DISTINCT contains the following rows.

Nbr	Nam		Dpt
20	Heikki    	901
40	Mike      	977
50	Marcela   	911

The result set table for the first SELECT statement has one row for employee numbers 10, 20, 30, 40, 50 and 60, in the EMP table. The result set table for the second SELECT statement has one row for employee numbers 10, 30, 60, and 70, from the EMPHIST table. The result set table for EXCEPT DISTICNT will have one row for each row found in the result set table for the first SELECT statement that is not in the result set table for the second SELECT statement. Since employee numbers 10, 30, and 60, are in the result set tables for both the first and second SELECT statements, only employee numbers 20, 40, and 50, are in the final result set table for the EXCEPT DISTINCT.

INTERSECT DISTINCT

With INTERSECT DISTINCT, the final result set table consists of only those rows that exist in the result set tables for both the first and second SELECT statements, and any duplicate rows in the final result set are eliminated.

Following below is the syntax for INTERSECT DISTINCT. Note that SAL is not included this time

SELECT nbr, nam, dpt FROM emp
     INTERSECT DISTINCT
SELECT nbr, nam, dpt FROM emphist
     ORDER BY nbr

And, the final result set table that is created by INTERSECT DISTINCT contains the following rows.

Nbr	Nam		Dpt
10	Ed        	911
30	John      	977
60	Frank     	990

The result set table for the first SELECT statement has one row for employee numbers 10, 20, 30, 40, 50, and 60, in the EMP table. The result set table for the second SELECT statement has one row for employee numbers 10, 30, 60, and 70, from the EMPHIST table. The result set table for INTERSECT DISTICNT will have one row for each row found in the result set table for both the first and second SELECT statements. Since employee numbers 10, 30, and 60, are in the result set tables for both the first and second SELECT statements, they are in the final result set table for the INTERSECT DISTINCT. In effect, this is the opposite of EXCEPT DISTINCT.

Cascading Unions

If more than two tables need to be combined, you can use a cascading union. The simplified syntax for a cascading union follows below.

	SELECT statement #1
		UNION clause #1
	SELECT statement #2
                        UNION clause #2
	SELECT statement #3	
		UNION clause #3
	…
		UNION clause #n-1
SELECT statement #n
ORDER BY clause (optional)

As stated in the beginning of this article, a union creates or derives a final, single result set table by combining two other result set tables, each of which were derived from a SELECT statement or another UNION clause.

To clarify this statement, in the above syntax for a cascading union the final result set table for UNION clause #1 becomes the first result set table for UNION clause #2. The final result set table for UNION clause #2 becomes the first result set table for UNION clause #3, and so on.

Union Considerations

The result set table for the first SELECT statement must have the same number of columns as the result set table for the second SELECT statement. From a positional perspective, the column names do not have to be the same, nor must their data attributes or types be identical. The only requirement is that data types and attributes must be compatible. That is, one numeric type is compatible with any other numeric type, and one character type is compatible with any other character type, etc.

From a positional perspective, if the column name in the result set table for the first SELECT statement is the same as the column name in the result set table for the second SELECT statement, the corresponding column in the final result set table will have that same name. If the column names are different, then the corresponding column in the final result set table will be named with the ordinal number of the position of that column in the final result set table.

Where To Use Union

As stated in the beginning of this article, a union is very useful when an SQL statement or query must operate on two or more tables where JOIN cannot be used to produce the desired result set table.

For example, if you have multiple history tables–one for each specific time period (year, month, week, etc.), and their record formats are similar and compatible, a union is a good way to query and combine two or more of these tables to derive a single, final result set table using SQL.

Specifically I use union to combine multiple tables to do some detailed data analysis, which will be the subject of my next tip.

Skip Marchesani retired from IBM after 30 years and is now a consultant with Custom Systems Corporation. He is also a founding partner of System i Developer and the RPG & DB2 Summit. Skip spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400 and was involved with the development of the AS/400. He was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Skip is recognized as an industry expert on DB2 for i and the author of the book DB2/400: The New AS/400 Database. He specializes in providing customized education for any area of the System i, iSeries, and AS/400; does database design and design reviews; and performs general System i, iSeries, and AS/400 consulting for interested clients. He has been a speaker for user groups, technical conferences, and System i, iSeries, and AS/400 audiences around the world. He is an award-winning COMMON speaker and has received its Distinguished Service Award. Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.

Table of Contents

Content archive

Recent Posts

Subscribe

Pages

Search