The State of the UNION
March 17, 2010 Skip Marchesani
DB2 for i supports the concept of union, which creates or derives a single result set table by combining two other result set tables, each of which was derived from a SELECT statement or another UNION. (See the discussion on cascading unions at the end of this article). A union is very useful when an SQL query must operate on two or more tables where JOIN cannot be used to produce the desired result set table. UNION is actually a clause for what is referred to in SQL terminology as a FULLSELECT. The combination of the first SELECT statement, followed by the UNION clause, followed by the second SELECT statement comprises the FULLSELECT. The simplified SQL syntax for a FULLSELECT is: First SELECT statement UNION clause Second SELECT statement ORDER BY clause (optional) DB2 for i supports four types of union clauses:
UNION creates or derives a final single result set table by combining the result set table from the first SELECT statement with the result set table from the second SELECT statement, eliminating any duplicate rows in the final result set. UNION ALL is the same as UNION except that all rows, including duplicates, from the first and second SELECT statements are included in the final result set table. EXCEPT DISTINCT differs from UNION in that the final result set table consists of rows that only exist in the result set table for the first SELECT statement, eliminating any duplicate rows in the final result set. INTERSECT DISTINCT differs from UNION in that the final result set table consists only of those rows that exist in the result set tables for both the first and second SELECT statements, eliminating any duplicate rows in the final result set. To better understand how the four types of UNION work, let’s look as some examples. We will use two tables: the employee master file for 2010 called EMP, and the employee history file for 2009 called EMPHIST. The rows and columns for these two tables are shown below. EMP--Employee Master Table for 2010 Nbr Nam Dpt Sal 10 Ed 911 7000 20 Heikki 901 6000 30 John 977 3200 40 Mike 977 6500 50 Marcela 911 7500 60 Frank 990 6500 EMPHIST--Employee History Table for 2009 Nbr Nam Dpt Sal 10 Ed 911 6800 30 John 977 3000 60 Frank 990 6300 70 Sam 901 3200 Note that there are six employees in the table called EMP, four employees in the table called EMPHIST, with employee numbers 10, 30, and 60 appearing in both tables. UNION With UNION, the final result set table consists of the rows from the result set table from the first SELECT statement combined with rows from the result set table from the second SELECT statement, with any duplicate rows eliminated in the final result set. Following is the syntax for UNION. SELECT nbr, nam, dpt FROM emp UNION SELECT nbr, nam, dpt FROM emphist ORDER BY nbr And, the final result set table that is created by UNION contains the following rows. Nbr Nam Dpt 10 Ed 911 20 Heikki 901 30 John 977 40 Mike 977 50 Marcela 911 60 Frank 990 70 Sam 901 Note that there are a total of 10 rows in both tables (six in EMP and four in EMPHIST), but only seven rows in the final result set table. Since this is a UNION, and employee numbers 10, 30, and 60, exist in both the result set table for the first SELECT statement (SELECT . . . FROM emp) and second SELECT statement (SELECT . . . FROM emphist), the duplicate rows are eliminated and the rows for employee numbers 10, 30, and 60, only appear once in the final result set table. Suppose we change the syntax for the UNION to include the SAL column. What happens? SELECT nbr, nam, dpt, sal FROM emp UNION SELECT nbr, nam, dpt, sal FROM emphist ORDER BY nbr The final result set table has all 10 rows as shown below. Why? Nbr Nam Dpt Sal 10 Ed 911 7000 10 Ed 911 6800 20 Heikki 901 6000 30 John 977 3200 30 John 977 3000 40 Mike 977 6500 50 Marcela 911 7500 60 Frank 990 6500 60 Frank 990 6300 70 Sam 901 3200 From a DB2 for i perspective, one row in the final result set is considered a duplicate if the value for each column of that row matches the value of the corresponding column of another row in the final result set. Even though NBR, NAM, and DPT, are the same for three pair of rows, SAL is different in each row and therefore there are no rows with duplicate data. UNION ALL With UNION ALL, all rows from the result set tables from the first and second SELECT statements, including duplicates, are included in the final result set table. Below is the syntax for UNION ALL. Note that SAL was included in the syntax for UNION ALL to aid in an example later in this discussion. SELECT nbr, nam, dpt, sal FROM emp UNION ALL SELECT nbr, nam, dpt, sal FROM emphist ORDER BY nbr And, the final result set table that is created by UNION ALL contains the following rows. Nbr Nam Dpt Sal 10 Ed 911 7000 10 Ed 911 6800 20 Heikki 901 6000 30 John 977 3200 30 John 977 3000 40 Mike 977 6500 50 Marcela 911 7500 60 Frank 990 6500 60 Frank 990 6300 70 Sam 901 3200 The final result set contains 10 rows in the final result set table. The table called EMP has six rows, the table called EMPHIST has four rows, and since UNION ALL includes all rows from both the first and second SELECT statements, 10 rows in the final result set table is what we expect to see. And yes, this is the same result set from the previous UNION example where SAL was included in both the first and second SELECT statements. Note that without going back and looking at the salary in the respective tables, it is not easy to tell where each row in the final result set table came from and what year the salary references. However, a simple modification to the UNION ALL syntax will allow us to identify the source of each row. We will add a literal named YEAR to identify the year at the beginning of the select list for both the first and second SELECT statements as shown below (2009 for EMPHIST and 2010 for EMP). Note that YEAR was also added as the secondary ordering criteria. SELECT '2010' AS year, nbr, nam, dpt, sal FROM emp UNION ALL SELECT '2009' AS year, nbr, nam, dpt, sal FROM emphist ORDER BY nbr, year The UNION ALL final result set table then looks like the following, and it is then no problem to determine the table where each row originated. Year Nbr Nam Dpt Sal 2009 10 Ed 911 6800 2010 10 Ed 911 7000 2010 20 Heikki 901 6000 2009 30 John 977 3000 2010 30 John 977 3200 2010 40 Mike 977 6500 2010 50 Marcela 911 7500 2009 60 Frank 990 6300 2010 60 Frank 990 6500 2009 70 Sam 901 3200 EXCEPT DISTINCT With EXCEPT DISTINCT, the final result set table consists of rows that only exist in the result set table for the first SELECT statement, and any duplicate rows in the final result set are eliminated. Following below is the syntax for EXCEPT DISTINCT. Note that SAL is not included this time. SELECT nbr, nam, dpt FROM emp EXCEPT DISTINCT SELECT nbr, nam, dpt FROM emphist ORDER BY nbr And, the final result set table that is created by EXCEPT DISTINCT contains the following rows. Nbr Nam Dpt 20 Heikki 901 40 Mike 977 50 Marcela 911 The result set table for the first SELECT statement has one row for employee numbers 10, 20, 30, 40, 50 and 60, in the EMP table. The result set table for the second SELECT statement has one row for employee numbers 10, 30, 60, and 70, from the EMPHIST table. The result set table for EXCEPT DISTICNT will have one row for each row found in the result set table for the first SELECT statement that is not in the result set table for the second SELECT statement. Since employee numbers 10, 30, and 60, are in the result set tables for both the first and second SELECT statements, only employee numbers 20, 40, and 50, are in the final result set table for the EXCEPT DISTINCT. INTERSECT DISTINCT With INTERSECT DISTINCT, the final result set table consists of only those rows that exist in the result set tables for both the first and second SELECT statements, and any duplicate rows in the final result set are eliminated. Following below is the syntax for INTERSECT DISTINCT. Note that SAL is not included this time SELECT nbr, nam, dpt FROM emp INTERSECT DISTINCT SELECT nbr, nam, dpt FROM emphist ORDER BY nbr And, the final result set table that is created by INTERSECT DISTINCT contains the following rows. Nbr Nam Dpt 10 Ed 911 30 John 977 60 Frank 990 The result set table for the first SELECT statement has one row for employee numbers 10, 20, 30, 40, 50, and 60, in the EMP table. The result set table for the second SELECT statement has one row for employee numbers 10, 30, 60, and 70, from the EMPHIST table. The result set table for INTERSECT DISTICNT will have one row for each row found in the result set table for both the first and second SELECT statements. Since employee numbers 10, 30, and 60, are in the result set tables for both the first and second SELECT statements, they are in the final result set table for the INTERSECT DISTINCT. In effect, this is the opposite of EXCEPT DISTINCT. Cascading Unions If more than two tables need to be combined, you can use a cascading union. The simplified syntax for a cascading union follows below. SELECT statement #1 UNION clause #1 SELECT statement #2 UNION clause #2 SELECT statement #3 UNION clause #3 … UNION clause #n-1 SELECT statement #n ORDER BY clause (optional) As stated in the beginning of this article, a union creates or derives a final, single result set table by combining two other result set tables, each of which were derived from a SELECT statement or another UNION clause. To clarify this statement, in the above syntax for a cascading union the final result set table for UNION clause #1 becomes the first result set table for UNION clause #2. The final result set table for UNION clause #2 becomes the first result set table for UNION clause #3, and so on. Union Considerations The result set table for the first SELECT statement must have the same number of columns as the result set table for the second SELECT statement. From a positional perspective, the column names do not have to be the same, nor must their data attributes or types be identical. The only requirement is that data types and attributes must be compatible. That is, one numeric type is compatible with any other numeric type, and one character type is compatible with any other character type, etc. From a positional perspective, if the column name in the result set table for the first SELECT statement is the same as the column name in the result set table for the second SELECT statement, the corresponding column in the final result set table will have that same name. If the column names are different, then the corresponding column in the final result set table will be named with the ordinal number of the position of that column in the final result set table. Where To Use Union As stated in the beginning of this article, a union is very useful when an SQL statement or query must operate on two or more tables where JOIN cannot be used to produce the desired result set table. For example, if you have multiple history tables–one for each specific time period (year, month, week, etc.), and their record formats are similar and compatible, a union is a good way to query and combine two or more of these tables to derive a single, final result set table using SQL. Specifically I use union to combine multiple tables to do some detailed data analysis, which will be the subject of my next tip. Skip Marchesani retired from IBM after 30 years and is now a consultant with Custom Systems Corporation. He is also a founding partner of System i Developer and the RPG & DB2 Summit. Skip spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400 and was involved with the development of the AS/400. He was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Skip is recognized as an industry expert on DB2 for i and the author of the book DB2/400: The New AS/400 Database. He specializes in providing customized education for any area of the System i, iSeries, and AS/400; does database design and design reviews; and performs general System i, iSeries, and AS/400 consulting for interested clients. He has been a speaker for user groups, technical conferences, and System i, iSeries, and AS/400 audiences around the world. He is an award-winning COMMON speaker and has received its Distinguished Service Award. Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page. RELATED STORIES A Database Union is Not a Join
|