Using UNION for Data Analysis
March 24, 2010 Skip Marchesani
To do a quick review of my previous article, UNION is very useful when an SQL statement or query must operate on two or more tables and JOIN cannot be used to produce the desired result set table. For example, if you have multiple history tables, one for each specific time period (year, month, week, etc.), and their record formats are similar and compatible, a union is a good way to query and combine two or more of these tables to derive a single, final result set table using SQL.

Several months ago, a friend sent me the spreadsheet shown below. I’ve included only the beginning and end of the spreadsheet. Creating the spreadsheet manually was taking a lot of time, and he wanted to know if he could use SQL to automate the creation.

    Country         State  2006  2007  2008  2009  Total  Percent
    USA             CA      400   739   728   758   2625   7.7886
    USA             IL      534    87  1466   239   2326   6.9014
    USA             NY      858   838   228   169   2093   6.2101
    …
    Virgin Islands            0     0     0     1      1   0.0029
    Yemen                     0     0     0     1      1   0.0029

The purpose of this spreadsheet is to analyze the number of sales or order transactions (not dollar amounts) by state within country. For each country/state combination it shows the number of orders in each of the four previous years (in this case 2006, 2007, 2008, and 2009), the total number of orders for that combination, and that total as a percentage of all orders.

There were four separate history tables to be analyzed, one for each year (2006, 2007, 2008, and 2009), which my friend had included when he sent the spreadsheet. Each table held one row summarizing each distinct customer order, and the customer's country and state were part of the contact information for each order.

My first thought was: “No way!” But always enjoying a good challenge, I decided to give it a try, and was able to do it by using UNION in conjunction with the derived table capability in DB2 for i.
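For readers who want to follow along, one of the yearly history tables might look roughly like the sketch below. Only the names orders2006, contry, and state come from the queries in this article; the order_id column, the data types, and the three sample rows are assumptions made purely for illustration, and the statements should be run in a scratch schema rather than against real data.

```sql
-- Hypothetical shape of one yearly history table (one row per order).
-- order_id and all data types are assumed, not taken from the real tables.
CREATE TABLE orders2006 (
    order_id INTEGER     NOT NULL PRIMARY KEY,
    contry   VARCHAR(30) NOT NULL,
    state    CHAR(2)     -- assumed optional for countries without states
);

-- Made-up sample rows
INSERT INTO orders2006 (order_id, contry, state) VALUES (1, 'USA', 'CA');
INSERT INTO orders2006 (order_id, contry, state) VALUES (2, 'USA', 'CA');
INSERT INTO orders2006 (order_id, contry, state) VALUES (3, 'USA', 'NY');

-- Count the orders for each country/state combination in this one year
SELECT contry, state, COUNT(*) AS S06
  FROM orders2006
 GROUP BY contry, state
 ORDER BY contry, state;
```

With these three sample rows, the query returns two rows: USA/CA with a count of 2 and USA/NY with a count of 1.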
I used Run SQL Scripts, my SQL interface of choice for this type of experimentation, and I am currently running V6R1 of Navigator in conjunction with an IBM i running V5R4 of i5/OS.

After a couple of false starts, step one was to cobble together the following SQL query (i.e., SELECT statement), which produces a final result set table in the format I had determined I needed. This query returns one row with a count for each country/state/year combination. I used UNION ALL to combine the four history tables because all rows in the intermediate result set tables for the four history tables, including duplicates, needed to be included in the final result set table.

    SELECT contry, state,
           COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
      FROM orders2006
     GROUP BY contry, state
    UNION ALL
    SELECT contry, state,
           0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
      FROM orders2007
     GROUP BY contry, state
    UNION ALL
    SELECT contry, state,
           0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
      FROM orders2008
     GROUP BY contry, state
    UNION ALL
    SELECT contry, state,
           0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
      FROM orders2009
     GROUP BY contry, state
     ORDER BY contry, state

Subset of the final result set table for the above query:

    Country  State  S06  S07  S08  S09
    USA      CA     400    0    0    0
    USA      CA       0  739    0    0
    USA      CA       0    0  728    0
    USA      CA       0    0    0  758

This subset shows only the data for the state of California in the U.S.A. (there are many more rows for various country/state combinations). Even though each row has a column for the number of orders for each year, the orders for each year are in a separate row. There is one row for each history table year (2006, 2007, 2008, and 2009); each row holds the count for its own year and zeros for the other three years.
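The difference between UNION ALL and plain UNION can be seen with a very small experiment. The sketch below uses two throwaway tables, demo_a and demo_b, which are hypothetical names invented for this illustration and are not part of my friend's data.

```sql
-- Two hypothetical demo tables that each hold the same single row
CREATE TABLE demo_a (contry VARCHAR(30), state CHAR(2));
CREATE TABLE demo_b (contry VARCHAR(30), state CHAR(2));

INSERT INTO demo_a (contry, state) VALUES ('USA', 'CA');
INSERT INTO demo_b (contry, state) VALUES ('USA', 'CA');

-- UNION ALL keeps every row from both SELECT statements,
-- so this query returns two identical USA/CA rows
SELECT contry, state FROM demo_a
UNION ALL
SELECT contry, state FROM demo_b;

-- UNION removes duplicate rows from the combined result,
-- so this query returns a single USA/CA row
SELECT contry, state FROM demo_a
UNION
SELECT contry, state FROM demo_b;
```

Because UNION ALL simply appends the intermediate result sets, it also spares the database the work of checking for duplicates.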
There are four SELECT statements in the SQL query, one for each year, and they are combined by the three UNION ALL clauses. Each SELECT statement has CONTRY, STATE, and a column for each year (S06 = 2006, S07 = 2007, S08 = 2008, and S09 = 2009) in the select list, but the number of orders for only one year is summarized (COUNT(*) and GROUP BY) in each of the four SELECT statements (one SELECT statement for 2006, one for 2007, one for 2008, and one for 2009). By formatting the four SELECT statements in this manner, I got the row format that I needed, and the result set table for each of the SELECT statements has the same format (number and type of columns). This satisfies the UNION requirement that the result set table for the first SELECT statement have the same number of columns as the result set table for the second SELECT statement, which must have the same number of columns as the result set table for the third SELECT statement, and so on.

The next step is to modify the SQL query so that the separate rows for the orders for each country/state/year combination are summarized into a single row for each country/state combination. This means that the total number of orders for each year will be in a separate column instead of a separate row, and the final result set table will look as follows for the state of California.

    Country  State  2006  2007  2008  2009  Total
    USA      CA      400   739   728   758   2625

I took advantage of the derived table capability in DB2 for i to accomplish the above. I used the SQL query from the first step (four SELECT statements and three UNION ALL clauses), without the ORDER BY clause, for the table derivation. This SQL query with derived table is shown below, with the derived table portion of the SQL query annotated, and the first five rows of the final result set table following the SQL query.
Note that the derived table statements directly follow the FROM clause of the SELECT statement and must be followed by a correlation name, introduced here with AS, that names the derived table. In this case it’s named ORDERS.

    SELECT contry AS Country, state,
           SUM(S06) AS "2006",
           SUM(S07) AS "2007",
           SUM(S08) AS "2008",
           SUM(S09) AS "2009",
           (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total
      FROM
    -- Begin Derived Table
          (SELECT contry, state,
                  COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
             FROM orders2006
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
             FROM orders2007
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
             FROM orders2008
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
             FROM orders2009
            GROUP BY contry, state) AS orders
    -- End Derived Table
     GROUP BY contry, state
     ORDER BY total DESC, contry, state;

The first five rows of the final result set table produced by this SQL query are shown below.

    Country  State  2006  2007  2008  2009  Total
    USA      CA      400   739   728   758   2625
    USA      IL      534    87  1466   239   2326
    USA      NY      858   838   228   169   2093
    USA      FL      302   507   306   450   1565
    USA      MA      711   414   142   258   1525

Conceptually, the SQL query from the first step is used to derive, or build, a single-use (temporary) table on the fly. The SELECT statement that begins prior to the derived table statements (shown below and reformatted to make it easier to read) queries and summarizes the rows in the derived table and then orders them in the desired sequence.
    SELECT contry AS Country, state,
           SUM(S06) AS "2006",
           SUM(S07) AS "2007",
           SUM(S08) AS "2008",
           SUM(S09) AS "2009",
           (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total
      FROM
          +--------------------------------------------------+
          | insert the derived table statements from the SQL |
          | query from step one                              |
          +--------------------------------------------------+
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

This means that the final result set table contains one row for each country/state combination, with the following seven columns: CONTRY (renamed COUNTRY), STATE, 2006, 2007, 2008, 2009 (the total number of orders for each year), and TOTAL (the total number of orders for all four years). Note that in the SELECT statement the double quotes (") are required around each numeric year to tell SQL that this is a column name and not a numeric literal.

The summarization by country and state for this SELECT statement is done in the GROUP BY clause after the derived table statements. The ORDER BY clause following the GROUP BY clause provides the ordering criteria for the rows in the final result set table: total orders (in descending sequence), then country, then state (both in ascending sequence).

The last step is to calculate the percentage for the total orders for a specific country and state in relation to the total orders for all countries and states.
The mathematical formula to do this is:

    Country and state order percentage =
        (2006 orders + 2007 orders + 2008 orders + 2009 orders) for the specific country and state
        divided by
        (total 2006 orders + total 2007 orders + total 2008 orders + total 2009 orders)
        multiplied by 100

The SQL syntax for this formula when used within the previous SELECT statement follows below:

    ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) /
     ((SELECT COUNT(*) FROM orders2006) +
      (SELECT COUNT(*) FROM orders2007) +
      (SELECT COUNT(*) FROM orders2008) +
      (SELECT COUNT(*) FROM orders2009)) * 100) AS percent

When inserted into the SELECT statement, the revised syntax for the SELECT statement (still reformatted for readability) looks as follows:

    SELECT contry AS Country, state,
           SUM(S06) AS "2006",
           SUM(S07) AS "2007",
           SUM(S08) AS "2008",
           SUM(S09) AS "2009",
           (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total,
           ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) /
            ((SELECT COUNT(*) FROM orders2006) +
             (SELECT COUNT(*) FROM orders2007) +
             (SELECT COUNT(*) FROM orders2008) +
             (SELECT COUNT(*) FROM orders2009)) * 100) AS percent
      FROM
          derived table statements - SQL query from step one
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

The entire SQL query, including the derived table statements, is shown below, with the first five rows of the final result set table following the SQL query.
    SELECT contry AS Country, state,
           SUM(S06) AS "2006",
           SUM(S07) AS "2007",
           SUM(S08) AS "2008",
           SUM(S09) AS "2009",
           (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total,
           ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) /
            ((SELECT COUNT(*) FROM orders2006) +
             (SELECT COUNT(*) FROM orders2007) +
             (SELECT COUNT(*) FROM orders2008) +
             (SELECT COUNT(*) FROM orders2009)) * 100) AS percent
      FROM
    -- Begin Derived Table
          (SELECT contry, state,
                  COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
             FROM orders2006
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
             FROM orders2007
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
             FROM orders2008
            GROUP BY contry, state
           UNION ALL
           SELECT contry, state,
                  0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
             FROM orders2009
            GROUP BY contry, state) AS orders
    -- End Derived Table
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

The first five rows of the final result set table produced by this SQL query are shown below.

    Country  State  2006  2007  2008  2009  Total  Percent
    USA      CA      400   739   728   758   2625   7.7886
    USA      IL      534    87  1466   239   2326   6.9014
    USA      NY      858   838   228   169   2093   6.2101
    USA      FL      302   507   306   450   1565   4.6435
    USA      MA      711   414   142   258   1525   4.5248

This is exactly the result set my friend was looking for. Since I developed this SQL query using Run SQL Scripts in Navigator, it can be saved as an SQL script and sent to my friend, who can execute it when needed. Better still, since he is also running V6R1 of Navigator, he can use Run SQL Scripts to run the SQL query and, with a couple of clicks, save the final result set table as an Excel spreadsheet, which is exactly what he wanted to do.

How long did it take me to develop the SQL query?
My false starts were spread over a couple of days, but once I got headed in the right direction it took me only between one and two hours of experimentation from start to finish to develop the correctly working SQL query.

Did UNION play a significant part in providing the solution to this SQL query requirement? The answer is YES. UNION provided the capability to combine the four order history tables into a single result set table, and in the process define the number and type of columns in the result set table to satisfy the requirements of the spreadsheet format. If UNION were not an option in DB2 for i, the solution could not have been provided using a single SQL query and would instead have required multiple SQL queries.

Skip Marchesani retired from IBM after 30 years and is now a consultant with Custom Systems Corporation. He is also a founding partner of System i Developer and the RPG & DB2 Summit. Skip spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400 and was involved with the development of the AS/400. He was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Skip is recognized as an industry expert on DB2 for i and the author of the book DB2/400: The New AS/400 Database. He specializes in providing customized education for any area of the System i, iSeries, and AS/400; does database design and design reviews; and performs general System i, iSeries, and AS/400 consulting for interested clients. He has been a speaker for user groups, technical conferences, and System i, iSeries, and AS/400 audiences around the world. He is an award-winning COMMON speaker and has received its Distinguished Service Award. Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.

RELATED STORIES

A Database Union is Not a Join