Using UNION for Data Analysis

March 24, 2010 Skip Marchesani

To do a quick review of my previous article, UNION is very useful when an SQL statement or query must operate on two or more tables and JOIN cannot be used to produce the desired result set table. For example, if you have multiple history tables–one for each specific time period (year, month, week, etc.), and their record formats are similar and compatible, a union is a good way to query and combine two or more of these tables to derive a single, final result set table using SQL.

Several months ago, a friend sent me the spreadsheet shown below. I’ve included only the beginning and end of spreadsheet. Creating the spreadsheet manually was taking a lot of time, and he wanted to know if he could use SQL to automate the creation.

Country         State   2006    2007    2008   2009     Total   Percent
USA               	CA    400     739      728    758      2625	7.7886
USA               	IL    534      87     1466    239      2326	6.9014
USA               	NY    858     838      228    169      2093	6.2101
…
Virgin Islands 	        0       0        0      1	      1	0.0029
Yemen	                 0       0        0      1	      1	0.0029

The purpose of this spreadsheet is to show an analysis of the total number of sales or order transactions (not dollar amounts) by state within country; and for each country/state combination show the four previous years (in this case 2006, 2007, 2008, and 2009), the total number of orders for the country/state combination, and the percentage of all orders.

There were four separate history tables that had to be analyzed–one for each year (2006, 2007, 2008, and 2009), which my friend had included when he sent the spreadsheet. There was one row in each table that summarized each distinct customer order, and customer country and state were part of the contact information for each order.

My first thought was: “No way!”

But always enjoying a good challenge, I decided to give it a try, and was able to do it by using UNION in conjunction with the derived table capability in DB2 for i. I used Run SQL Scripts, my SQL interface of choice for this type of experimentation, and I am currently running V6R1 of Navigator in conjunction with an IBM i running V5R4 of i5/OS.

After a couple of false starts, step one was to cobble together the following SQL query (i.e., SELECT statement) that had a final result set table in the format I had determined I needed. This query returns one row with a count for each country/state/year combination in the final result set table.

I used UNION ALL to combine the four history tables because all rows in the intermediate result set tables for the four history tables, including duplicates, needed to be included in the final result set table. A subset of the final result set table is shown after the SQL query.

SELECT contry, state, COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
       	FROM orders2006
  GROUP BY contry, state
             UNION ALL
    SELECT contry, state, 0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
           	FROM orders2007
  GROUP BY contry, state 
             UNION ALL
    SELECT contry, state, 0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
           	FROM orders2008
  GROUP BY contry, state
             UNION ALL
    SELECT contry, state, 0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
           	FROM orders2009
  GROUP BY contry, state
        ORDER BY contry, state

Subset of the final result set table for above query:

Country	State	S06      S07	S08	S09
USA        CA     400        0        0        0
USA        CA       0      739        0        0
USA        CA       0        0      728        0
USA        CA       0        0        0      758

The above is a subset of the final result set for the SQL query preceding it, and shows only data for the state of California in the U.S.A. (there are many more rows for various country/state combinations). Even though each row has a column for the number of orders for each year, the orders for each year are in a separate row. There is one row each for history table year–2006, 2007, 2008, and 2009–and where there is a total for one specific year and the rest of the totals for the other three years in that row are zero.

There are four SELECT statements in the SQL query–one for each year–and they are combined by the three UNION ALL clauses. Each SELECT statement has CONTRY, STATE, and a column for each year (S06 = 2001, S07 = 2007, S08 = 2008, and S09 = 2009) in the select list; but the number of orders for only one year is summarized (COUNT(*) and GROUP BY) in each of the four SELECT statements (one SELECT statement for 2006, one for 2007, one for 2008, and one for 2009).

By formatting the four SELECT statements in this manner, I got the row format that I needed, and the result set table for each of the SELECT statements has the same format (number and type of columns). This satisfies the UNION requirement that the result set table for the first SELECT statement have the same number of columns as the result set table for the second SELECT statement, which must have the same number of columns as the result set for the third SELECT statement, and so on.

The next step is to modify the SQL query so that the separate rows for the orders for each country/state/year combination are summarized into a single row for each country/state combination. This means that the total number of orders for each year will be in a separate column instead of a separate row, and final the result set table will look as follows for the state of California.

Country       State     2006     2007     2008     2009     Total
USA            CA        400      739      728      758      2625

I took advantage of the derived table capability in DB2 for i to accomplish the above. I used the SQL query from the first step (four SELECT statements and three UNION clauses) without the ORDER BY clause for the table derivation.

This SQL query with derived table is shown below, with the derived table portion of the SQL query annotated, and the first five rows of the final result set table following the SQL query. Note that the derived table statements directly follow the FROM clause for the SELECT statement and must include an AS at the end of the derived table statements to name the derived table–in this case it’s named ORDERS

SELECT contry AS Country, state, SUM(S06) AS "2006",
           SUM(S07) AS "2007", SUM(S08) AS "2008", SUM(S09) AS "2009",
           (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total
     FROM

-- Begin Derived Table

  (SELECT contry, state, COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
      FROM orders2006
    GROUP BY contry, state
        UNION ALL
   SELECT contry, state, 0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
      FROM orders2007
    GROUP BY contry, state
        UNION ALL
   SELECT contry, state, 0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
      FROM orders2008
    GROUP BY contry, state
       UNION ALL
   SELECT contry, state, 0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
      FROM orders2009
     GROUP BY contry, state)
	      
         AS orders

-- End Derived Table

     GROUP BY contry, state
     ORDER BY total DESC, contry, state;

The first five rows of the final result set table produced by this SQL query are shown below.

Country   State   2006   2007   2008   2009   Total
USA          CA    400    739    728    758    2625
USA          IL    534     87   1466    239    2326
USA          NY    858    838    228    169    2093
USA          FL    302    507    306    450    1565
USA          MA    711    414    142    258    1525

Conceptually the SQL query from the first step is used to derive or build a single use (temporary) table on the fly. The SELECT statement that begins prior to the derived table statements (shown below and reformatted to make it easier to read) queries and summarizes the rows in the derived table and then orders them in the desired sequence.

SELECT  contry AS Country, 
    state, 
    SUM(S06) AS "2006",
                SUM(S07) AS "2007", 
                SUM(S08) AS "2008", 
                SUM(S09) AS "2009", 
                (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total
     FROM
       + ------------------------------------------------ +
       | insert the derived table statements from the SQL |
       | query from step one                              |
       + ------------------------------------------------ +
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

This means that the final result set table contains one row for each country/state combination that contains the following seven columns: CONTRY a.k.a., COUNTRY, STATE, 2006, 2007, 2008, 2009 (the total number of orders for each year), and TOTAL (the total number of orders for all four years).

Note that in the SELECT statement the double quotes (“) are required around each numeric year to tell SQL that this is a column name and not a numeric literal.

The summarization for this SELECT statement by country and state, is done in the GROUP BY clause after the derived table statements. The ORDER BY clause following the GROUP BY clause provides the ordering criteria for the rows in the final result set table–total orders (in descending sequence), then country, then state (both in ascending sequence).

The last step is to calculate the percentage for the total orders for a specific country and state in relation to the total orders for all countries and states.

The mathematical formula to do this is:

Country and state order percentage =

(2006 orders + 2007 orders + 2008 orders + 2009 orders) for specific country and state

Divided by

(total 2006 orders + total 2007 orders + total 2008 orders + total 2009 orders) multiplied by 100

The SQL syntax for this formula when used within the previous SELECT statement follows below:

 ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09))
               /
((SELECT COUNT(*) FROM orders2006) +
  (SELECT COUNT(*) FROM orders2007) +
  (SELECT COUNT(*) FROM orders2008) +
  (SELECT COUNT(*) FROM orders2009)) * 100) AS percent

When inserted into the SELECT statement the revised syntax for the SELECT statement (still reformatted for readability) looks as follows:

SELECT  contry AS Country, 
    state, 
    SUM(S06) AS "2006",
                SUM(S07) AS "2007",
                SUM(S08) AS "2008",
                SUM(S09) AS "2009",
                (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total,
                ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09))
                       /
                ((SELECT COUNT(*) FROM orders2006) +
                 (SELECT COUNT(*) FROM orders2007) +
                 (SELECT COUNT(*) FROM orders2008) +
                 (SELECT COUNT(*) FROM orders2009)) * 100) AS percent
     FROM
	derived table statements - SQL query from step one
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

And, the entire SQL query, including the derived table statements looks as shown below, with the first five rows in the final result set table following the SQL Query.

SELECT  contry AS Country,
    state,
    SUM(S06) AS "2006",
                SUM(S07) AS "2007",
                SUM(S08) AS "2008",
                SUM(S09) AS "2009",
                (SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09)) AS total,
                ((SUM(S06) + SUM(S07) + SUM(S08) + SUM(S09))
                       /
                  ((SELECT COUNT(*) FROM orders2006) +
                   (SELECT COUNT(*) FROM orders2007) +
                   (SELECT COUNT(*) FROM orders2008) +
                   (SELECT COUNT(*) FROM orders2009)) * 100) AS percent
      FROM
-- Begin Derived Table 
  (SELECT contry, state, COUNT(*) AS S06, 0 AS S07, 0 AS S08, 0 AS S09
              FROM orders2006
           GROUP BY contry, state
                  UNION ALL
    SELECT contry, state, 0 AS S06, COUNT(*) AS S07, 0 AS S08, 0 AS S09
                 FROM orders2007
           GROUP BY contry, state 
                  UNION ALL
    SELECT contry, state, 0 AS S06, 0 AS S07, COUNT(*) AS S08, 0 AS S09
                 FROM orders2008
           GROUP BY contry, state
                  UNION ALL
    SELECT contry, state, 0 AS S06, 0 AS S07, 0 AS S08, COUNT(*) AS S09
                 FROM orders2009
           GROUP BY contry, state)
          AS orders
-- End Derived Table
     GROUP BY contry, state
     ORDER BY total DESC, contry, state

The first five rows of the final result set table produced by this SQL query are shown below.

Country   State   2006   2007   2008   2009   Total   Percent
USA         CA     400    739    728    758    2625    7.7886
USA         IL     534     87   1466    239    2326    6.9014
USA         NY     858    838    228    169    2093    6.2101
USA         FL     302    507    306    450    1565    4.6435
USA         MA     711    414    142    258    1525    4.5248

This is exactly the result set my friend was looking for. Since I developed this SQL query using Run SQL Scripts in Navigator, the SQL query can be saved as an SQL script, sent to my friend, and he can execute it when needed.

Better still, since he is also running V6R1 of Navigator, he also can use Run SQL Scripts to run the SQL query, and with a couple of clicks can save the final result set table as an Excel spreadsheet–exactly what my friend wanted to do.

How long did it take me to develop the SQL Query? My false starts were spread over a couple of days, but once I got headed in the right direction it only took me between one and two hours of experimentation from start finish to develop the correctly working SQL query.

Did UNION play a significant part in providing the solution to this SQL query requirement? The answer is YES. UNION provided the capability to combine the four order history tables into a single result set table, and in the process define the number and type of columns in the result set table to satisfy the requirements of the spreadsheet format. If UNION were not an option in DB2 for i, the solution could not have been provided using a single SQL query and would instead have required multiple SQL queries.

Skip Marchesani retired from IBM after 30 years and is now a consultant with Custom Systems Corporation. He is also a founding partner of System i Developer and the RPG & DB2 Summit. Skip spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400 and was involved with the development of the AS/400. He was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Skip is recognized as an industry expert on DB2 for i and the author of the book DB2/400: The New AS/400 Database. He specializes in providing customized education for any area of the System i, iSeries, and AS/400; does database design and design reviews; and performs general System i, iSeries, and AS/400 consulting for interested clients. He has been a speaker for user groups, technical conferences, and System i, iSeries, and AS/400 audiences around the world. He is an award-winning COMMON speaker and has received its Distinguished Service Award. Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.