Guru: Emulate Control Break Reporting With DB2 For i SQL

May 13, 2019 Michael Sansoterra

When generating ad hoc user reports, it’s often the case that developers don’t have the time or luxury to write a high-level language program or to use a reporting tool such as DB2 Web Query or Crystal Reports to format the data appropriately. But when a user is given a raw dump of a SELECT statement’s result set in Excel, it can be quite ugly to wade through and format.

The user may have to do things like add total rows or remove duplicate values, which can be quite time consuming. Thankfully, DB2 for i SQL offers a way to alleviate these problems by formatting the result set in order to minimize the user’s work. NOTE: DB2 for i 7.3 is required for the demonstrated technique to remove duplicate values. The sample data is from the fictitious Microsoft AdventureWorks database which was ported to DB2 for i.

Take for example, a typical simple sales query like this one:

SELECT CAST(st.Name AS NVARCHAR(30)) AS Territory,
       COALESCE(CAST(s.Name AS NVARCHAR(60)),
          CAST(i.FirstName AS NVARCHAR(30))||' '||
          CAST(i.LastName AS NVARCHAR(30))) AS Customer,
       VARCHAR_FORMAT(soh.OrderDate,'YYYY-MM') AS OrderDate,
       SUM(sod.OrderQty*sod.UnitPrice) AS GrossSales
  FROM SalesOrderHeader soh
  JOIN Customer c ON c.CustomerId=soh.CustomerId
  JOIN SalesTerritory st ON st.TerritoryId=c.TerritoryId
  JOIN SalesOrderDetail sod ON sod.SalesOrderId=soh.SalesOrderId
  LEFT JOIN Store s ON s.BusinessEntityId=c.StoreId
  LEFT JOIN Person i ON i.BusinessEntityId=c.PersonId
WHERE soh.OrderDate BETWEEN '2007-10-01' AND '2007-11-15'
GROUP BY CAST(st.Name AS NVARCHAR(30)),
         COALESCE(CAST(s.Name AS NVARCHAR(60)),
           CAST(i.FirstName AS NVARCHAR(30))||' '||
           CAST(i.LastName AS NVARCHAR(30))),
         VARCHAR_FORMAT(soh.OrderDate,'YYYY-MM')
ORDER BY Territory,Customer,OrderDate

If you’re wondering about all of the seemingly unnecessary casts to NVARCHAR, it’s because the name columns are user-defined types (UDT) and they often require an explicit cast to participate in operations such as CONCAT.

In Excel, the abridged query result looks pretty busy. In Table 1 below we see an abridged result set as it would appear in Excel (with table formatting applied).

In newer versions of Excel, there are plenty of cool formatting features on the “data” menu such as “remove duplicates”, group, ungroup, and subtotal. However, it’s still a bit of work for the user to make the results look friendlier. Table 2 below we see a result set with duplicate values removed.

The user can be spared quite a bit of formatting grief by removing the duplicate values on their behalf in the result set (like shown above for Territory and Customer columns) by using the LAG function. Recall, LAG is a relatively new DB2 for i function (introduced in IBM i 7.3) and its purpose is to allow the database engine to identify a value from a prior row in the result set.

To evaluate whether a value has been duplicated for a particular column, DB2 needs to “peek” at the value in the previous row (and remember, the first row has no previous row so LAG will return NULL). If the current row value has the same value as the previous row, then a blank can be substituted for the real value.

This is easy to illustrate using the Territory column. The column expression below compares the current row’s territory with the prior row’s territory and if they are the same an empty string is returned, otherwise the territory value is returned:

CASE WHEN Territory=LAG(Territory,1) OVER(ORDER BY Territory) THEN '' 
ELSE Territory END AS Territory

The partial expression “LAG(Territory,1) OVER(ORDER BY Territory)” tells DB2 to get the Territory column from the first row prior to the current row. OVER(ORDER BY TERRITORY) instructs DB2 to sort the rows by Territory when evaluating the prior row. This sort instruction is required to make sure that DB2 for i will consistently and correctly evaluate the order of the rows when performing the lookup of a prior row.

The following query is just a revision of the prior query with the exception that the fields like territory, customer, etc., will return blanks if the prior row value for the column is repeated. This query was used to generate the previous sample result set (Table 2) without duplicate values:

WITH BASE_DATA AS (
SELECT CAST(st.Name AS NVARCHAR(30)) AS Territory,
       COALESCE(CAST(s.Name AS NVARCHAR(60)),
         CAST(i.FirstName AS NVARCHAR(30))||' '||
         CAST(i.LastName AS NVARCHAR(30))) AS Customer,
       VARCHAR_FORMAT(soh.OrderDate,'YYYY-MM') AS OrderDate,
       SUM(sod.OrderQty*sod.UnitPrice) AS GrossSales
  FROM SalesOrderHeader soh
  JOIN Customer c ON c.CustomerId=soh.CustomerId
  JOIN SalesTerritory st ON st.TerritoryId=c.TerritoryId
  JOIN SalesOrderDetail sod ON sod.SalesOrderId=soh.SalesOrderId
  LEFT JOIN Store s ON s.BusinessEntityId=c.StoreId
  LEFT JOIN Person i ON i.BusinessEntityId=c.PersonId
WHERE soh.OrderDate BETWEEN '2007-10-01' AND '2007-11-15'
GROUP BY CAST(st.Name AS NVARCHAR(30)),
         COALESCE(CAST(s.Name AS NVARCHAR(60)),CAST(i.FirstName AS NVARCHAR(30))||' '||CAST(i.LastName AS NVARCHAR(30))),
         VARCHAR_FORMAT(soh.OrderDate,'YYYY-MM')
ORDER BY Territory,Customer,OrderDate
)
SELECT CASE WHEN Territory=LAG(Territory,1) 
            OVER(ORDER BY Territory) THEN '' ELSE Territory END 
            AS Territory,
       CASE WHEN Customer=LAG(Customer,1) 
            OVER(PARTITION BY Territory ORDER BY Territory,Customer)
            THEN '' ELSE Customer END AS Customer,
       CASE WHEN OrderDate=LAG(OrderDate,1) 
            OVER(PARTITION BY Territory,Customer 
                 ORDER BY Territory,Customer,OrderDate) 
            THEN '' ELSE VARCHAR(OrderDate) END AS OrderDate,
       GrossSales
  FROM BASE_DATA b
ORDER BY b.Territory,b.Customer,b.OrderDate

The bolded CASE expressions demonstrate the technique to remove duplicate values using LAG. LAG is a handy tool as it allows an SQL developer to emulate the control break formatting that an RPG or COBOL program would implement. And speaking of control break programming, it typically also involves providing break level totals.

This SQL statement can be enhanced yet again to include total rows with control break formatting so that the user doesn’t have to be bothered. The final report can look like Table 3 below (bold formatting is mine to show the total lines).

Table 4 shown below is the end of the report with a final total for the last territory of the report (“United Kingdom”) and a Grand Total line.

How were these total lines slipped into the results using SQL? By using “WITH ROLLUP” in conjunction with GROUP BY. Recall that when using WITH ROLLUP, SQL will automatically add aggregate rows to the result set. The number of total rows added will depend on how many expressions are in the GROUP BY clause. In this example, there will be total rows added after each change in “Order Date”, Customer, and Territory.

The revised SQL that includes these total rows and also implements easy to read column names is shown below:

WITH BASE_DATA AS (
SELECT CAST(st.Name AS NVARCHAR(30)) AS Territory,
       COALESCE(CAST(s.Name AS NVARCHAR(60)),
CAST(i.FirstName AS NVARCHAR(30))||' '||
CAST(i.LastName AS NVARCHAR(30))) AS Customer,
       VARCHAR_FORMAT(soh.OrderDate,'YYYY-MM') AS OrderDate,
       sod.OrderQty*sod.UnitPrice AS GrossSales
  FROM SalesOrderHeader soh
  JOIN Customer c ON c.CustomerId=soh.CustomerId
  JOIN SalesTerritory st ON st.TerritoryId=c.TerritoryId
  JOIN SalesOrderDetail sod ON sod.SalesOrderId=soh.SalesOrderId
  LEFT JOIN Store s ON s.BusinessEntityId=c.StoreId
  LEFT JOIN Person i ON i.BusinessEntityId=c.PersonId
WHERE soh.OrderDate BETWEEN '2007-10-01' AND '2007-11-15'
),
SUMMARY_DATA AS (
SELECT ROW_NUMBER() OVER(ORDER BY Territory,Customer,OrderDate) AS ROW_ID,
       GROUPING(Territory) AS Is_Grand_Total,
       GROUPING(Customer)  AS Is_Territory_Total,
       GROUPING(OrderDate) AS Is_Customer_Total,
       Territory,Customer,OrderDate,SUM(GrossSales) AS GrossSales,
       COUNT(DISTINCT OrderDate) AS Number_Dates
  FROM BASE_DATA
GROUP BY Territory,Customer,OrderDate
WITH ROLLUP)
SELECT /*
       -- All Columns
       Territory,Customer,OrderDate,
       ROW_ID,Is_Grand_Total,Is_Territory_Total,Is_Customer_Total,
       */
       CASE WHEN Is_Grand_Total=1 THEN '*Grand Total:' ELSE 
           CASE WHEN Is_Territory_Total=1 THEN '*' || Territory || ' Total:' ELSE 
           CASE WHEN Territory=LAG(Territory,1) OVER(ORDER BY Territory) THEN '' ELSE Territory END END END AS "Territory",
       CASE WHEN Is_Customer_Total=1 AND Is_Territory_Total=0 THEN '*Total:' ELSE 
       CASE WHEN Customer=LAG(Customer,1) OVER(PARTITION BY Territory ORDER BY Territory,Customer) THEN '' ELSE Customer END END AS "Customer",
       CASE WHEN OrderDate=LAG(OrderDate,1) OVER(PARTITION BY Territory,Customer ORDER BY Territory,Customer,OrderDate) THEN '' ELSE VARCHAR(OrderDate) END AS "Order Date",
       GrossSales AS "Gross Sales"
  FROM SUMMARY_DATA s
 WHERE (IS_GRAND_TOTAL=0 AND IS_TERRITORY_TOTAL=0 AND IS_CUSTOMER_TOTAL=0)
    OR (NUMBER_DATES>1 OR IS_GRAND_TOTAL=1 OR IS_TERRITORY_TOTAL=1)
ORDER BY ROW_ID

I’ve highlighted and color coded the important changes to the SELECT statement from the previous version of the query.

Most importantly, WITH ROLLUP (black) was added to get DB2 for i to include total lines to the result set (see the references section for more information on ROLLUP).

In red, I’ve marked some columns to aid with understanding how the query works and to control how the total lines are displayed. The ROW_ID column just adds a sequential number to the result set that helps to verify the sorting isn’t messed up.

ROW_NUMBER() OVER(ORDER BY Territory,Customer,OrderDate) AS ROW_ID,
GROUPING(Territory) AS Is_Grand_Total,
GROUPING(Customer)  AS Is_Territory_Total,
GROUPING(OrderDate) AS Is_Customer_Total,

There are three columns that use the GROUPING function to indicate whether the row is an aggregate added by DB2 (via ROLLUP) or just a “normal” row from the result set. This function returns a 0 (normal query row) or a 1 (rollup aggregate row). Further, the function is passed a column name which is akin to finding out if a control break occurred on the column.

For instance, when GROUPING(TERRITORY) returns 1, this indicates DB2 for i added a row to the result set for the aggregate of the TERRITORY column. Since TERRITORY is the first column in the GROUP BY and GROSS SALES is being aggregated with the SUM function, adding a row at this level is really a “grand total”. When GROUPING(CUSTOMER) returns 1, this is a rollup row that DB2 for i added when the TERRITORY has changed.

For an RPG analogy, think of these GROUPING expressions like the level break indicators (L1-L9). If an RPG program uses L1-L3 for control break processing, when L3 goes on, L2 and L1 go on as well. Likewise, when GROUPING(CUSTOMER) returns a 1 (a.k.a. L2 break), GROUPING(ORDERDATE) will also return a 1 (a.k.a. L1 break).

Let’s go back to the OVER clause specified with the LAG function. In general, when removing duplicate values, the LAG function should use an OVER with a PARTITION BY with the column(s) for control break fields. Further, to insure the duplicates are removed on a control break style report, the LAG function’s ORDER BY should contain the same sort order that matches the SELECT statement’s outer ORDER BY.

Now, back to the SQL control break example. In the green section, the CASE statements now have more work to do for formatting. They will still show a blank for duplicated rows, but they will also show the appropriate total indicator for the aggregate rows DB2 has added:

CASE WHEN Is_Grand_Total=1 THEN '*Grand Total:' ELSE
    CASE WHEN Is_Territory_Total=1 THEN '*' || Territory || ' Total:' ELSE
    CASE WHEN Territory=LAG(Territory,1) OVER(ORDER BY Territory) THEN ''
ELSE Territory END END END AS "Territory",

I added asterisks in front of the total lines so that they would be easy to pick out when using the filter functionality offered with Excel tables.

The total lines are easy to identify in the Excel table filter when they are prefixed with an asterisk (*).

Finally, the orange section shows the technique I used to remove some annoying aggregate rows that DB2 added that I didn’t want to bother the user with. Most of the customer entries on the report only have one order in the given “Order Date” time frame. But when using WITH ROLLUP, DB2 will add a special aggregate row for each and every customer, even if they only have one “Order Date” entry. All of these aggregate rows make the report unwieldy, so I decided to remove the customer aggregate rows when the customer only has one “Order Date” on the report.

To accomplish this, I added a new column (not shown to the user) to track how many unique dates (i.e., number of rows) each customer has.

COUNT(DISTINCT OrderDate) AS Number_Dates

The WHERE predicate will allow all the “normal” (i.e., non aggregate rollup) rows to appear on the report:

WHERE (IS_GRAND_TOTAL=0 AND IS_TERRITORY_TOTAL=0 AND IS_CUSTOMER_TOTAL=0)

Moreover, the aggregate lines from the rollup will appear on the report for the grand total, territory total or customer total (when the NUMBER_DATES column is greater than one):

OR (NUMBER_DATES>1 OR IS_GRAND_TOTAL=1 OR IS_TERRITORY_TOTAL=1)

As one final touch to assist the user, the final column names were placed in double quotes so they are case-sensitive and can contain spaces (e.g. AS “Order Date”). This allows the user to see a nice heading like Order Date instead of ORDERDATE.

This control break programming with SQL can be a little hairy to understand. I ran the above query and included additional columns in the result set and saved it in a spreadsheet here, which you can review alongside the SQL to get a better idea of how all the pieces fit together.

In short, SQL is up to the task of doing complex formatting to remove duplicate values and to provide aggregate rows (a.k.a. control break) so that a user can be given a result set that will require minimal manipulation. Because the SQL can become complicated and cumbersome (even for a simple example like this one), it is usually better to use a reporting tool for these tasks, whenever possible. The LAG function provides an easy way to remove duplicate values from a result set. The ROLLUP feature will add aggregate rows (such as totals or averages) to the result set.

3 thoughts on “Guru: Emulate Control Break Reporting With DB2 For i SQL”

Ken says:

May 13, 2019 at 8:57 am

Michael, thank you for sharing. I did not know about lag. This will come in very handy. The count to remove the unwanted totals is really cool too.

Carlos Irigoyen says:

May 13, 2019 at 2:58 pm

Hi Michael: Excellent article, thank you. I would like to test some of the scripts, can you help with on telling me how to download the AdventuresWork Database and ported to IBM i Db2?. Yours truly. Carlos Irigoyen. carlosir@bellsouth.net

Rick says:

May 17, 2019 at 12:43 pm

Michael,
Excellent article. I reproduced it in SQL, took a while to change over example data I had. Then just for fun, I did it with Web Query in about 30-seconds. In any case, this is a very powerful example of cool stuff we can do with SQL. Thanks and keep ’em coming! Rick

This Issue Sponsored By

Table of Contents

Content archive

Recent Posts

Subscribe

Pages

Search