Running Totals in an SQL Query, Take 2
July 13, 2011 Skip Marchesani
My article, Running Totals in an SQL Query, described how to write an SQL SELECT statement that generated a running total for each row in the result set and provided an example of the SELECT statement syntax to do so. While the SELECT statement was syntactically correct, it had an omission and did not work correctly in a certain data situation. Several readers figured this out and were good enough to send an email note to make me aware of the situation. I very much appreciate the emails. I want readers to know that the article was a test to see how many were paying attention to detail, and all of you that sent emails alerting me to the issue passed the test. I wish. And if you believe that it was a test, I have a bridge for sale if you’re interested. Let’s review my running totals scenario so I can explain the omission and show how it can be easily fixed. I used an example based on a department master table called Dept_Master, which had the following format. Column Name Description Dpt_Nbr Department Number Dpt_Name Department Name Emp_Count Department Employee Count And, for the purposes of the discussion, the Dept_Master table contained the following rows listed in arrival sequence for the table. Dpt_Nbr Dpt_Name Emp_Count 901 Accounting 03 911 Sales 11 977 Manufact 27 990 Maint 07 The example that I used was to generate a running total based on the employee count for each department, and see the result set in descending sequence based on the employee count as shown below. Dpt_Nbr Dpt_Name Emp_Count Running_Total 977 Manufact 27 27 911 Sales 11 38 990 Maint 07 45 901 Accounting 03 48 The SELECT statement to do the above used a scalar subselect in the column list of the SELECT statement to calculate or derive the running total for each row as shown below. SELECT dpt_nbr, dpt_name, emp_count, (SELECT SUM(emp_count) FROM dept_master b WHERE b.emp_count >= a.emp_count) AS running_total FROM dept_master a ORDER BY emp_count DESC This SELECT statement is syntactically correct, but it breaks when the column on which the running total is based–in this case Emp_Count–contains identical values in two or more rows. This means that if the Dept_Master table in the example had two rows with the same Emp_Count, the SELECT statement shown above will not work correctly. To see what happens in this situation, let’s change the row for department number 990 to have an employee count of 11, which is the same as department number 911. The rows in the Dept_Master table would then look as follows: Dpt_Nbr Dpt_Name Emp_Count 901 Accounting 03 911 Sales 11 977 Manufact 27 990 Maint 11 If we execute the SELECT statement against this set of data, it returns the following result set. Dpt_Nbr pt_Name Emp_Count Running_Total 977 Manufact 27 27 911 Sales 11 49 990 Maint 11 49 901 Accounting 03 2 If we examine the result set, the running total for departments 911 and 990 are both 49, which is not correct. One of them should be 38 and the other 49. The reason that the SELECT statement does not work correctly is that the argument for the WHERE clause for the scalar subselect (WHERE b.emp_count >= a.emp_count) cannot handle duplicate employee counts in the rows for each department they way it is currently structured. The solution is simple and was provided by reader Anita C, who emailed to make me aware of the situation. My thanks to Anita! To resolve the issue, the argument for the WHERE clause in the scalar subselect needs to be structured so that it is unique for each row in Dept_Master. The way to do this is to combine Emp_Count with Dpt_Nbr by concatenating them together as shown in the following WHERE clause. Dpt_Nbr is a unique column in Dept_Master. WHERE digits(b.emp_count) || b.dpt_nbr >= digits(a.emp_count) || a.dpt_nbr) In addition to the change in the WHERE clause, the ORDER BY clause must have the same argument as the WHERE clause as shown below. Without the revised ORDER BY clause, the running total will not be in the correct sequence in the result set. ORDER BY digits(emp_count) || dpt_nbr In this specific case the DIGITS function is used with Emp_Count (in both the WHERE and ORDER BY clauses) to preserve the leading zero in the employee count of 03 for department number 901. If DIGITS is not used, DB2 will do an implicit cast of Emp_Count from numeric to character, the leading zero on the employee count of 03 will be dropped, the remaining three will be left justified, and the SELECT statement will not return the correct result set. The SELECT statement with the revised syntax looks like this: SELECT dpt_nbr, dpt_name, emp_count, (SELECT SUM(emp_count) FROM dept_master b WHERE DIGITS(b.emp_count) || b.dpt_nbr >= DIGITS(a.emp_count) || a.dpt_nbr) AS running_total FROM dept_master a ORDER BY DIGITS(emp_count) || dpt_nbr DESC And, the following would be the correct result set returned by the SELECT statement: Dpt_Nbr Dpt_Name Emp_Count Running_Total 977 Manufact 27 27 990 Maint 11 38 911 Sales 11 49 901 Accounting 03 52 The SELECT statement shown above, returns the result set in descending sequence of Emp_Count then Dpt_Nbr. To see the result set in ascending sequence simply change the greater than or equal to argument (>=) in the WHERE clause for the scalar subselect to be to a less than or equal argument (<=), and remove the DESC keyword at the end of the ORDER BY clause. There are a couple of things to understand about running totals to get them to work correctly in an SQL Query. Results will be unpredictable and not correct if: 1. The column name(s) used in the WHERE clause for the scalar subselect are not the same as those used in the ORDER by clause for the outer or primary SELECT statement. If the names are different, the data contained in the columns must be based on the same thing. In other words, don’t mix apples and oranges in the WHERE clause for the scalar subselect and the ORDER BY clause for the outer or primary SELECT statement. For example, you can’t use emp_count in the WHERE clause and dpt_name in the ORDER BY clause. 2. The argument of the WHERE clause must match the sequence in the ORDER BY clause. This means a WHERE clause argument of <= (less than or equal) is used in conjunction with an ascending sequence in the ORDER BY clause (ascending is the default sequence for the ORDER BY clause), and a WHERE clause of >= (greater than or equal) is used in conjunction with a descending sequence in the ORDER BY clause (ORDER BY xxxx DESC). This technique works very nicely and is easy to use when the target of the running total is an existing column in a table or view, as opposed to a derived column. If you should choose to try and use it in conjunction with a derived column, the complexity of the syntax for the scalar subselect increases with and is proportional to the complexity of the expression that is used for the column derivation. Speaking from experience, when using a derived column, the syntax for the scalar subselect becomes very complex very quickly. Understanding these considerations is the key to putting running totals to work for you. While reviewing this article, Ted Holt realized that the OLAP (On Line Analytical Processing) SQL functions that were announced in V5R4 and V6R1 could be also be used to generate running totals. The SQL OLAP functions RANK, DENSE_RANK, and ROLLUP were announced in V5R4; and CUBE, GROUPING, and GROUPING SETS were announced in V6R1. My thanks to Ted for providing this solution. WITH RankedData AS (SELECT dpt_nbr, dpt_name, emp_count, row_number() OVER (order by emp_count) AS Rank FROM dept_master) SELECT dpt_nbr, dpt_name, emp_count, SELECT SUM(emp_count) FROM RankedData AS b WHERE b.rank >= a.rank) AS Running_Total FROM RankedData AS a ORDER BY rank DESC Skip Marchesani retired from IBM after 30 years and decided to pursue a career as a consultant in the IBM i marketplace. He spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400, and was involved with the development of the AS/400. Skip was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Now recognized as an industry expert on DB2 for i (a.k.a. DB2/400), and author of the book “DB2/400: The New AS/400 Database,” Skip specializes in providing customized education for any area of the IBM i and AS/400, does database design and design reviews, and general IBM i, iSeries, and AS/400 consulting. He has been a speaker for user groups, technical conferences, and IBM i, iSeries, and AS/400 audiences around the world, and has received COMMON’s Distinguished Service Award. Skip Recently moved to Vermont and is now president of the Vermont Mid Range User Group (VTMUG). Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page. RELATED STORY Running Totals in an SQL Query
|