Running Totals in an SQL Query
March 2, 2011 Skip Marchesani
There are situations when running an SQL query for data analysis where I like to generate a running total based on a column in the table that is the target of the query, to assist with the data analysis. “What’s a running total?” you ask. Wikipedia and I define it as the summation of a sequence of numbers that is updated each time a new number is added to the sequence, simply by adding the value of the new number to the running total. To explain how to generate a running total, I use an example based on a department master table, Dept_Master, with the following format. Column Name Description Dpt_Nbr Department Number Dpt_Name Department Name Emp_Count Department Employee Count For the purposes of our discussion, the Dept_Master table has the following rows, listed in arrival sequence for the table. Dpt_Nbr Dpt_Name Emp_Count 901 Accounting 3 911 Sales 11 977 Manufact 27 990 Maint 7 For the example, I generate a running total based on the employee count for each department, and see the result set in descending sequence based on the employee count as shown below. Dpt_Nbr Dpt_Name Emp_Count Running_Total 977 Manufact 27 27 911 Sales 11 38 990 Maint 7 45 901 Accounting 3 48 To generate the running total, the employee count column for each row has to be added to the running total column from the preceding row, while processing the rows in the desired sequence. In this example the rows will be processed in descending sequence based on employee count. The first row to be processed is department number 977 with an employee count of 27. Since this is the first row in the sequence, there is no preceding row and the running total has an initial value of zero. Therefore the running total for department number 977 is 27 (0 + 27 = 27). The second row of the result set is department number 911. Since it has an employee count of 11, the running total is 38. This is the sum of the running total from the preceding row of the result set (department number 977) and the employee count of department number 911 (27 + 11 = 38). The third row in the result set is department number 990 and will have a running total of 45 (38 + 7 = 45). The fourth and last row in the result set is department 901 and will have a running total of 48 (45 + 3 = 48). Another way to look at this, and the way I look at it from an SQL perspective, is that the running total is the sum of the target column of the current row being processed and all those rows that precede the current row, processing the rows in the desired sequence. Since in this example we are calculating the running total based on the employee count in descending sequence, the running total for department number 977 is calculated as 0 + 27 = 27 (zero since no row preceded it), department 911 is calculated as 27 + 11 = 38, department 990 is calculated as 27 + 11 + 7 = 45, and department 901 is calculated as 27 + 11 + 7 + 3 = 48. The question is how to format an SQL SELECT statement to do the above. The answer is to use a scalar subselect in the column list of the SELECT statement to calculate or derive the running total for each row as shown below. SELECT dpt_nbr, dpt_name, emp_count, (SELECT SUM(emp_count) FROM dept_master b WHERE b.emp_count >= a.emp_count) AS running_total FROM dept_master a ORDER BY emp_count DESC; The scalar subselect is the SELECT statement enclosed in parentheses immediately following “emp_count” in the column list for the SELECT. The AS operator that follows the scalar subselect assigns the name “running_total” to the calculated or derived column. When executed this SQL syntax returns the desired result set. Dpt_Nbr Dpt_Name Emp_Count Running_Total 977 Manufact 27 27 911 Sales 11 38 990 Maint 7 45 901 Accounting 3 48 The scalar subselect derives the running total for the row being processed by summing the employee count for any row where the employee count (b.emp_count) is greater than or equal to the employee count for the current row (a.emp_count). (SELECT SUM(emp_count) FROM dept_master b WHERE b.emp_count >= a.emp_count) AS running_total In this example the result set with the running total is based on employee count in descending sequence. How would this syntax be changed if we wanted the sequence to be changed to ascending sequence? The answer is simple. The greater than or equal to argument (>=) in the WHERE clause for the scalar subselect is changed to a less than or equal argument (<=). The syntax for the SELECT statement then becomes: SELECT dpt_nbr, dpt_name, emp_count, (SELECT SUM(emp_count) FROM dept_master b WHERE b.emp_count <= a.emp_count) AS running_total FROM dept_master a ORDER BY emp_count DESC And the result set is: Dpt_Nbr Dpt_Name Emp_Count Running_Total 901 Accounting 3 3 990 Maint 7 10 911 Sales 11 21 977 Manufact 27 48 There are a couple of things to understand about running totals to get them to work correctly in an SQL Query. Results will be unpredictable and not correct if: 1. The column name(s) used in the WHERE clause for the scalar subselect are not the same as those used in the ORDER by clause for the outer or primary SELECT statement. If the names are different, the data contained in the columns must be based on the same thing. In other words don’t mix apples and oranges in the WHERE clause for the scalar subselect and the ORDER BY clause for the outer or primary SELECT statement. For example you can’t use “emp_count” in the WHERE clause and “dpt_name” in the ORDER BY clause. 2. The argument of the WHERE clause must match the sequence in the ORDER BY clause. This means a WHERE clause argument of <= (less than or equal) is used in conjunction with an ascending sequence in the ORDER BY clause (ascending is the default sequence for the ORDER BY clause), and a WHERE clause of >= (greater than or equal) is used in conjunction with a descending sequence in the ORDER BY clause (ORDER BY xxxx DESC). This technique works very nicely and is easy to use when the target of the running total is an existing column in a table or view, as opposed to a derived column. If you should choose to try and use it in conjunction with a derived column, the complexity of the syntax for the scalar subselect increases with and is proportional to the complexity of the expression that is used to for the column derivation. Speaking from experience, when using a derived column, the syntax for the scalar subselect becomes very complex very quickly. Understanding these considerations is the key to putting running totals to work for you. Skip Marchesani retired from IBM after 30 years and is now a consultant with Custom Systems Corporation. He is also a founding partner of System i Developer and the RPG & DB2 Summit. Skip spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400 and was involved with the development of the AS/400. He was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Skip is recognized as an industry expert on DB2 for i and the author of the book DB2/400: The New AS/400 Database. He specializes in providing customized education for any area of the System i, iSeries, and AS/400; does database design and design reviews; and performs general System i, iSeries, and AS/400 consulting for interested clients. He has been a speaker for user groups, technical conferences, and System i, iSeries, and AS/400 audiences around the world. He is an award-winning COMMON speaker and has received its Distinguished Service Award. Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.
|