New in V5R4: OLAP Ranking Specifications
June 14, 2006 Michael Sansoterra
This tip continues Four Hundred Guru‘s series on the fantastic SQL enhancements that iSeries, i5 and AS/400 users received in the latest release of OS/400, which is called i5/OS V5R4. In V5R4, IBM gave developers an impressive set of Online Analytical Processing (OLAP) tools to aid with common ranking and numbering issues. The significance of these OLAP tools lies in the amount of work they do and the time it will save database programmers. OLAP queries are generally concerned with providing answers to high level questions about large amounts of underlying data. (OLAP is also often associated with multidimensional data stores, but these OLAP functions only relate to the relational data stored in DB2/400.) As you will see, these new specifications can provide gobs of information in a single query that, in the past, would have taken several queries to write. The OLAP tools can be divided into numbering and ranking categories. As the categories imply, these OLAP features are used to assign a numbering scheme or ranking scheme to the rows in a result set. Each OLAP ranking is placed into an additional column of the result set. Because these specifications operate on the results of a query, they are only allowed in the SELECT list or ORDER BY clause. The data type of these OLAP functions is always BIGINT and will never contain a NULL. OLAP Numbering To start, the ROW_NUMBER is a huge blessing as it finally furnishes SQL with a built-in counter function. Here’s a trivial example that adds a row number column to a result set: Select ProductName, Row_Number() Over (Order By ProductName) As ProductNumber From Products Order By ProductName The first several rows look like this:
As evidenced by the query results, we can determine that the query is run first and then the OLAP function is invoked and assigned to a new column in the query results. Try this query without V5R4! In the past, hacks such as using the relative record number (RRN) function were used to attempt to get SQL to do this kind of counter. Let’s review the syntax of this OLAP specification: Row_Number() Over (Order By ProductName) ROW_NUMBER tells DB2 to create an incremental number column. The OVER keyword instructs DB2 what column or columns should be used to control the numbering order. In this case, the OLAP Order By expression is ProductName, which means the counter will be incremented and assigned to each row in product name order. In this case the OLAP ORDER BY matches the Select’s ORDER BY, although it doesn’t have to be the same. We could have had DB2 number the results by ProductName in descending order, by CategoryId or by some other column in the table. ROW_NUMBER can also be useful for loading data into a table that requires a counter field, as in the case of this ORDERITM table: CREATE TABLE ORDERITM ( OrderId Int Not Null, ItemSeq Int Not Null, ProductId Int, OrdQty Int, UnitPrice Dec(9,2), Constraint ORDERITMPK Primary Key (OrderId,ItemSeq)); If we need to import data into ORDERITM from an external source, such as another order entry system, then we can use ROW_NUMBER to fill in the item sequence (ItemSeq) column if the external source doesn’t provide one. Here’s a sample of how to accomplish this task: Insert Into OrderItm Select OrderNo, Row_Number() Over (Order By OrderNo), ItemNo, QtyOrdered, SalePrice From OrderImport In this scenario, ROW_NUMBER will continually increment the ItemSeq field over all orders that are being imported. For this behavior we might as well create ItemSeq as an identity column. It would be nice to reset the ItemSeq to “one” every time there was a change in the order number from the import table. This reset can be easily done using a partitioning expression, which tells DB2 how to reset the numbering by a change in value or values. (For the old RPG and COBOL programmers in the audience, this is similar to specifying a control break level.) Here is an example of what it would look like: Insert Into OrderItm Select OrderNo, Row_Number() Over (Partition By OrderNo Order By OrderNo,ItemNo), ItemNo, QtyOrdered, SalePrice From OrderImport In this example, the Row_Number functionality is provided but the Partition By expression tells Row_Number to start counting all over again whenever there is a change in the OrderNo column. This way every order inserted into the table will at least have one item sequence number one. As shown in the Row_Number’s Order By expression, the sequence numbers will be assigned to rows in OrderNo,ItemNo order (although OrderNo is superfluous since it is technically serves as a control break column). In the past, this type of numbering functionality could be implemented with an external UDF using the scratchpad option (see Scribble on SQL’s Scratchpad for an example), but this method suffered from the limitation of being limited to a forward only cursor. Further, if this type of counter function is partitioned to reset based on a sorted field then more problems would occur if DB2/400 didn’t feel like ordering the data before it processed the Select columns, which would often cause bogus results. (DB2 does not guarantee the sort order of rows coming into the SELECT. Usually a contrived ORDER BY and index on the base tables were required to get this type of counter function to work correctly in a complex query.) Another option was to use a sequence object but sequences aren’t allowed in QTEMP (for temporary options) and they often need to be reset between query runs resulting in additional programming work. Further, sequences can’t do the partitioning work that ROW_NUMBER can. All of this boring discourse about alternatives simply serves as a reminder of how much easier Row_Number will make our lives. OLAP Ranking In addition to numbering, two ranking options RANK and DENSE_RANK are available. These functions can be used to assign a rank to each row in a query’s result set. For example, in the query below, product sales are summarized in the ProductSummary Common Table Expression (CTE). The aggregates include quantity sold, extended price, discount amount, and number of unique orders for each product. Using RANK, we can simultaneously tag each product’s rank (compared to all other products) in a single statement: With ProductSummary As ( Select OD.ProductID,ProductName,SupplierID,CategoryID, Sum(Quantity) As Qty, Sum(Quantity * OD.UnitPrice * (1-Discount)) As ExtAmount, Sum(Quantity * OD.UnitPrice * Discount) As DiscAmount, Count(Distinct OrderID) As NoOrders From OrderDetails OD Left Join Products P On P.ProductID=OD.ProductID Group By OD.ProductID,ProductName,SupplierID,CategoryID ) Select PS.*, Rank() Over (Order By ExtAmount Desc) As Rank_Sales, Rank() Over (Order By DiscAmount Desc) As Rank_Disc, Rank() Over (Order By Qty Desc) As Rank_Volume, Rank() Over (Order By NoOrders Desc) As Rank_Orders From ProductSummary PS Order By ProductName Examine the syntax of the OLAP specification for a moment: Rank() Over (Order By ExtAmount Desc) As Rank_Sales, To assign a numeric rank to a set of rows, use the Rank() Over syntax. This is followed by an Order By expression that defines for SQL how it should rank each row in the set. In the above example, DB2/400 is instructed (after assembling the base result set) to rank the rows in the results by the Extended Amount column from high to low. This new OLAP column is given the alias Rank_Sales and when the query is run, each row in the results will have an assigned sales rank (compared to other rows in the set) by sales from high to low. The first table associated with this story, which you can view by clicking here, illustrates the results of this query (based on Microsoft ‘s classic Northwind sample data). It would have taken a boatload of code to get the same results prior to V5R4. Looking at the results of the query, management can immediately discern which products (in relation to the others) are:
Since RANK assigns a position to each row in a given result set, what happens if there is a tie? For example, two rows with the exact same sales amount. The answer is that tied values are assigned the same rank. If, in the above example, the two top products have identical sales figures then both products will be assigned first rank. However, the next highest product will be placed as third and the rank of second will be absent. If two Olympians win the gold, no one gets the silver. So it is with OLAP ties using RANK. If you do not want this behavior exhibited when there are ties, then use the DENSE_RANK function. This function is identical to RANK except that it doesn’t include ties and therefore there will be no gaps in the numbering. Alternatively, notice also that the chances of a tie are reduced by specifying additional columns in the OLAP ORDER BY. As with numbering, RANK and DENSE RANK can also utilize a partitioning expression to reset the ranking for a change in value of a given column or columns. If, in the above example we wanted to modify our query to reset its ranking after each category we could rewrite our query to something like this: With ProductSummary As ( Select OD.ProductID,ProductName,SupplierID,CategoryID, Sum(Quantity) As Qty, Sum(Quantity * OD.UnitPrice * (1-Discount)) As ExtAmount, Sum(Quantity * OD.UnitPrice * Discount) As DiscAmount, Count(Distinct OrderID) As NoOrders From OrderDetails OD Left Join Products P On P.ProductID=OD.ProductID Group By OD.ProductID,ProductName,SupplierID, CategoryId ) Select CategoryName,ProductName,ProductId, Qty,ExtAmount,DiscAmount,NoOrders, Rank() Over (Partition By PS.CategoryId Order By ExtAmount Desc) As Rank_Sales, Rank() Over (Partition By PS.CategoryId Order By DiscAmount Desc) As Rank_Disc, Rank() Over (Partition By PS.CategoryId Order By Qty Desc) As Rank_Volume, Rank() Over (Partition By PS.CategoryId Order By NoOrders Desc) As Rank_Orders From ProductSummary PS Join Categories CT On CT.CategoryID=PS.CategoryID Order By CategoryName, ProductName Table 2 for this story, which you can view by clicking here, shows the results of this query. Please note that the query’s ORDER BY clause has no affect on the OLAP specification’s ORDER BY. The query ORDER BY was only changed in this example to make the ranking results easier to understand by sorting the rows in the same order as the primary ranking. All in all, the OLAP functions in DB2/400 in i5/OS V5R4 are a great tool to have and will save developers much effort when assigning numbers or ranking a result set row by row. The further good news is that these OLAP functions are included in other products such as DB2 for Windows and Microsoft SQL Server 2005. Start looking for ways to incorporate OLAP ranking into your queries. Management will be wowed at the information you provide. RELATED STORIES SQL Goodies in DB2 for i5/OS V5R4, Part 1 SQL Goodies in DB2 for i5/OS V5R4, Part 2 V5R4 Improvements to the i5/OS OLE DB Providers Michael Sansoterra is a programmer/analyst for i3 Business Solutions, an IT services firm based in Grand Rapids, Michigan. Click here to contact Michael Sansoterra by e-mail. |