Dealing With A MIN/MAX Quirk
March 15, 2016 Ted Holt
The MIN and MAX functions have been a part of SQL since the beginning. A database-kind-of-guy-or-gal would think that anything so established would be stable to the point of boredom, but a database-kind-of-guy-or-gal would be wrong. MIN and MAX don’t even work the same way in all implementations of SQL, and to top it off, they have a quirk that I come upon occasionally. MIN and MAX began as aggregate functions. As such, they locate the minimum and maximum non-null values within a column across a set of rows. On this, everyone in the database world is in agreement. The controversy arises with the use of MIN and MAX as scalar functions. Assume a physical file of manufacturing transactions. It has separate packed-decimal date and time fields. The date is seven digits in CYYMMDD format. The time is six digits in HHMMSS format. I want the last date a manufacturing order was completed (i.e., the manufacturing status is 40) for each item. select mxItem, max(mxDate) from mfgtrans where mxStatus = 40 group by mxItem order by mxItem Now for the quirk. Suppose I want the last date and time each operation was completed. This would be a no-brainer if the date and time were stored in a single timestamp column, but many readers of this august publication work with database tables that predate the introduction of the timestamp data type. Here’s one way that doesn’t work. select mxItem, max(mxDate, mxTime) from mfgtrans where mxStatus = 40 group by mxItem order by mxItem If you try to run this query, DB2 for i responds with message SQL0122 (Column MXDATE or expression in SELECT list not valid.) In other words, the MAX expression is not a grouping column. You might be tempted to omit GROUP BY. select mxItem, max(mxDate, mxTime) from mfgtrans where mxStatus = 40 order by mxItem This query runs, but it has two problems. First, when used as scalar functions, MIN and MAX return the smallest and largest values in a list. Here the MAX function compares the date and time and returns the greater of the two, which is not what we want. Some systems avoid this problem by using MIN and MAX only as aggregate functions and defining other functions for scalar use. Oracle, for example, has LEAST and GREATEST scalar functions. But the bigger problem is that the query no longer groups by item number, so let’s get back to the original challenge. Here’s one way to find the last date and time a manufacturing order was completed. select mxItem, max(mxDate * 1000000 + mxtime) from mfgtrans where mxStatus = 40 group by mxItem order by mxItem This makes a 13-digit number of the date and time. The system has no trouble finding the maximum value of the expression for each item. Here’s another method. select mxItem, max(digits(mxDate) concat digits(mxtime)) from mfgtrans where mxStatus = 40 group by mxItem order by mxItem I prefer this to the previous method under the general theory that string operations tend to tax the system less than arithmetic ones do. I could also convert the date and time into a timestamp field, but I am not courageous and adventurous enough to hack my way through the expression that would be required for that approach. Here’s a different approach. with Temp as (select m.*, row_number() over (partition by mxItem order by mxItem, mxDate desc, mxTime desc) as Row_Nbr from mfgtrans as m where mxStatus = 40) select mxItem, mxDate, mxTime from Temp where Row_Nbr = 1 order by mxItem I’ve used OLAP features instead of aggregate functions. The system sorts the data by item number, descending date and descending time. PARTITION BY is similar to GROUP BY. In RPG terms, you might think of this as a control break over the item number. The ROW_NUMBER function assigns a sequential number–named ROW_NBR–to each row in the result set, and because the data is partitioned on item number, renumbering restarts at 1 with each item. This result set makes up the TEMP common table expression. The main query (the second SELECT) examines the value of ROW_NBR to pick out only the first row per item. The advantage of this approach is the lack of derived (calculated) columns. How about that? One way to find the minimum or maximum doesn’t use the MIN and MAX functions. Who’d a thunk it?
|