DB2 For i Table Function Performance Considerations
March 19, 2014 Michael Sansoterra
In my prior tip, I covered some guidelines on when to use scalar user defined functions (UDF) and the potential performance hit they can impose on a query. In this tip, I’m going to follow up with a few additional guidelines about user defined table functions (UDTFs) in DB2 for i and show how they may impact query performance. Once again, I’m using data from the Microsoft AdventureWorks sample database that has been exported to DB2 for i. Shown below is a sample UDTF named “Transactions” that combines the data from the transaction history and transaction history archive tables: CREATE OR REPLACE FUNCTION Transactions (@FROM TIMESTAMP, @TO TIMESTAMP) RETURNS TABLE ( TRANSACTIONID INTEGER, PRODUCTID INTEGER, REFERENCEORDERID INTEGER, REFERENCEORDERLINEID INTEGER, TRANSACTIONDATE TIMESTAMP, TRANSACTIONTYPE NCHAR(1), QUANTITY INTEGER, ACTUALCOST DECIMAL(19, 4), MODIFIEDDATE TIMESTAMP, SOURCE VARCHAR(10) ) LANGUAGE SQL READS SQL DATA NO EXTERNAL ACTION DETERMINISTIC DISALLOW PARALLEL CARDINALITY 1000 RETURN SELECT * FROM ( SELECT TransactionId,ProductId, ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType, Quantity,ActualCost,ModifiedDate, 'Open' AS Source FROM TransactionHistory TH UNION ALL SELECT TransactionId,ProductId, ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType,Quantity,ActualCost,ModifiedDate, 'Archive' AS Source FROM TransactionHistoryArchive TH ) Transactions WHERE Transactions.TransactionDate BETWEEN @FROM AND @TO; Here is the benefit of the Transactions UDTF: It is available for any developer to use and it makes everyone’s life easier by combining both transaction history tables in a single function, thus preventing developers from having to repeat this logic everywhere it is required in the app. With this function, there is the usual parameter input and table output definitions. This UDTF body consists of a single RETURN statement featuring a SELECT query with a column output that is compatible in number of columns and data type to the function’s return table definition. As shown in the UDTF definition, a cardinality (row count) is defined. This is an estimate for DB2 as to how many rows, on average, are expected to be returned from the function. In this example, it’s estimated that 1,000 rows will be returned from the function on average. When using the UDTF in a query involving joins to other tables, views and UDTFs, the cardinality value is considered when DB2 figures out the order in which to process the tables, views and UDTFs. For instance, say DB2 expects a function to return 20 rows and therefore places it last in the processing order of a query as a nested loop join. If the function actually returns 10,000 rows, then having the UDTF processed last could very well be a mistake and cause the query to run slow. In addition to selecting the table processing order, cardinality estimates also affect operations such as a sort. When DB2 has to sort, it needs to allocate memory to do so. Allocating too little or too much memory can be problematic so accuracy when estimating how many rows (on average) to expect is important. If DB2 estimates a sort operation will sort 20 rows it will allocate memory accordingly. If the sort operation finds that it actually has to sort 100K rows, then it will be done inefficiently because the access plan was built for a small number of rows. When specifying the cardinality, the best a developer can do is guess on average how many rows will be returned. When I’m not sure what to pick, I estimate a little high. The following invocation of the Transactions UDTF with a large date range spanning one year returns over 100K rows. SELECT * FROM TABLE(Transactions( TIMESTAMP('2003-07-10-00.00.00'), TIMESTAMP('2004-07-09-00.00.00'))) x Even so, Visual Explain shows the query’s access plan for this statement as follows:
The complete query plan only consists of the two steps shown above. Notice when incorporated into a parent query, the UDTF is a “black box.” A UDTF operator is shown and there is no additional info about what tables the UDTF uses or how it goes about processing its logic. Therefore the accuracy of the row count is only as good as the developer’s guess. Let’s add a predicate to the above query to only select transactions associated with work orders: WHERE TransactionType='W' Now the Visual Explain representation of the access plan shows the following:
The plan shows DB2 is now estimating 100 rows to be returned, but the query actually returns 35K rows. How’d DB2 come up with the 100-row estimate? When applying the “equals” predicate to the estimated row count of the table function, DB2 assumes that it will only wind up with 10 percent of the original 1000 rows. If this query was run against a normal table or view, the estimated row count in the “final select” would be much closer to the actual rows because the database manager can use indexes and column statistics to accurately estimate the number of rows. For example, because of column statistics DB2 may know that about 30 percent of the values in the TransactionType column are a ‘W’ and it can use that percentage to approximate the row count. But alas, a table function “masks” these statistics so DB2 has to apply some generic rules for each predicate to estimate the row count:
These estimating rules are somewhat crude because even if 90 percent of the rows returned from the table function will contain a transaction type of ‘W’, DB2 is still going to assume that only 10 percent of the rows will be returned. Further, these rules are cumulative. Watch what happens if I add another predicate to the WHERE clause (say AND Quantity=2): SELECT * FROM TABLE(Transactions( TIMESTAMP('2003-07-10-00.00.00'), TIMESTAMP('2004-07-09-00.00.00'))) x WHERE TransactionType='W' AND Quantity=2 With that change to the WHERE clause, DB2 estimates the query will return 10 rows. (1000 rows x 10 percent x 10 percent = 10 rows!) Why is this row count important? Because a UDTF can cause DB2 to make some inaccurate estimates about how to run the query. How big of an impact depends on how far the row count estimates are off and how many steps there are in the query. I’ll say more on this in a bit. Alternatives To UDTFs I generally like to code SQL within a routine, having come to believe that static parameterized statements are likely to be more efficient. So early on, I found myself coding quite a few table functions. Even though it leaves more work when coding, let’s look at using a VIEW as an alternative to the UDTF. The view looks like this: CREATE VIEW vTransactions AS SELECT * FROM ( SELECT TransactionId,ProductId,ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType,Quantity,ActualCost,ModifiedDate, 'Open' AS Source FROM TransactionHistory TH UNION ALL SELECT TransactionId,ProductId,ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType,Quantity,ActualCost,ModifiedDate, 'Archive' AS Source FROM TransactionHistoryArchive TH ) Transactions ; Now what if the view is invoked with the same date range and work order filter as the UDTF query: SELECT * FROM vTransactions WHERE TransactionType='W' AND TransactionDate BETWEEN '2003-07-10-00.00.00' AND '2004-07-09-00.00.00'; Both versions of the query (“UDTF-Simple” and “View”) run in about the same amount of time, as shown in Table 1.
Table 1–Execution times (seconds) for the various query methods. Ignore the last column in Table 1 for now. Interesting, though, is the fact that with the view version of the query, Visual Explain shows the estimated row count from the query as 32912:
The actual count row count is 35441. With the view, DB2 uses existing indexes and statistics on the underlying tables to approximate the row count; this is certainly a far better estimate than the 100-row estimate from the UDTF. While the performance of the queries is, for all practical purposes, the same, the row count estimates are quite different. If this UDTF is incorporated in a complex query, DB2 could end up picking a suboptimal plan. Having an accurate row count is important because DB2 picks the table processing order, plans for memory consumption, etc., based on these estimates. Adding Complexity To The Query Since UDTFs are often queried in conjunction with other tables or views, I’ll add a little complexity to the queries and see how they compare. The query will search for sales order info in the transactions, and will also look for a particular product ID: SELECT * FROM TABLE(Transactions( TIMESTAMP('2003-07-10-00.00.00'),TIMESTAMP('2004-07-09-00.00.00'))) T JOIN SalesOrderHeader SOH ON SOH.SalesOrderId=T.ReferenceOrderId JOIN SalesOrderDetail SOD ON SOD.SalesOrderId=T.ReferenceOrderId AND SOD.ProductId=T.ProductId WHERE TransactionType='S' -- Sales Order AND SOD.ProductId=779 The query with the view can be similarly revised as follows: SELECT * FROM vTransactions T JOIN SalesOrderHeader SOH ON SOH.SalesOrderId=T.ReferenceOrderId JOIN SalesOrderDetail SOD ON SOD.SalesOrderId=T.ReferenceOrderId AND SOD.ProductId=T.ProductId WHERE SOD.ProductId=779 AND TransactionType='S' AND TransactionDate BETWEEN '2003-07-10-00.00.00' AND '2004-07-09-00.00.00' The next table shows how the more complicated queries compare. In this case, the version of the query that uses the view is the winner.
Table 2–Comparison of the UDTF and the view version of the complex queries (trial time given in seconds). The main reason that the view is so much faster is because there is an index on product ID on the underlying tables that DB2 can utilize. Because there is no product ID filter within the UDTF, the UDTF has to select all of the transaction rows within the date range and then filter on the product ID afterward. Of course, you can always add product ID or any parameter to the function. But expanding a function to include new criteria can cause maintenance headaches. For the most part, when it comes to non-trivial queries again the database, I find myself continually coding against base tables to preserve performance and having to change routines for new scenarios. It also makes the code more difficult to maintain but it’s worth it. Adding A Sort Step Look what happens if the following ORDER BY clause is added to the queries: ORDER BY SOH.OrderDate DESC,SOH.CustomerId,SOH.SalesOrderId The UDTF query performance increases to several seconds while the view query is minimally impacted. Visual Explain offers a clue as to how the plan was built. In particular, there is a new sorting step added to the query. Figure 4 shows the sort operator “estimate” information for the UDTF query, and Figure 5 shows the same information for the view query.
The UDTF estimate (Figure 4) looks preferable as it predicts that it will be sorting approximately 13 rows with 11KB of memory to hold the sorted list. The view estimate (Figure 5) predicts 689 rows with a 610KB memory footprint. Both queries return the same 618 rows. However, since the UDTF query access plan “low balls” the number of rows to sort as 12, it didn’t allocate enough memory to run this particular query efficiently. In the SQL Server world, when this low query memory situation occurs, things get slow because the sorted list is spilled over to disk. DB2 could be doing something similar. Fortunately though, after the same query with the same parameter set is run several times, DB2 seems to adjust things so that the performance eventually is almost as good as the view query’s. However, when I adjust one or more of the parameters, the performance comparatively suffers again and improves after several executions. Therefore, overall I still trust the view version of the query as the better performer. These results don’t always mean that a view will always perform better than a UDTF. The results of a view vs. UDTF may vary greatly depending on the nature of the queries. In case you didn’t see it, the results are a bit contrived because I picked a date range consisting of 100K rows instead of the estimated 1,000. Over a smaller number of rows, the performance differences will decrease. However, there are many times when queries against large tables end up taking a dive because of the reasons discussed here. The point is a developer or DBA has to consider carefully how DB2 is building an access plan. In particular, a UDTF may be shielding DB2 from optimizing a query because its “black box” effect prevents DB2 from getting underlying column statistics and from applying filters directly to the base tables. UDTFs With A Compound Statement Body I want to sidetrack for a common UDTF variation. The above table UDTF is very simple as its body consists of a RETURN statement with a relatively simple query. It is also possible to code UDTFs that call stored procedures, loop through cursors, execute loops, and other conditional logic. While executing, these types of UDTFs usually accumulate and store the rows the UDTF intends to return in a temporary table. When the function completes, the temporary table is queried and returned as the UDTF result. Here is the same Transactions UDTF shown again, but this time it’s written using a compound statement, meaning everything in between the BEGIN/END keywords is a compound statement: CREATE OR REPLACE FUNCTION Transactions_Compound (@FROM TIMESTAMP, @TO TIMESTAMP) RETURNS TABLE ( TRANSACTIONID INTEGER, PRODUCTID INTEGER, REFERENCEORDERID INTEGER, REFERENCEORDERLINEID INTEGER, TRANSACTIONDATE TIMESTAMP, TRANSACTIONTYPE NCHAR(1), QUANTITY INTEGER, ACTUALCOST DECIMAL(19, 4), MODIFIEDDATE TIMESTAMP, SOURCE VARCHAR(10) ) LANGUAGE SQL MODIFIES SQL DATA NO EXTERNAL ACTION DETERMINISTIC DISALLOW PARALLEL CARDINALITY 1000 SET OPTION COMMIT=*NONE,DATFMT=*ISO,USRPRF=*OWNER BEGIN DECLARE GLOBAL TEMPORARY TABLE Transactions ( TRANSACTIONID INTEGER, PRODUCTID INTEGER, REFERENCEORDERID INTEGER, REFERENCEORDERLINEID INTEGER, TRANSACTIONDATE TIMESTAMP, TRANSACTIONTYPE NCHAR(1), QUANTITY INTEGER, ACTUALCOST DECIMAL(19, 4), MODIFIEDDATE TIMESTAMP, SOURCE VARCHAR(10)) WITH REPLACE; INSERT INTO Session.Transactions SELECT * FROM ( SELECT TransactionId,ProductId,ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType,Quantity,ActualCost,ModifiedDate, 'Open' AS Source FROM TransactionHistory TH UNION ALL SELECT TransactionId,ProductId,ReferenceOrderId,ReferenceOrderLineId, TransactionDate,TransactionType,Quantity,ActualCost,ModifiedDate, 'Archive' AS Source FROM TransactionHistoryArchive TH ) Transactions WHERE Transactions.TransactionDate BETWEEN @FROM AND @TO; -- Maybe do some other logic here... RETURN SELECT * FROM Session.Transactions; END; The logic in this UDTF is the same as the original so that a performance comparison can be done between the two. The main difference is a temp table is now involved. Normally, a developer wouldn’t write a UDTF in this manner since the same thing can be accomplished with a single SELECT. However, I’ve seen developers implement this type of function when the same thing could’ve been done in a single query statement. It’s worth reviewing the performance penalty for this kind of mistake. When using a table function with a compound statement, the results obviously have to be saved first in some kind of structure (in this case a temporary table) and then read by the function again in order to return it as a result set. The fourth column in Table 1 demonstrates that the same version of the UDTF that implements the compound statement runs much slower than its simpler counterpart. In fact, when querying an entire year of data, the second version of the UDTF runs about four times as slow. The lesson is that if you can make your table function return all of the data in a single statement, then do it, unless the query is so complicated it makes sense to break it into pieces. If you need to use logic, such as looping constructs or cursors, and then save your data, the function will obviously incur a performance tax. When possible, I aim to have UDTFs consist of a single RETURN statement. External UDTFs External UDTFs are written in a high-level language such as Java, C, RPG or COBOL. By design, DB2 executes these external UDTFs multiple times, with each successive call DB2 asking the HLL program for another row of data (e.g., 1,000 rows = 1,000+ calls to the HLL program). DB2 stores the rows it collects in a QTEMP table. This process can have drastic performance consequences so be careful when deciding whether to use an external table function. For normal database queries, an SQL UDTF is the way to go. If you need to do something eccentric with a table function like call an API, invoke a web service, or extract information about the files from the header of a zip file archive, then an external table function in C, RPG or Java makes sense. Points To Consider
In summary, UDTFs provide a great way to encapsulate and reuse common logic. However, when incorporating UDTFs into complex queries, they can be a black box to DB2 for i. Unlike a view, the table function’s code is not expanded into the host query’s plan, which can limit the options DB2 has for running the query. If performance is your greatest concern, then it is often beneficial to use a view, which is placed in the host query’s plan. Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page. RELATED STORIES DB2 For i Scalar Function Performance Considerations DB2 For i, Java, And Regular Expressions Away With The SQL Slash Separator Dynamically Invoke SQL UDFs From RPG Code Include C Utilities in Your Developer Library: Evaluating a Mathematical String Expression
|