Guru: The Uncertainty of Redundant Row Selection
February 8, 2021 Ted Holt
Is redundancy good or bad? I say it depends. According to Nassim Nicholas Taleb, “Redundancy is ambiguous because it seems like a waste if nothing unusual happens. Except that something unusual happens — usually.” I have seen some unusual behavior when joining database tables, but try as I might, I can’t figure out what that unusual behavior depends on. Let me show you what I mean.
First, we need some data for illustration. Let’s say that our company uses an ERP system that was designed for make-to-stock manufacturing. What the factory builds goes into the warehouse, and customer orders are filled from the stock in the warehouse. There is no link between sales orders and manufacturing orders.
However, the Powers That Be decided years ago, before we came onto the scene, that the business should operate on a make-to-order basis, which requires manufacturing orders to be linked to sales orders. That presented a problem for IT (or MIS or Data Processing or whatever they were calling business computing in those days). Some enterprising programmer suggested they store sales order and sales order line number in an unused text field for notes in the manufacturing order, and it’s been that way since.
Here are the tables.
create table sohdr ( OrderNbr dec(5) primary key, CustNbr dec(5), OrderDate dec(8)); create table sodtl ( OrderNbr dec(5), LineNbr dec(2), ItemNbr char(6), Quantity dec(3), Price dec(5,2), primary key (OrderNbr, LineNbr)); create table mo ( OrderNbr dec(5) primary key, Item char(6), Quantity dec(3), SchedDate dec(8), Notes char(30)); insert into sohdr values ( 100, 220, 20210115), ( 101, 318, 20210115), ( 102, 190, 20210116); insert into sodtl values ( 100, 1, 'AB-101', 12, 1.00), ( 100, 2, 'DA-033', 10, 20.00), ( 101, 1, 'AB-101', 6, 1.25), ( 102, 1, 'JH-712', 9, 2.00); insert into mo values ( 751, 'AB-101', 12, 20210118, '0010001');
There is a one-to-many relationship between sales order header and sales order detail, and a one-to-one relationship between sales order detail and manufacturing order.
Now it’s time to query the database. Someone we serve wants to know the customer we’re selling to, the items they’re purchasing, and the amount of money per item that we expect to get from the customer for sales order number 100.
-- Query 1 select h.OrderNbr, h.CustNbr, d.ItemNbr, d.Quantity * d.Price as Extended from sohdr as h join sodtl as d on h.OrderNbr = d.OrderNbr where h.OrderNbr = 100;
The query selects one row (record) from the sales order header table and one or more rows from the sales order detail. The desired row is indicated by the where clause.
Of course, the query will also only select detail rows for order 100, due to the nature of the equi-join operation. We don’t have to explicitly tell SQL that we only want detail rows for order 100, but we could, like this:
-- Query 2
select h.OrderNbr, h.CustNbr, d.ItemNbr,
d.Quantity * d.Price as Extended
from sohdr as h
join sodtl as d
on h.OrderNbr = d.OrderNbr
where h.OrderNbr = 100
and d.OrderNbr = 100;
What do you think of the new WHERE clause? Not only did I test the header for order 100, I tested the detail table as well. I call this a redundant test. If you know of a formal term that’s already used for an unnecessary test against a secondary table, please let me know.
Does the redundant test improve the query? When I ran this code, I got worse performance with the redundant test than I did with Query 1. Visual Explain showed an additional logic step for Query 2. That strikes me as strange.
Next, I tried moving the redundant test into the JOIN.
-- Query 3
select h.OrderNbr, h.CustNbr, d.ItemNbr,
d.Quantity * d.Price as Extended
from sohdr as h
join sodtl as d
on h.OrderNbr = d.OrderNbr
and d.OrderNbr = 100
where h.OrderNbr = 100;
Performance of Query 3 was even worse than that of Query 2. Query 1, the simplest of the three, performed best.
Now let’s add the manufacturing order to the mix. The recipient of our efforts has requested that we add the manufacturing order number and the date the product is to be built. They also only want to know about line 1 of the sales order.
-- Query 4 select h.OrderNbr, h.CustNbr, d.ItemNbr, d.Quantity * d.Price as Extended, m.OrderNbr, m.SchedDate from sodtl as d join sohdr as h on h.OrderNbr = d.OrderNbr join mo as m on d.OrderNbr = dec(substr(m.Notes,1,5),5,0) and d.LineNbr = dec(substr(m.Notes,6,2),2,0) where d.OrderNbr = 100 and d.LineNbr = 1;
How do you like that join from sales order detail to manufacturing order? I wish I could say that that sort of join is rare, but I’ve seen it many, many times over the years.
What will a redundant test do for this query?
-- Query 5
select h.OrderNbr, h.CustNbr, d.ItemNbr,
d.Quantity * d.Price as Extended,
m.OrderNbr, m.SchedDate
from sodtl as d
join sohdr as h
on h.OrderNbr = d.OrderNbr
join mo as m
on d.OrderNbr = dec(substr(m.Notes,1,5),5,0)
and d.LineNbr = dec(substr(m.Notes,6,2),2,0)
where d.OrderNbr = 100
and d.LineNbr = 1
and m.Notes = '0010001';
Believe it or not, performance improves!
I had a situation like this in the not too distant past. When I told Run SQL Statements to run the query without the redundant test, the system churned until I finally canceled the query in disgust. When I added the redundant test, the result set popped up immediately.
We human beings, it seems to me, are comforted by hard-and-fast rules. We like to hear “Always do this!” and “Never do that!” But life has led me to believe that the best we can hope for in most cases are guidelines. This certainly applies to query optimization. So here are your guidelines for redundant row-selection tests.
- A redundant test may improve the performance of a query.
- A redundant test may degrade the performance of a query.
I wish I could give you something better, friend, but I don’t have it. The good news is that you don’t have to tune every query. Just tune the ones that don’t run fast enough. A redundant test may not help, but it’s worth a shot.
Hi Ted,
I expect this difference in behaviour because of a simple reason. For columns where indexes are likely to exist, even temporary ones, there is no benefit to adding a redundant where clause.
For query5, an index is unlikely to be created on substrings of m.Notes. So, the entire mo table has to scanned and parsed to find a matching. When m.Notes = ‘0010001’ was added, the lookup became a whole lot easier as it could just scan through any temp indexes created for m.Notes column.
If feasible, you can create indexes on substr(m.Notes,6,2) and verify the performance.
Regards,
Anoop