Guru: DISTINCT Can Hide A Performance Problem
September 28, 2020 Ted Holt
When I see the word DISTINCT in an SQL query, a little red flag goes up inside my head. Not literally, of course. But it does make me pause and scrutinize the query more closely. I have found that poorly designed queries sometimes include the word DISTINCT as a final act of redemption to forcibly return the proper result set.
The purpose of DISTINCT is to remove duplicate rows from a result set. As the DB2 for i SQL reference puts it:
The keyword DISTINCT is not considered an argument of the function, but rather a specification of an operation that is performed before the function is applied. If DISTINCT is specified, redundant duplicate values are eliminated.
And that’s great. DISTINCT is powerful. And useful. And abused.
To illustrate the abuse, we need a couple of tables. Here is some information about manufacturing operations:
select OrderNbr, Operation, JobOn, EmpID from mfgopers order by OrderNbr, Operation
ORDERNBR | OPERATION | JOBON | EMPID |
1001 | 10 | 2020-09-28 09:00:04 | 3 |
1001 | 20 | 2020-09-28 09:12:38 | 4 |
1001 | 25 | 2020-09-28 10:02:41 | 3 |
1002 | 12 | 2020-09-28 09:01:10 | 1 |
1003 | 10 | 2020-09-28 09:05:15 | 5 |
And here’s some information about employees.
select e.clock, e.name from emps as e order by e.clock;
CLOCK | NAME |
1 | Billy Rubin |
2 | Van Tastick |
3 | Polly Fonnick |
4 | Sal Monella |
5 | Will D. Beaste |
Now to solve a real problem! Find the clock numbers and names of all the employees who worked on order 1001.
select distinct op.empid, e.name from mfgopers as op join emps as e on op.empid = e.clock where op.ordernbr = 1001 order by op.empid
EMPID | NAME |
3 | Polly Fonnick |
4 | Sal Monella |
The result set is accurate, so what’s the problem? Just this: the query joins before eliminating duplicates. For this little Mickey Mouse example, that’s no big deal, but this example is not a real-world problem. The queries that I see that use this technique often access half a dozen tables or more.
Here’s the same query without DISTINCT.
select op.empid, e.name from mfgopers as op join emps as e on op.empid = e.clock where op.ordernbr = 1001 order by op.empid
EMPID | NAME |
3 | Polly Fonnick |
3 | Polly Fonnick |
4 | Sal Monella |
What has typically happened is that the “designer” of the query realized he had almost what he wanted and saw DISTINCT as an easy way to weed out the duplicates.
The better approach is to eliminate the duplicates before joining. Depending on the number of tables involved, the number of rows in each table, and the availability of indexes, this can make a significant difference in performance.
One of my favorite ways to remove the duplicates is to put DISTINCT in a common table expression, like this:
with SelectEmps as (select distinct op.empid from mfgopers as op where op.ordernbr = 1001) select s.empid, e.name from SelectEmps as s join emps as e on s.empid = e.clock order by 1
Depending on which tables the columns are selected from, you may be able to use a subquery.
select e.clock, e.name from emps as e where e.clock in (select empid from MfgOpers where ordernbr = 1001) order by 1
What happened to DISTINCT? The fact is that you can include it or not in the subquery.
where e.clock in (select distinct empid
I’ve always been told that DISTINCT doesn’t matter when using the IN predicate with a subquery.
The lesson here is that DISTINCT is not a quick fix for a bad join. When you see DISTINCT in a query, you may want to double-check that it’s not covering up a performance problem.