Guru: DISTINCT Can Hide A Performance Problem

September 28, 2020 Ted Holt

When I see the word DISTINCT in an SQL query, a little red flag goes up inside my head. Not literally, of course. But it does make me pause and scrutinize the query more closely. I have found that poorly designed queries sometimes include the word DISTINCT as a final act of redemption to forcibly return the proper result set.

The purpose of DISTINCT is to remove duplicate rows from a result set. As the DB2 for i SQL reference puts it:

The keyword DISTINCT is not considered an argument of the function, but rather a specification of an operation that is performed before the function is applied. If DISTINCT is specified, redundant duplicate values are eliminated.

And that’s great. DISTINCT is powerful. And useful. And abused.

To illustrate the abuse, we need a couple of tables. Here is some information about manufacturing operations:

select OrderNbr, Operation, JobOn, EmpID
  from mfgopers
 order by OrderNbr, Operation

ORDERNBR	OPERATION	JOBON	EMPID
1001	10	2020-09-28 09:00:04	3
1001	20	2020-09-28 09:12:38	4
1001	25	2020-09-28 10:02:41	3
1002	12	2020-09-28 09:01:10	1
1003	10	2020-09-28 09:05:15	5

And here’s some information about employees.

select e.clock, e.name
  from emps as e
 order by e.clock;

CLOCK	NAME
1	Billy Rubin
2	Van Tastick
3	Polly Fonnick
4	Sal Monella
5	Will D. Beaste

Now to solve a real problem! Find the clock numbers and names of all the employees who worked on order 1001.

select distinct op.empid, e.name
  from mfgopers as op
  join emps     as e   on op.empid = e.clock
 where op.ordernbr = 1001
 order by op.empid

EMPID	NAME
3	Polly Fonnick
4	Sal Monella

The result set is accurate, so what’s the problem? Just this: the query joins before eliminating duplicates. For this little Mickey Mouse example, that’s no big deal, but this example is not a real-world problem. The queries that I see that use this technique often access half a dozen tables or more.

Here’s the same query without DISTINCT.

select op.empid, e.name
  from mfgopers as op
  join emps     as e   on op.empid = e.clock
 where op.ordernbr = 1001
 order by op.empid

EMPID	NAME
3	Polly Fonnick
3	Polly Fonnick
4	Sal Monella

What has typically happened is that the “designer” of the query realized he had almost what he wanted and saw DISTINCT as an easy way to weed out the duplicates.

The better approach is to eliminate the duplicates before joining. Depending on the number of tables involved, the number of rows in each table, and the availability of indexes, this can make a significant difference in performance.

One of my favorite ways to remove the duplicates is to put DISTINCT in a common table expression, like this:

with SelectEmps as
  (select distinct op.empid
     from mfgopers as op
    where op.ordernbr = 1001)
select s.empid, e.name
 from SelectEmps as s
 join emps       as e on s.empid = e.clock
order by 1

Depending on which tables the columns are selected from, you may be able to use a subquery.

select e.clock, e.name
  from emps as e
 where e.clock in (select empid
                     from MfgOpers
                    where ordernbr = 1001)
 order by 1

What happened to DISTINCT? The fact is that you can include it or not in the subquery.

where e.clock in (select distinct empid

I’ve always been told that DISTINCT doesn’t matter when using the IN predicate with a subquery.

The lesson here is that DISTINCT is not a quick fix for a bad join. When you see DISTINCT in a query, you may want to double-check that it’s not covering up a performance problem.

IBM Knowledge Center — Aggregate Functions

This Issue Sponsored By

Table of Contents

Content archive

Recent Posts

Subscribe

Pages

Search