Guru: A Simple, Effective Way To Compare Data Using SQL
April 20, 2020 Ted Holt
One thing I have learned over the years is that no matter how much technology changes, people always ask the same questions like: “Why does the computer say we have 50 widgets when we only have 45?” and “Can you tell me who changed the due date of order number 12345?” and “How do sales of widgets this year compare to the same time last year?” It is the last of these that I wish to address today.
Comparing data is a very common activity. Everybody compares data: sales people, production people, purchasing people, and of course management people. This sort of thing is the reason people set up data warehouses. Fortunately, there is a good formula for answering comparison questions using SQL.
To compare data, you need five things:
- two or more data sets
- a full outer join
- the COALESCE function
- a selection of columns
- an ORDER BY clause
Let’s look at each one in turn. But first, some illustrative data:
create table saleshist (key dec(3) primary key, custid dec(3), datesold date, item char(5), quantity dec(3), price dec(5,2)); insert into saleshist (key, custid, datesold, item, quantity, price) values ( 1, 123, '2019-03-01', 'A15', 1, 3.00), ( 2, 456, '2019-03-01', 'A15', 2, 1.50), ( 3, 456, '2019-03-05', 'A15', 3, 1.50), ( 4, 456, '2019-03-05', 'X22', 1, 1.00), ( 5, 456, '2020-02-08', 'A15', 2, 2.00), ( 6, 456, '2020-03-08', 'A15', 2, 2.00), ( 7, 333, '2020-03-03', 'A15', 4, 1.50), ( 8, 567, '2019-03-10', 'A15', 2, 1.50), ( 9, 567, '2019-04-01', 'A15', 1, 1.50), ( 10, 567, '2020-03-01', 'D44', 2, 2.00);
Suppose you’ve been asked to compare sales of item A15 by customer for March of this year and March of last year. Where do you begin? You begin with two data sets. They may come from two separate tables or sets of tables. They may come from two views. They may be two SELECTS over the same table. They come from wherever the data is stored.
In my example, the data for both years is stored in the sales history table, SALESHIST, so I use two common table expressions, SALES2019 and SALES2020.
with sales2019 as (select custid, sum(quantity) as qty, sum(quantity * price) as sales from saleshist where item = 'A15' and datesold between '2019-03-01' and '2019-03-31' group by custid), sales2020 as (select custid, sum(quantity) as qty, sum(quantity * price) as sales from saleshist where item = 'A15' and datesold between '2020-03-01' and '2020-03-31' group by custid) . . . more to come ! ! ! . . .
Comparing data means that the two data sets need to be joined. A customer might have bought item A15 in only one year or in both years. This means that we must use a full outer join to get all the data.
select . . . from sales2019 as s19 full outer join sales2020 as s20 on s19.custid = s20.custid
That’s fabulous, but it doesn’t give the user any information. Let’s begin by listing all the summary columns. In this example, the only summary column is customer ID. But which customer ID do we get—the one from SALES2019 or the one from SALES2020? The only logical answer, of course, is “Yes!” Some customers bought item A15 both years. Others bought in 2019 or 2020, but not both years. The COALESCE function gives us the customer number no matter what.
select coalesce(s20.custid, s19.custid) as custid, . . . more columns . . . from sales2019 as s19 full outer join sales2020 as s20 on s19.custid = s20.custid
COALESCE returns the first non-null value in the list. We get the customer number whether it’s found in one data set or both data sets. Be sure to use COALESCE over all of the summary columns.
So far, so good, but the user needs sales figures — how many each customer bought and how much money they paid us.
select coalesce(s20.custid, s19.custid) as custid, s20.qty as "Qty 2020", s19.qty as "Qty 2019", s20.sales as "Sales 2020", s19.sales as "Sales 2019" from sales2019 as s19 full outer join sales2020 as s20 on s19.custid = s20.custid
I’ve grouped two the two quantity fields and the two sales figures, but how you group the columns depends on the task. Most likely you’ll give the output of such queries to people in Excel, and they’ll rearrange them as they want them.
The last thing to do is to present the result set in a logical sequence.
order by 1
Use the ORDER BY clause to sort the data any way you need to. The sequence will often be the summary columns, but that’s not necessarily so. In this example, ORDER BY 1 means to sort the result set on the first column, which is the customer ID.
Putting it all together, we have this query:
with sales2019 as (select custid, sum(quantity) as qty, sum(quantity * price) as sales from saleshist where item = 'A15' and datesold between '2019-03-01' and '2019-03-31' group by custid), sales2020 as (select custid, sum(quantity) as qty, sum(quantity * price) as sales from saleshist where item = 'A15' and datesold between '2020-03-01' and '2020-03-31' group by custid) select coalesce(s20.custid, s19.custid) as custid, s20.qty as "Qty 2020", s19.qty as "Qty 2019", s20.sales as "Sales 2020", s19.sales as "Sales 2019" from sales2019 as s19 full outer join sales2020 as s20 on s19.custid = s20.custid order by 1
And this result set:
CUSTID | Qty 2020 | Qty 2019 | Sales 2020 | Sales 2019 |
123 | – | 1 | – | 3.00 |
333 | 4 | – | 6.00 | – |
456 | 2 | 5 | 4.00 | 7.50 |
567 | – | 2 | – | 3.00 |
It’s a very simple formula, as simple as a pencil, which also has five components. Don’t let the simplicity fool you — it’s a powerful tool to have in your toolbox.
Great, simple idea, well presented. Thanks.