Finding Duplicate Rows in a Name and Address Table
May 25, 2011 Skip Marchesani
Many companies still use the U.S. Postal Service to mail promotional and marketing materials to existing and potential customers. With rising printing expenses and postal rates, it’s important to minimize waste by identifying and eliminating duplicate rows in name and address tables. However, these tables can present some interesting challenges. For instance, the following two addresses are duplicates, but how do you identify them as potential duplicate rows? John Sampson John Samson Companies can write their own software, which has its own challenges, or purchase software from a third party, which can be quite expensive. This tip shows how to use an SQL inner join along with some simple algorithms to identify such rows. I developed this technique several years ago when cleaning up a name and address table used for mailing a conference brochure, and I still use this method today. This was an iterative process and there were several refinements before this technique produced the desired results. I chose the inner join because a result set row is returned only when a row in the left or outer table has a matching row in the right or inner table. Thus the name inner join (inner matching row). This discussion will review the process I went through to develop the technique and my sample data will be on a name and address table with the following attributes: Table Name: MAILLIST Total Rows: 30,000+
The first time I tried using the inner join, I thought it would be a simple matter of doing an inner join of the MAILLIST table to itself using the following SQL syntax. SELECT a.idnbr, a.fname, a.lname, a.compny, a.addrl1, a.addrl2, a.city, a.state, a.zip FROM maillist a INNER JOIN maillist b ON a.fname = b.fname AND a.lname = b.lname AND a.compny = b.compny AND a.addrl1 = b.addrl1 AND a.addrl2 = b.addrl2 AND a.city = b.city AND a.state = b.state AND a.zip = b.zip AND a.idnbr < > b.idnbr ORDER BY a.lname, a.compny a.fname Note that only the columns of the MAILLIST table containing address data are used as part of the join criteria (the OTHER DATA is not used), with the join condition for all but IDNBR being an “equal” condition. The join condition for IDNBR must be a “not equal” condition since this column provides unique identification for each row in the table. If the join condition for IDNBR is equal, every row in the table is returned in the result set. When I ran the above SQL statement, no rows were returned in the result set. The reason for this is obvious if you look at the duplicate rows for John Sampson at the beginning of this article. There are differences in the last name, company, address, and city for each of the John Sampson rows. Even with the differences between the two rows, mailing a letter to each of these addresses is going to result in two pieces of mail being delivered to the same address and person. This brings us to a definition of a duplicate row. From a pure database perspective, two rows would be considered duplicates when every position of every column in each row is the same. In other words, the two rows are a duplicate of each other when the data contained in each and every position for each row is identical. This is not the case with a name and address table used for mailing purposes. In this case, two rows may be a duplicate if the data contained in the same columns for each row contains similar data or data that closely matches but may not be an exact match. The trick is to figure out how to use SQL to identify such rows in the name and address table. After many iterations and much analysis I drew the following conclusions and was able to simplify the SQL syntax used for the inner join and therefore get better results. 1. A single company could have multiple street addresses for the same location in the same city and state. Including ADDRL1 (address line 1) and ADDRL2 (address line 2) has a negative impact on the result set (decreased the number of rows returned). Therefore, these two columns were not included as part of the join criteria. 2. Including the ZIP (zip code) column had no impact, positive or negative, on the result set. Therefore it was not included as part of the join criteria. 3. Spelling or keying errors, and/or differences in name representation for first and last name, company, and city were the root causes for no records being returned. Upon further analysis, I determined that spelling or keying accuracy was very high for the first three characters of a name string, with accuracy falling off dramatically from the fourth character on. The same was true for most differences in name representation; the first three characters of the name were the same. Therefore a subset of the data for a given column–the first three characters–could be used to get better results instead of the entire column. Based on the above conclusions I revised the SQL syntax for the inner join as follows. SELECT a.idnbr, a.fname, a.lname, a.compny, a.city FROM maillist a INNER JOIN maillist b ON a.state = b.state AND SUBSTR(a.fname,1,3) = SUBSTR(b.fname,1,3) AND SUBSTR(a.lname,1,3) = SUBSTR(b.lname,1,3) AND SUBSTR(a.compny,1,3) = SUBSTR(b.compny,1,3) AND SUBSTR(a.city,1,3) = SUBSTR(b.city,1,3) AND a.idnbr < > b.idnbr ORDER BY a.lname, a.compny, a.fname The first time I ran this SQL statement, only the following four rows were returned in the result set, and I knew there were more duplicates than those four rows. Let’s look at these four rows first before we discuss why they were the only ones returned.
The second two rows (IDNBRS 50168 and 50167) are probably not duplicates since the last name is followed by a Jr in one case and a Sr in the other, and it is very likely that a father and son could be working a the same company. The first two rows (IDNBRS 08314 and 40384) most likely are duplicates of each other, with the differences being the misspelling of the last name of Smerr (Smerr versus Smer) and the name representation for Metal Fabricators (Metal Fabricators versus Metal Fabrication Inc). If these two rows are indeed duplicates, which one do you delete? The answer in my case was the one that was most current. The row format for the MAILLIST table contains a column called LASTACTD or Last Activity Date, and the row with the oldest activity date was the one deleted. What if both records contained historical data that had value? In that case you would have to determine how to merge that data into one of the two rows before deleting the other row, assuming merging the data was possible. If you don’t have a way to determine which row contains the most current data, try flipping a coin. Now let’s address the question of why only four rows were returned. The cause was differences in name representation of company and city. Specifically it was abbreviations of company and city versus the complete spelling out of the name.
After several iterations of testing, I determined that the solution was to have two versions of the SQL syntax for the inner join as shown below. The first one included company and excluded city, and the second one excluded company and included city. CITY Excluded from Join Criteria SELECT a.idnbr, a.fname, a.lname, a.compny, a.city FROM maillist a INNER JOIN maillist b ON a.state = b.state AND SUBSTR(a.fname,1,3) = SUBSTR(b.fname,1,3) AND SUBSTR(a.lname,1,3) = SUBSTR(b.lname,1,3) AND SUBSTR(a.compny,1,3) = SUBSTR(b.compny,1,3) AND a.idnbr < > b.idnbr ORDER BY lname, compny, fname COMPNY Excluded from Join Criteria SELECT a.idnbr, a.fname, a.lname, a.compny, a.city FROM maillist a INNER JOIN maillist b ON a.state = b.state AND SUBSTR(a.fname,1,3) = SUBSTR(b.fname,1,3) AND SUBSTR(a.lname,1,3) = SUBSTR(b.lname,1,3) AND SUBSTR(a.city,1,3) = SUBSTR(b.city,1,3) AND a.idnbr < > b.idnbr ORDER BY lname, compny, fname When I ran each of the above SQL statements, I got the results that I was looking for. Over 400 rows were identified as potential duplicates with only four of those rows existing in both result sets. Using the above SQL syntax, the two rows for John Sampson at the beginning of this article would be correctly identified when CITY was excluded from the join criteria. In this article, this SQL syntax for inner join was used to identify potential duplicate rows in a single table, where the table was joined to itself. It could also be used to identify duplicate rows that exist in two different tables simply by inserting the table names in the appropriate place in the JOIN syntax of the FROM clause in the SELECT statement. Will the use of an inner join as shown in this article find every potential duplicate record? No, it will not. But in my case it did find about 95 percent of the duplicates, I didn’t have to purchase expensive code from a third party, and I didn’t have to write complex application code to do the job. In the end the SQL syntax required to accomplish the task is relatively simple and straightforward. Skip Marchesani retired from IBM after 30 years and decided to pursue a career as a consultant in the IBM i marketplace. He spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400, and was involved with the development of the AS/400. Skip was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Now recognized as an industry expert on DB2 for i (a.k.a. DB2/400), and author of the book “DB2/400: The New AS/400 Database,” Skip specializes in providing customized education for any area of the IBM i and AS/400, does database design and design reviews, and general IBM i, iSeries, and AS/400 consulting. He has been a speaker for user groups, technical conferences, and IBM i, iSeries, and AS/400 audiences around the world, and has received COMMON’s Distinguished Service Award. Skip Recently moved to Vermont and is now president of the Vermont Mid Range User Group (VTMUG). Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.
|