Guru: Phonetic Functions In SQL, Part 1
September 17, 2018 Paul Tuohy
In my next two articles I am going to discuss the use of phonetic functions in SQL. You can use phonetic functions to select or order rows based on the phonetic sound of a string as opposed to the actual characters in the string. The obvious use of phonetic functions is with names, but they can be used with any string columns.
I must admit that this touches on one of my pet peeves — the spelling of my surname. I have lost count of the number of times I have had to spell my name two, three, or four times for someone on the phone — and they still get it wrong. (Why is it that when I spell TUO, they hear TOU?) I have also lost count of the number of times I have had to dig out an account number because I “don’t appear to have an account.” All of this could be avoided if the person at the other end was using a “fuzzy” search with a phonetic function.
In order to demonstrate the use of phonetic functions, I created the following table:
CREATE OR REPLACE TABLE PHONETIC ( TESTNO INTEGER NOT NULL DEFAULT , BASE VARCHAR(20) NOT NULL DEFAULT, NAME VARCHAR(20) NOT NULL DEFAULT );
There will be a number of rows for a test (identified by TESTNO). Each row, for a test, will have the same BASE value and a different value for NAME.
Before getting into the nitty gritty, you should be aware that phonetic functions may not give you what you expect. There are so many variables to take into account — language, national pronunciation, and regional pronunciation, to name a few. Therefore, you should not treat the result of a phonetic function in the same way as you would standard column values.
SOUNDEX
The starting point for phonetic functions is SOUNDEX: it is a scalar function which returns a four-character phonetic code for a string. Although a standard function in every database, the SOUNDEX algorithm was actually patented in 1918. If anyone has ever searched a genealogy database, there was probably a SOUNDEX search option available.
Let’s start by looking at what SOUNDEX gives us with some of the various spellings of my name, using the following SELECT statement.
SELECT BASE, NAME, CASE WHEN SOUNDEX(BASE) = SOUNDEX(NAME) THEN 'Match' ELSE 'No Match' END SOUNDEX, SOUNDEX(BASE) BASE_SX, SOUNDEX(NAME) NAME_SX FROM PHONETIC WHERE TESTNO = 1 ORDER BY NAME;
The result set is as follows:
This is just a small sample of different spellings of my name that I have received in correspondence over the years. The phonetic sound of my name is 2E (which I got as well). See why I get peeved? So many different spellings of my name return the same SOUNDEX phonetic code.
Another use of SOUNDEX is when it comes to ordering rows. This statement orders rows by NAME:
SELECT NAME FROM PHONETIC WHERE TESTNO = 4 ORDER BY NAME;
The “problem” with the resulting sequence is that the “Smiths” are separated (phonetically) by other rows.
If, on the other hand, we had ordered the rows using SOUNDEX:
SELECT NAME FROM PHONETIC WHERE TESTNO = 4 ORDER BY SOUNDEX(NAME);
The resulting rows will be ordered by their phonetic code:
Apart from the fact it is based on U.S. English, the major problem with SOUNDEX is that the first character of the string is always returned as the first character of the phonetic code. This means that we lose certain combinations, such as words beginning with ‘PH’. The result set for test 2 highlights this problem:
The first three rows show as “No Match” although they are phonetically the same. If you look at the SOUNDEX codes for the two columns you will note that the only difference is with the first character.
DIFFERENCE
One approach to the SOUNDEX problem is us the DIFFERENCE function. The DIFFERENCE function returns a value from 0 to 4 representing the difference between the sounds of two strings based on applying the SOUNDEX function to the strings. A value of 4 is an exact match. If we apply DIFFERENCE to the test 2 data, as follows:
SELECT BASE, NAME, CASE WHEN SOUNDEX(BASE) = SOUNDEX(NAME) THEN 'Match' ELSE 'No Match' END SOUNDEX, CASE DIFFERENCE( BASE, NAME) WHEN 4 THEN 'Hit' WHEN 3 THEN 'Ballpark' WHEN 2 THEN 'Middle' WHEN 1 THEN 'Faint Hope' ELSE 'No Hope' END DIFFERENCE, SOUNDEX(BASE) BASE_SX, SOUNDEX(NAME) NAME_SX, DIFFERENCE( BASE, NAME) DIFFVALUE FROM PHONETIC WHERE TESTNO = 2 ORDER BY NAME;
We see that DIFFERENCE gives us a value of 3 for the three rows that SOUNDEX missed:
Unfortunately, DIFFERENCE is just performing a calculation between the resulting SOUNDEX codes. This means that you end up with some surprising results, as shown with the result set for test 3.
Basic Functionality
The standard SOUNDEX and DIFFERENCE functions provide basic functionality when it comes to phonetic processing. But what if you want to use a language other than U.S. English or really take care of that static first character in SOUNDEX? Why, you write your own function, which we will look at in the next article.
Paul Tuohy, IBM Champion and author of Re-engineering RPG Legacy Applications, is a prominent consultant and trainer for application modernization and development technologies on the IBM Midrange. He is currently CEO of ComCon, a consultancy firm in Dublin, Ireland, and partner at System i Developer. He hosts the RPG & DB2 Summit twice per year with partners Susan Gantner and Jon Paris.
Very cool mr 2e! Thanks.