Guru: Phonetic Functions In SQL, Part 2
September 24, 2018 Paul Tuohy
In my previous article I discussed the use of the standard SOUNDEX and DIFFERENCE functions for phonetic processing. I also identified the major problems with SOUNDEX (based on U.S. English and the first character is static) and DIFFERENCE (loose because it is based on SOUNDEX).
In this article I will discuss how to tackle these problems by writing a customized phonetic function to use in place of or in conjunction with SOUNDEX. The good news is that you do not have to become an expert in phonetics — others have already done the job for you. There are a number of algorithms available on the web, such as caverphone, cologne_phonetic, metaphone, double_metaphone and refined_soundex.
I will be using the METAPHONE algorithm to demonstrate how you can write your own phonetic function.
As described in the first article, this is the table I created in order to demonstrate the use of phonetic functions:
CREATE OR REPLACE TABLE PHONETIC ( TESTNO INTEGER NOT NULL DEFAULT , BASE VARCHAR(20) NOT NULL DEFAULT, NAME VARCHAR(20) NOT NULL DEFAULT );
There will be a number of rows for a test (identified by TESTNO). Each row, for a test, will have the same BASE value and a different value for NAME.
The METAPHONE Function
The Metaphone algorithm was created by Lawrence Philips and was published in an article in “Computer Language” in December, 1990. Since then, Lawrence Philips has gone on to develop double_metaphone (for which the code is freely available) and Metaphone 3 (which is a commercial product sold as source code).
Although not a standard function in DB2, you will find a Metaphone function in some databases and as a standard function in some programming languages (such as PHP). Metaphone is different from SOUNDEX in that, instead of translating a string to a code, Metaphone returns a string of consonant sounds.
I have done my best to translate the Metaphone algorithm into SQL PL, but, please note, that the code has not been fully tested. Let the user beware!
This is the code to create the METAPHONE function:
CREATE OR REPLACE FUNCTION METAPHONE(stringInput varchar(70)) RETURNS varchar (25) LANGUAGE SQL SPECIFIC METAPHONE DETERMINISTIC CONTAINS SQL RETURNS NULL ON NULL INPUT NO EXTERNAL ACTION NOT FENCED SET OPTION ALWBLK = *ALLREAD , ALWCPYDTA = *OPTIMIZE , COMMIT = *NONE , DECRESULT = (31, 31, 00) , DFTRDBCOL = *NONE , DYNDFTCOL = *NO , DYNUSRPRF = *USER , SRTSEQ = *HEX /* Metaphone Function Created by Lawrence Philips. Published in an article in "Computer Language" December 1990 issue. */ BEGIN declare returnString varchar(25); declare processString varchar(70); declare currentChar char(1); declare twoChar char(2); declare threeChar char(3); declare previousChar char(1); declare nextChar char(1); declare addChar char(1); declare totalLength smallint; declare processCount smallint; set processString = ltrim(lower(stringInput)); set totalLength = length(processString); set processCount = 1; set returnString = ''; --Process beginning exceptions set twoChar = left(processString,2); if (twoChar in ('ae', 'gn', 'kn', 'pn', 'wr')) then set processString = right(processString , totalLength—1); set totalLength = totalLength—1; elseif (twoChar = 'wh') then set processString = 'w' concat right(processString , totalLength—2); set totalLength = totalLength—1; end if; set currentChar = left(processString,1); if (currentChar = 'x') then set processString = 's' concat right(processString , totalLength—1); end if; if (currentChar in ('a','e','i','o','u')) then set processString = right(processString , totalLength—1); set totalLength = totalLength—1; set returnString = currentChar; end if; set currentChar = ' '; while processCount <= totalLength do set previousChar = currentChar; set currentChar = substring(processString,processCount,1); set addChar = ''; if (processCount < totalLength) then set nextChar = substring(processString,processCount + 1,1); else set nextChar = ''; end if; if (nextChar <> currentChar or currentChar = 'c') then set twoChar = substring(processString,processCount,2); set threeChar = substring(processString,processCount,3); if (currentChar in ('f','j','l','m','n','r')) then set addChar = currentChar; elseif (currentChar = 'q') then set addChar = 'k'; elseif (currentChar = 'v') then set addChar = 'f'; elseif (currentChar = 'x') then set addChar = 'ks'; elseif (currentChar = 'z') then set addChar = 's'; elseif (currentChar = 'b') then set addChar = 'b'; if (processCount = totalLength and prevoiusChar = 'm') then set addChar = ''; end if; elseif (currentChar = 'c') then if ((twoChar = 'ch') or (threeChar = 'cia')) then set addChar = 'x'; elseif ((twoChar in ('ci','ce','cy')) and (previousChar <> 's') ) then set addChar = 's'; else set addChar = 'k'; end if; elseif (currentChar = 'd') then if (threeChar in ('dge','dgy','dgi')) then set addChar = 'j'; else set addChar = 't'; end if; elseif (currentChar = 'g') then set addChar = 'k'; if ( ( (twoChar = 'gh') and (processCount < (totalLength—1)) and (threeChar not in ('gha','ghe','ghi','gho','ghu') ) ) or (twoChar = 'gn') or ( (previousChar = 'd') and (twoChar not in ('ga','ge','gi','go','gu') ) ) ) then set addChar = ''; elseIf ( (previousChar <> 'g') or (twoChar in ('ge','gi','gy') ) ) then set addChar = 'f'; end if; elseif (currentChar = 'h') then if not ( ((previousChar in ('a','e','i','o','u')) and (twoChar not in ('ha','he','hi','ho','hu')) ) or (previousChar in ('c','s','p','t','g')) ) then set addChar = 'h'; end if; elseif ((currentChar = 'k') and (previousChar <> 'c') ) then set addChar = 'k'; elseif (currentChar = 'p') then if (twoChar = 'ph') then set addChar = 'f'; else set addChar = 'p'; end if; elseif (currentChar = 's') then if ((threeChar in ('sia','sio')) or (twoChar = 'sh') ) then set addChar = 'x'; else set addChar = 's'; end if; elseif (currentChar = 't') then set addChar = 't'; if (threeChar in ('tia','tio')) then set addChar = 'x'; elseif (twoChar = 'th') then set addChar = '0'; elseif (threeChar = 'tch') then set addChar = 't'; end if; elseif ((currentChar = 'w') and (twoChar in ('wa','we','wi','wo','wu')) ) then set addChar = 'w'; elseif ((currentChar = 'y') and (twoChar in ('ya','ye','yi','yo','yu')) ) then set addChar = 'y'; end if; end if; set returnString = returnString concat trim(addChar); set processCount = processCount + 1; end while; return returnstring; END
Without going through the code in detail, the basic principle is that the input string is processed character by character and character combinations are “translated” to a consonant sound (e.g. ‘z’ becomes ‘s’, ‘ph’ becomes ‘f’, all vowels are dropped).
Using the METAPHONE Function
This statement compares the result of the METAPHONE function with SOUNDEX and DIFFERENCE:
SELECT BASE, NAME, CASE WHEN SOUNDEX(BASE) = SOUNDEX(NAME) THEN 'Match' ELSE 'No Match' END SOUNDEX, CASE DIFFERENCE( BASE, NAME) WHEN 4 THEN 'Hit' WHEN 3 THEN 'Ballpark' WHEN 2 THEN 'Middle' WHEN 1 THEN 'Faint Hope' ELSE 'No Hope' END DIFFERENCE, CASE WHEN METAPHONE(BASE) = METAPHONE(NAME) THEN 'Match' ELSE 'No Match' END METAPHONE, SCHEMAPT.PT_METAPHONE(BASE) BASE_MP, SCHEMAPT.PT_METAPHONE(NAME) NAME_MP, SOUNDEX(BASE) BASE_SX, SOUNDEX(NAME) NAME_SX, DIFFERENCE( BASE, NAME) DIFFVALUE FROM PHONETIC WHERE TESTNO = 2 ORDER BY NAME;
Our METAPHONE function provides a match on all rows, including those that caused problems for SOUNDEX. The result set also shows the difference in the values returned by the METAPHONE and SOUNDEX functions.
There Is Always A BUT
Although the METAPHONE function solves some of the issues with SOUNDEX, you quickly realize that METAPHONE also stricter that SOUNDEX. When we run the same statement against test 1 (my name), we see that SOUNDEX returns more hits:
Roll Your Own
There is no perfect solution! But, since you are writing the code, you can program in any combination you wish. There is nothing to stop you from changing or adding to the algorithm. A change to two lines would include more hits in the previous example
And what about languages other than English? All of the extra phonetic algorithms have open source projects for different languages — most of which are available on Git hub.
Although phonetic functions are not something that I use on an everyday basis, they are something that I include as part of a “fuzzy” search wherever names are listed. I hope you find a use for them — they are fun!
Paul Tuohy, IBM Champion and author of Re-engineering RPG Legacy Applications, is a prominent consultant and trainer for application modernization and development technologies on the IBM Midrange. He is currently CEO of ComCon, a consultancy firm in Dublin, Ireland, and partner at System i Developer. He hosts the RPG & DB2 Summit twice per year with partners Susan Gantner and Jon Paris.