A Fuzzy Search Algorithm
August 30, 2002 Timothy Prickett Morgan
Hey, Ted:
Most inquiry programs require a user to enter the exact value or exact beginning portion of a value to be located. Unfortunately such programs do not find the desired data if the user doesn’t remember the exact spelling of the search term. I have an algorithm that allows searching by any portion of a database field, even if the user misspells the search argument.
|
According to this algorithm, the position of each character of the input value is to be compared with the position of the same characters found in the corresponding field from the file. It calculates differences between positions of the matching characters.
I assign a matching rate based on the difference in the number of positions. If a certain character is found in the same position of both the search string and the database field, I assign a matching rate of 100. If the two are one position away from each other, the rate is 90. Here is the complete scale:
Absolute difference between character positions | Matching rate |
0 | 100 |
1 | 90 |
2 | 80 |
3 | 70 |
4 | 60 |
5 | 50 |
6 | 40 |
7 | 30 |
8 | 20 |
9 | 10 |
10 or more | 0 |
These differences are converted into a single matching-rate value that indicates how much the search value and the field value from the file differ from one another.
Consider, for example, a file that consists of two fields: a part number and part description. This file contains three records, as per this example:
Part Number | Part Description |
111 | WIFE |
222 | STOP |
333 | KNIFE |
The user of the search program misspells KNIFE as NIVE and the search begins.
For a search value of NIVE, these are the calculated matching rates:
WIFE | 200 |
STOP | 0 |
KNIFE | 300 |
The more the search argument and database field value match, the higher the score.
— Victor Pisman
Searching a database is far from an exact science, and tools like wild-card matching and SQL’s SOUNDEX function are few and limited in what they can do.
Victor’s source code is shown below. It consists of three objects–an input item master file, an output file to hold the results of the search, and an RPG program to perform the search.
Victor, yours is an interesting algorithm. Thanks for sharing it with us.
— Ted
Input file IMIN
A* IMIN - Item Master File A A R FM$IM A IMITNO 15 A IMDESC 15 A K IMITNO
Output file IMOUT
A REF(IMIN) A R FM$OUT A XXITNO R REFFLD(IMITNO) A XXDESC R REFFLD(IMDESC) A XXRATE 3 0 A K XXRATE
Program IM
* Fuzzy search algorithm * written by Victor Pisman * FIMIN IF E DISK FIMOUT O E DISK A * E AAA 15 1 E BBB 15 3 0 * I 'ABCDEFGHIJKLMNOPQRST-C UC I 'UVWXYZ' * I 'abcdefghijklmnopqrst-C LC I 'uvwxyz' * C *ENTRY PLIST C PARM ##PARM 15 * C LC:UC XLATE##PARM ##PARM * C READ IMIN 99 C *IN99 DOWEQ*OFF C EXSR SEARCH C Z-ADDTTRATE XXRATE C MOVELIMITNO XXITNO C MOVELIMDESC XXDESC C WRITEFM$OUT C READ IMIN 99 C ENDDO * C MOVE *ON *INLR C*********** C SEARCH BEGSR *********** * Get the length of the search argument C MOVEL##PARM ENTRY 15 P C ' ' CHEKRENTRY L 20 * FILL THE ARRAY AAA WITH THE INPUT FIELD VALUE. C CLEARAAA C Z-ADD1 I 50 C 1 DO L I C 1 SUBSTENTRY:I AAA,I C ENDDO * * FILL THE ARRAY BBB. C CLEARBBB C Z-ADD*ZERO BBBTOT C *LIKE DEFN BBB BBBTOT+ 1 C Z-ADD1 I * * CHECK FOR EXACT MATCH FIRST. C ENTRY:L SCAN IMDESC:1 50 C *IN50 IFEQ *OFF NO EXACT MATCH * C 1 DO L I C LC:UC XLATEIMDESC XXNAME C *LIKE DEFN IMDESC XXNAME C AAA,I SCAN XXNAME RATE 30 50 C *IN50 IFEQ *ON C I SUB RATE BBB,I C ADD BBB,I BBBTOT C ELSE C Z-ADD99 BBB,I C ENDIF C ENDDO * C ENDIF EXACT MATCH * * CALCULATE THE NUMBER OF LEADING BLANKS X IN THE INPUT FIELD. C L IFGT *ZERO C BBBTOT DIV L X 30H C ENDIF * * CALCULATE THE ABSOLUTE VALUE OF INDIVIDUAL DIFFERENCES. C Z-ADD1 I C 1 DO L I C BBB,I IFNE 99 C SUB X BBB,I C ENDIF C BBB,I IFLT *ZERO C MULT -1 BBB,I C ENDIF * C BBB,I IFLT 10 C MULT -10 BBB,I C ADD 100 BBB,I C ELSE C Z-ADD*ZERO BBB,I C ENDIF * C ENDDO * CALCULATE TOTAL MATCHING RATE C XFOOTBBB TTRATE 50 * C ENDSR
Sponsored By COMMON |
REGISTER FOR COMMON IN DENVER, OCT. 13-17 Get the IT training you need by attending COMMON Users Group’s Fall 2002 IT Education Conference & Expo, October 13-17 in Denver. Early Bird registration is $1,150 until September 4. Choose from over 720 sessions and labs covering a wide range of industry topics. Also receive training from J.D. Edwards, MAPICS, and other vendors. Don’t miss out! Go to www.common.org |