Guru: Searching Teraspace
March 5, 2018 Jon Paris
In my previous tip, Sorting In Teraspace, I looked at how the qsort API can be used to sequence data in ways that SortA can’t begin to approach. Now that the data is sorted, one of the things I may want to do with it is to search it. In this tip I will look at how that can be achieved using the bsearch API. The approach I will be discussing allows for partial key searches and handling the fact that such a search may result in multiple hits.
As you will see when we work through the code, bsearch works in a very similar manner to qsort in that you supply a call-back subprocedure that makes the comparisons between the search value and the candidates. Just as with qsort, your procedure simply has to determine whether the candidate array entry is higher, lower, or equal to the search value. If your response is anything other than equal, then bsearch uses your high/low response to determine the next candidate or if all possible candidates have been exhausted and no match found. A match is indicated by bsearch returning a pointer to the matching entry. If no-match is found then a null pointer is returned.
This story contains code, which you can download here.
When I first began working on this code I was planning on allowing the use of an “*” at the end of the search string to indicate a wild card search. Subsequently I decided that there was little point in bothering with the “*” and that I could achieve the desired results simply by constraining the comparison logic to compare only the leading characters of the supplied search value, i.e. by ignoring any trailing spaces in the search string.
bsearch always uses a binary search technique. This means that, in order for it to work correctly, the array must be in the sequence of the search key before the search is performed. If it is not in sequence, bsearch (like %LOOKUP) will not issue any errors but you will almost certainly get a lot of false negatives, i.e., bsearch will indicate that a search key is missing when it is in fact present. It is worth noting that this same behavior can also be exhibited by %LOOKUP when ASCEND or DESCEND is specified on the array definition. If you want to understand how and why this happens, I suggest that you read this article on %LOOKUP that I wrote back in 2002.
Studying The Code
The logic for loading the array from the file is the same as in the previous example, so I am just going to jump straight into describing the search process. You can download the complete code for this tip here if you want to follow along with the full program in front of you.
Once the load and sort processes have completed we enter a DOU loop (A) that allows us to perform multiple searches. We then use the DSPLY operation to ask for a search value, exiting the loop if the keyword “exit” is entered. I chose to make the search string a variable-length field because it makes the logic much cleaner. I can simply use its name in comparisons rather than having to use %SUBST all the time to limit the scope to the length of the entered search string.
At (B) we strip off any trailing spaces from the search string and then capture its length. It is this length that will later on allow us to restrict the scope of the comparison to partial keys.
At (C) we invoke bsearch. The format and values of the parameters are basically the same as for qsort except for the addition of the pointer to the search value as the first parameter. The call returns a pointer which is stored in p_currentCustomer and then tested at (D) to determine if we had a match. We will look at the logic in the comparison routine in just a moment. If the pointer is null we simply display a “No match” message (I).
If we did get a hit, the next part of the logic, beginning at (E), is designed to position us at the first matching entry. If we are simply doing a product code look-up, for example, where found/not found is all we are interested in, this would not be needed. The simple null test would be sufficient. But in our case we are allowing partial key searches and therefore there is always the possibility that multiple elements will match the search request. For example, suppose we have sorted into name sequence and the names Jackson, James, Johnson, and Jones are present. If we were to simply search for “J”, then the pointer returned by bsearch could be to any of these four elements. If we want to process all of the matching elements then we must “walk” back up the list to find the beginning of the group.
If the current record matches the search criteria then we enter the DOW loop and store (F) the current value of the pointer. This is the current candidate as the first matching record. Next we check (G) to see if we are at the start of the list and if so we simply exit the loop. If not then the pointer is moved back to point at the prior entry. In other words, we keep moving back up the list either until we have reached the first element or we no longer have a matching search value.
When we exit the DOW loop we know that p_candidateEntry points to the first matching entry in the list. So it is copied into the record pointer p_currentCustomer (I) and we then loop through and display all of the matching elements in the list.
Dsply ( 'Data is now in Name sequence ...'); (A) DoU forever; // forever is an indicator with a value of *Off Dsply ( 'Search for' ) '' searchData; If searchData = 'exit'; Leave; EndIf; (B) searchData = %TrimR ( searchData ); // Trim trailing spaces compareLength = %Len( searchData ); // And capture length // Initiate the search (C) p_currentCustomer = bsearch( p_searchData : p_customers : count : %Size(customerAddr) : p_searchName ); // Did we find a hit? (D) If p_currentCustomer <> *null; Dsply ('Match found for ' + searchData ); // Locate first entry that matches the search string (E) DoW ( %Subst( customerAddr.name: 1: compareLength ) = searchData ); (F) p_candidateEntry = p_currentCustomer; (G) If p_currentCustomer <> p_customers; // At the start of the array ? // If not then move backwards one element p_currentCustomer -= %Size(customerAddr); // Move back one element Else; Leave; EndIf; EndDo; // First matching element's address is in p_candidateEntry // So now we can display all matching elements (H) p_currentCustomer = p_candidateEntry; // Set pointer to first match DoW %Subst( customerAddr.name: 1: compareLength ) = searchData; Dsply ( 'Name: ' + %TrimR(customerAddr.name) ); p_currentCustomer += %Size(customerAddr); // Move to next element EndDo; Else; (I) Dsply ( 'No match for ' + searchData ); EndIf; endDo;
Time now to take a quick look at the comparison function used by bsearch. As you can see (J) this is very similar to the one we used for qsort. The only real difference is that for qsort the parameters represent the two candidate elements – for bsearch the first is to the search string and the second to the candidate element. You may notice that I put a comment in to highlight that the routine accesses the global variable compareLength. Generally I try to avoid using globals in subprocedures and this is just to remind me that I have used one here. If you are wondering why I did not use %LEN(searchFor) instead of compareLength, it’s because I just felt that it just made the code look ugly and harder to understand. You may well feel differently.
The comparisons begin at (K) and basically follow the same pattern as with qsort. I should point out here that this is another instance where I break my own rules by having multiple RETURN operations. This is normally considered bad practice, but I feel that in this kind of situation it makes the operation and purpose of the code more obvious.
dcl-proc SearchName; // Search by name ( or portion of name ) // - Uses Global variable compareLength (J) dcl-pi *n Int(10); searchFor Like( searchData ); element LikeDS( customerAddr ); end-pi; (K) if searchFor > %Subst( element.name: 1: compareLength ); Return HIGH; elseif searchFor < %Subst( element.name: 1: compareLength ); Return LOW; else; Return EQUAL; endif;
Wrapping Up
When I teach this topic at conferences or in private classes, there are a couple of questions that I am frequently asked, so I thought I would just briefly address them here.
The first question is “Can I use this technique to perform a case-insensitive search?” The answer is “Yes” – providing, of course, that the sort was also done in a case-insensitive manner. There are two main ways to do it.
The first is to use %XLATE (or your API of choice) to unicase the comparison strings. So, for example, the comparison at (K) above could have been written as:
if %Xlate(lower: upper: searchFor) > %Xlate(lower: upper: %Subst( element.name: 1: compareLength ));
But to say this is ugly is to put it mildly. And of course it is very inefficient if you consider the number of times the translation is needed. My preferred approach is to unicase the data as the array is loaded. If I also need the original mixed-case values, for example for display/printing purposes, then I simply add an additional field for each of the required fields. I refer to the unicase versions for sorting and searching and the originals for display purposes.
The second question is “Can I use wildcard characters like in an SQL LIKE clause?” The answer to this one is a definite “No”. You can do partial searches as I have demonstrated in this tip — but they only work because if the whole field is in sequence then by definition the first n characters of the field are also in sequence — and that is the key requirement. Bsearch can only work if the data to be searched can be sequenced in the required order, and with a generic wildcard search that cannot be done.
I suppose if you worked at it you could perform an “ends in” search by reversing the strings and sorting them that way but that is way more work than it is worth I suspect and is not really what bsearch was designed for.
Another common question is whether qsort/bsearch can be used for arrays in descending sequence rather than ascending. The answer is “Yes” – you just have to remember to use LOW when it is high and HIGH when it is low – if you see what I mean! It would also be a really good idea to really heavily comment the code or those who follow you will be truly confused!