Guru: Sorting In Teraspace
January 29, 2018 Jon Paris
In an earlier tip, Teraspace To The Rescue, I discussed how to use teraspace in RPG to permit the storage and manipulation of data that requires more storage than RPG’s 16MB limit. In this tip I am going to discuss how such data can be sorted. Hint: SORTA won’t work!
Before I get to that though, there is one thing I should mention. In that original tip I said: “For reasons I have never understood, DEALLOC will not null your pointer. So it is a good idea to do it yourself. . . .” IBM’s Barbara Morris pointed out, in a comment to that tip, that DEALLOC does actually offer that feature via the op-code extender “N”. Apparently I “missed the memo” on that particular feature. So, had I coded DEALLOC(N), I would not have needed to manually null the pointer. Of course, any related pointers, such as p_currentPosn in the original code, still need to be manually nulled.
This story contains code, which you can download here.
OK — on to sorting things out. I hinted earlier that SORTA can’t help you here and, if you think about it, that makes perfect sense because RPG would not allow us to define the data as an array in the first place. However, we can use the qsort() API. While this API can look a bit intimidating when you first encounter it, it is really quite simple to use. Even better, once you’ve used it, all future uses are pretty much just clones. I described the basic usage of this API in a short tip back in 2012, but I’ll cover a few more details pertinent to this example later in this discussion.
For the purposes of my example, I will be using a data structure based on the records in a simple Customer file. You can see the definition of the DS in the RDi Outline View image here.
I chose this example so that I could demonstrate both single- and multi-key sorts. It is worth noting at this point that the technique shown here of using a DS defined with the LIKEREC option is basically the same as the one I use to provide sorting capabilities for 5250 subfiles. Combine these two concepts and you can provide sorting capabilities for subfiles of almost unlimited size.
As you will see if you download and study the code, the program simply loads every record from the file into my teraspace “array”. Once loaded, qsort is used to sequence the array. The results are then displayed and another sort performed.
Let’s take a look at the code, beginning with the creation and loading of the array. Since the majority of this code is much the same as in the previous tip, I will just cover the highlights.
The program variable currentmax is used to monitor the maximum number of records which our current memory allocation can accommodate. We begin (A) by using that value, multiplied by the record size, to allocate an amount of storage large enough to contain this number of records. The resulting pointer is stored in p_customers, which will always contain the storage location for the first record in the array. Its value is then copied into p_currentCustomer, which is the basing pointer for the record DS.
Once that basing pointer has been set, I can read the file (B) specifying the customerAddr DS as target for the data. The first record having been read, I can now enter the read loop that will load the rest of the data. Obviously you can apply any selectivity you like when loading the data. For that matter you could use SQL rather than a native READ operation.
Within the read loop, I check if there is sufficient space available for another record in the current allocation (C). If there is then the pointer is simply incremented to the next position. If there is not, then the maximum is increased and a %REALLOC used to get the necessary additional storage. The critical part of this operation is identified at (D) where the pointer value for the current record is recalculated. As I noted in the earlier tip, this is vital because there is no guarantee that when we reallocate the memory the starting address (p_customers) will be the same as it was before. Don’t worry, even if the memory location changes, the system will have copied your data to the new position for you so you never lose anything. However, the value presently in p_currentCustomer is to a location in our previous memory allocation so it must be recalculated before we can add any more records.
// Allocate enough storage for current maximum (A) p_customers = %Alloc( %Size( customerAddr ) * currentMax ); p_currentCustomer = p_customers; // Set current position to beginning (B) Read Customers customerAddr; // Loop priming read // Load data from file (or any other source) DoW Not %EOF( Customers ); count += 1; // Increment record count (C) if count < currentMax; // Still space remaining ? // If so advance to next element p_currentCustomer += %Size( customerAddr ); else; // Otherwise make room for more by increasing max size and then // reallocating storage. // Note: p_bigDS may now point to different storage so you MUST // recalculate the current customer position pointer currentMax += INCREMENT; p_customers = %ReAlloc( p_customers: %Size( customerAddr ) * currentMax ); (D) p_currentCustomer = p_customers + (%Size( customerAddr ) * ( count )); endif; (B) Read Customers customerAddr; // Add next address to array EndDo; Dsply ( %Char(count) + ' records loaded');
Once all the records have been loaded the first sort operation is performed (E). We pass qsort the address of the start of the data ( p_customers ), the number of entries in the “array” ( count ), the length of each entry ( %Size(customerAddr) ), and finally the procedure pointer ( p_SortCityState ) which identifies the procedure for the sort – in this case to sort the data in City within State. We will look briefly at that procedure in a moment.
The reason for using a procedure pointer is because we need to provide logic to control the processing of the qsort utility function. Unlike SORTA, qsort knows, and assumes, nothing about your data. Its sole purpose in life is to move blocks of memory around so that they are in a sequence that is determined by your logic.
Here’s how it works. Once invoked, qsort calls the routine you identified (via the pointer p_SortCityState ) and passes it a pair of elements from the array. Your logic then simply has to determine which of the two is higher in the desired sort sequence (or lower, of course, if descending sequence is what you want). It then returns the answer (HIGH,LOW, or EQUAL) to qsort, which will then pick another candidate pair and call your code again. This continues until qsort determines that everything is in sequence, and at that point it returns control to your mainline code. We’ll look at the code for the sequencing procedure a little later.
While unusual in an RPG environment, this kind of “call-back” processing, as it is known, is commonly used in other programming languages. It provides a much more flexible approach to controlling the behavior of a utility function than the more traditional approach of using exit programs. RPG’s only native use of this approach is with the %HANDLER options of XML-INTO and XML-SAX.
Once the sort has been completed, the routine DisplaySortResults is called to demonstrate that the correct sequence has been achieved. The program then goes on to sort the array into Name sequence and display those results. Needless to say, if you decide to run this program against your own data I suggest that you not use too large a file or you’ll be waiting a long time to see all the results flash by!
// Sort array into City/State sequence (E) qsort ( p_customers: count: %Size(customerAddr): p_SortCityState ); Dsply 'Data is in City/State Sequence ...'; DisplaySortResults(); // Sort array into name sequence qsort ( p_customers: count: %Size(customerAddr): p_SortByName ); Dsply 'Data is in Name sequence ...'; DisplaySortResults();
Time now to take a look at the logic for the City/State sequencing routine (F). If you were to look at the technical specifications for the qsort sequencing routine, you would see that it is defined as taking two parameters one for each of the two elements that it is to compare. Each parameter is a pointer and is passed by value. So technically speaking the procedure interface should look like this:
dcl-proc SortCityState; dcl-pi *n Int(10); p_element1 Pointer Value; p_element2 Pointer Value; end-pi; dcl-ds element1 LikeRec( custRec ) Based( p_ element1 ); dcl-ds element2 LikeRec( custRec ) Based( p_ element2 );
Notice that this approach forces us to manually define data structures for each of the elements, and to use the passed pointers as their basing pointers – a bit clumsy, and not at all RPG-like. Instead I have “cheated” and used my knowledge of how subprocedure parameters are actually passed to allow myself to code this in a much simpler and cleaner fashion. I personally think it is also far easier to understand. Here’s the version I actually used:
(F) dcl-proc SortCityState; dcl-pi *n Int(10); element1 LikeDS(customerAddr); element2 LikeDS(customerAddr); end-pi;
So how and why does this substitution work? Simply put, under the covers, passing a pointer containing the address of a field by value, has the same effect as passing that field by reference. Both of them result in a pointer, containing the address of the data, being placed on the stack. If you want to understand the mechanics behind this a little better then you can read my tip “Parameter Passing Fundamentals Of Programs Versus Procedures” and in particular the section headed “Why the Difference.”
Regardless of which way we code the procedure interface, the basic logic for the sequencing is the same. We begin by comparing the high order keys (in this case the state) and notify qsort via the RETURN value whether element 1 is higher (HIGH – which has the value 1) or lower (LOW – the value -1) in the sequence than element2. Note that HIGH, LOW and EQUAL are named constants in our procedure. You have probably guessed that the value of EQUAL is zero.
In the event that the State is the same in both elements, then the comparison moves on to the City. Hopefully you can see that if in future we were to need to add (say) the customer name to the sequence then we would simply insert the appropriate comparisons between element1.name and element2.name _before_ the RETURN EQUAL operation.
// Sequence by City in State if element1.state > element2.state; Return HIGH; elseif element1.state < element2.state; Return LOW; elseif element1.city > element2.city; Return HIGH; elseif element1.city < element2.city; Return LOW; // Add any additional low-order tests here ... else; Return EQUAL; endif;
As you can see, coding these sequencing procedures is very simple, and that makes it easy to add additional sequences to meet user’s requests. It would, for example, be possible to add a case-insensitive sort simply by unicasing the fields as part of the comparison. It would not be the most efficient way to do it, but it would work.
Next Time
In the next tip I will discuss how the bsearch API can be used to perform the equivalent of %LOOKUP on our teraspace arrays and in particular how we could implement partial key searches. In the meantime, if you have any questions please let me know via the comments section.
RELATED STORIES
Parameter Passing Fundamentals Of Programs Versus Procedures
Sorry if this is a duplicate, first machine was giving me 400 bad requests
When your arrays are pointer based, the offset to the first array element is zero, so, shouldn’t:
(D) p_currentCustomer = p_customers
+ (%Size( customerAddr ) * ( count ));
be:
(D) p_currentCustomer = p_customers
+ (%Size( customerAddr ) * ( count-1 ));
Love the article, though I store the data in the heap, and use an array of pointers. This makes sorts faster (and inserts, if you’re going to
be maintaining sorted arrays, which you have to to use bsearch) and allows me to store dynamic data structures, varying lenght fields, etc.
Hi Chris, sorry for the delay in responding but for some reason I don’t get notified when comments are posted.
I think (hope? ) the part you are missing is that the count value contains the number of entries already loaded into the array. Since the pointer is being set for the _next_ record the use of count * size is correct.
Just to be sure I ran it in debug and used the bigview field to make sure I wasn’t leaving any “gaps” and sure enough the code works just fine.
I’d be interested in knowing more about the technique you use. I’m having trouble seeing how storing an array of pointers helps since (unless you also store a key with them) sorting them doesn’t achieve anything. As to adding elements, I normally do it by simply adding them to the end and sorting the array again.
I thought for a bit about why I might want an array of pointers, and it does make sense. Say I want to sort on the 10 bytes after KEY=, which can start anywhere in an element, which may be of multiple different lengths. I can pile my data in my teraspace, sticking the pointer to the start of each element in my array.
The call back routine does not require that what it sorts on actually be in the array. It can retrieve the pointed to data for each element (not slow at all), search for the KEY=, and then compare the retrieved fields to decide if the elements are high, low, or equal. (this part might not be all that fast, although I bet you it is not that slow.) So you can leave the data unsorted in the teraspace as one lump after another, and have an array of pointers in the desired order even though the key field is not in a fixed position in the elements.
I frequently use QSORT to sort arrays returned by external stored procedures. In this case, my parameter for the elements being compared in my call back routine are declared as LIKEDS the array element, and I have another two data structures with the same field names with the fields I want to sort on. EVAL_CORR loads them up (as long as no negative numbers–I have to get cute for those) for both elements, and then I can just compare the comparison structures. Sometimes I get cute: I make an entry sort to the top even though it is not the first alphabetically. All you have to do is be able to determine the low, equal, high numbers in some way.