Guru: An Introduction to Processing XML With RPG, Part 3
September 5, 2018 Jon Paris
Author’s Note: In part 1 and part 2 of this XML series, I introduced you to the basics of using RPG’s XML support. In this tip we begin to explore some of the challenges that you may face when processing commercial XML documents, and the support XML-INTO offers to handle them. In particular we will be reviewing how to ignore parts of the document through the use of the path= %XML option. We will also review how to handle XML documents that make use of namespaces and how to handle XML element names that include characters that are not valid in RPG field names.
To begin, take a look at the snippet of an XML document below:
<ItemsXRef>
  <Header RefId="xxxxx" TimeStamp="2011-11-30T00:06:06.643Z">
    <to id="nnnnnn" name="nnnnnn"/>
    <from id="nnnn" name="A Company in Canada"/>
    <TransactionType>ItemXref</TransactionType>
  </Header>
  <Items>
    <SKU>
      <SKUID>10050322</SKUID>
      <UPC>6866261486</UPC>
      <WIN>30269675</WIN>
      <StatusCode>A</StatusCode>
    </SKU>
    <SKU>
      ... <snip> ...
    </SKU>
    <EnterpriseCode>CAN</EnterpriseCode>
    <RecordCount>46</RecordCount>
  </Items>
</ItemsXRef>
As you can see, it is not a particularly complex document, but it demonstrates a common issue: part of the data (the <Header> section in this case) is of no use to us, yet we would have to code data definitions for it only to ignore the results. Here that would not be an arduous task, but with industry-standard documents the required definitions can be far more complex, and at the end of the day they would still be “thrown away.”
The magic to processing documents such as this is to make use of the path= processing option of the %XML built-in function. This allows us to specify the position in the document at which the XML parser should begin its work, thereby skipping over unwanted elements.
This story contains code, which you can download here.
If you study the hierarchy of the sample XML above you will see that at the top level we have the <ItemsXRef> element. This in turn has two child elements, <Header> and <Items>. It is <Items> that contains the actual data that we want to process, namely the SKU, EnterpriseCode, and RecordCount elements. <Header> also has subordinate elements but they are of no interest to us.
To ignore the header information, all we need to do is direct the parser to begin its work only when it reaches the <Items> element. We can do this by specifying a path that details the nodes to be traversed in order to arrive at the first element that we actually want. These paths are specified just as a path in the IFS (or for that matter on a Windows or Mac system) would be. The only difference is that we are referencing the element name hierarchy rather than a directory name hierarchy.
So to tell the parser to start processing at the <Items> element we simply use the directive path=ItemsXRef/Items. Note that the actual element name beginning (<) and ending (>) markers are not included in the path specification.
This particular document was submitted by a reader, and in their case they were only interested in the SKU element details and did not need to capture the EnterpriseCode and RecordCount data. So they simply had to have the path “dive down” one more level to position to the first of the SKU elements. This was achieved by simply adding /SKU to the end of the path, so the full path directive became path=itemsXref/Items/SKU. You can see it in action at (C) below.
Since we are only interested in the SKU data, the required data definitions are also really simple. They consist of a simple DS array as you can see at (B) below. Because we are targeting an array, we are able to take advantage of the RPG supplied element count in the PSDS (A).
The resulting program (XMLPATH1) is, as you can see, very simple indeed.
       Dcl-Ds progStatus psds;
(A)      xmlElements Int(20) Pos(372);
       End-Ds;

(B)    Dcl-Ds sku Dim(9999) Qualified;
         skuid      Char(15);
         upc        Char(15);
         win        Char(15);
         statuscode Char(1);
       End-Ds;

       Dcl-S pause Char(1);

       // Change the file path to match where you placed the XML
       xml-into sku
(C)      %xml('ITEMXREF.xml':
              'case=any doc=file path=itemsXref/Items/SKU');

       Dsply ( %Char(xmlElements) + ' SKU records loaded.' ) ' ' pause;

       *inlr = *on;
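To show why the element count matters, here is a minimal sketch of what might follow the XML-INTO. It repeats the XMLPATH1 declarations so that it stands alone; the work fields i, loaded, and activeCount, and the idea of counting SKUs with a status of A, are my own illustrative assumptions rather than part of the reader's program.

```rpgle
**free
// Sketch only: declarations repeat XMLPATH1; the loop is illustrative.
Dcl-Ds progStatus psds;
  xmlElements Int(20) Pos(372);   // Count supplied by RPG after XML-INTO
End-Ds;

Dcl-Ds sku Dim(9999) Qualified;
  skuid      Char(15);
  upc        Char(15);
  win        Char(15);
  statuscode Char(1);
End-Ds;

Dcl-S i           Int(10);         // Hypothetical work fields
Dcl-S loaded      Int(10);
Dcl-S activeCount Int(10) Inz(0);

// Change the file path to match where you placed the XML
xml-into sku %xml('ITEMXREF.xml':
                  'case=any doc=file path=itemsXref/Items/SKU');

// Copy the count to a work field before doing anything else
loaded = xmlElements;

// Only the first "loaded" array elements contain data from the document
For i = 1 to loaded;
  If sku(i).statuscode = 'A';      // 'A' is the value seen in the sample
    activeCount += 1;
  EndIf;
EndFor;

Dsply ( %Char(activeCount) + ' of ' + %Char(loaded) + ' SKUs have status A.' );
*inlr = *on;
```

Limiting the loop to the loaded count avoids processing the unused remainder of the 9,999 element array.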
You may be wondering what changes would be required had the reader also needed to extract the EnterpriseCode and RecordCount data. It is really very simple, but it does require that we apply the countprefix option that I described in XML-INTO part 2 and, of course, we will also need to modify the path= value.
Here is the modified version of the program (XMLPATH2I) that demonstrates the data declarations and additional logic that is needed. The major changes are:
- Addition of the count_sku field (D). This is needed since we can no longer use the RPG supplied count in the PSDS.
- Addition of the countprefix= option to %XML (E) and, of course, modification of the path= value.
       Dcl-Ds items Qualified;
(D)      count_sku Int(5);
         Dcl-Ds sku Dim(9999);
           skuid      Char(15);
           upc        Char(15);
           win        Char(15);
           statuscode Char(1);
         End-Ds sku;
         enterpriseCode Char(3);
         recordCount    Int(5);
       End-Ds items;

(E)    xml-into items %xml('ITEMXREF.xml':
                'case=any doc=file countprefix=count_ +
                 path=itemsXref/Items');

       If items.count_sku = items.recordCount;
         Dsply ('Counts match - ' + %Char(items.recordCount)
                + ' processed' );
       Else;
         Dsply ('Count Mismatch - Actual: ' + %Char(items.count_sku)
                + ' Expected: ' + %Char(items.recordCount));
       EndIf;
Handling Namespaces
I haven’t got time to go into all the whys and wherefores of namespaces. For now let’s just say that they allow you to avoid name collisions by qualifying element and attribute names. In this respect they are similar to adding the QUALIFIED keyword to an RPG DS in that you can now have multiple fields with the same name and use the DS name to qualify which one you mean. Since you will encounter them in many documents it is important to understand the options you have in RPG for dealing with them. (You can learn more about namespaces here.)
This XML extract, from the document we will be using, shows the simple usage of a namespace.
<p400:OrderDetail p400:OrderNumber="12345" p400:Date="2015-11-14"
                  xmlns:p400="http://partner400.com">
  <p400:Address p400:Type="Bill">
    <p400:Name>James Smith</p400:Name>
    ....
Notice that all the element names are prefixed by the characters “p400:”. This is the shorthand notation for the namespace and it is associated with a URI. In this case that association is made through the attribute xmlns:p400="http://partner400.com". As I discussed earlier in the series, in order for XML-INTO to process the XML, the element and attribute names in the document must match the names and hierarchy in the RPG data structure used to receive the data. Teeny tiny problem: The colon (“:”) is not a valid character in an RPG name.
RPG’s solution is to provide options to either remove the namespace prefix or to convert the colon to a valid RPG name character. This is done via the ns option. Specifying ns=remove as an option to the %XML BIF will simply strip the namespace qualifier completely. ns=merge on the other hand will retain the qualifier and convert the colon to an underscore.
Below are extracts from the sample programs that demonstrate the two options. The first example shows how the DS needs to be structured when the ns=merge option is specified.
       Dcl-Ds p400_OrderDetail Qualified;
         p400_OrderNumber   Char(5);
         p400_Date          Char(10);
         count_p400_Address Int(5);
         Dcl-Ds p400_Address Dim(2);
           p400_Type   Char(4);
           p400_Name   Char(40);
           p400_Street Char(40);
           ....

       xml-into p400_OrderDetail
         %xml( xmlFileName: 'case=any doc=file ns=merge ');
This second version shows the changes required when ns=remove is used.
       Dcl-Ds OrderDetail Qualified;
         OrderNumber   Char(5);
         Date          Char(10);
         count_Address Int(5);
         Dcl-Ds Address Dim(2);
           Type Char(4);
           ....

       xml-into OrderDetail
         %xml( xmlFileName: 'case=any doc=file ns=remove');
So which option should you choose? In most cases the remove option will be the best choice as it simplifies the variable names, which makes them both easier to type and easier to read. Of course this only works in cases where there is no duplication of names within the document. Luckily that is normally the case. This option also simplifies the processing of standardized XML documents that only differ in the namespace used. But there is a drawback to this.
Suppose that you receive the same basic document from two or three different suppliers. The remove option allows you to use the same data structures etc. to process the document, but what if you need to vary the processing based on the source of the document? Luckily IBM thought of this and supplied us with an option to capture the value of the namespace that is being stripped off. You do this by specifying the nsprefix option. This operates in a very similar fashion to the countprefix option, which we covered in part 2. If we specify nsprefix=ns_ then by adding a field to the receiver DS with the appropriate name we can capture the namespace that was removed from the corresponding element.
In the example below I have modified the %XML BIF to include the nsprefix option and added the variable ns_OrderNumber (F) in order to capture the prefix associated with the OrderNumber element. Notice that as with the count prefix option the variable to receive the prefix must appear at the same level in the hierarchy as the element whose prefix is to be captured.
       Dcl-Ds OrderDetail Qualified;
(F)      ns_OrderNumber Char(4);
         OrderNumber    Char(5);
         Date           Char(10);
         ....

       xml-into OrderDetail
         %xml( xmlFileName: 'case=any doc=file ns=remove nsprefix=ns_');
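Once the prefix has been captured, varying the processing by document source is ordinary RPG logic. The sketch below is one possible approach, not code from the sample programs: the procedure supplierFromPrefix, the supplier names, and the second prefix 'abc' are all hypothetical, and it assumes the prefix arrives in the field exactly as it appears in the document.

```rpgle
**free
Ctl-Opt DftActGrp(*No);

Dcl-S supplier Char(20);

// In the real program the argument would be OrderDetail.ns_OrderNumber
supplier = supplierFromPrefix('p400');
Dsply supplier;
*inlr = *on;

// Hypothetical helper: maps a captured namespace prefix to a supplier name
Dcl-Proc supplierFromPrefix;
  Dcl-Pi *n Char(20);
    nsPrefix Char(4) Const;
  End-Pi;

  Select;
    When nsPrefix = 'p400';
      Return 'Partner400';
    When nsPrefix = 'abc';         // Made-up second supplier
      Return 'Supplier ABC';
    Other;
      Return 'Unknown';
  EndSl;
End-Proc;
```

Keeping the mapping in one small procedure means the data structures stay identical for every supplier and only this routine needs to change when a new source is added.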
Handling Element Names Containing “Illegal” Characters
There is one more processing option that we haven’t discussed that you will find you need quite often. Just as the colon is not valid in an RPG name, nor are a number of other characters that are valid in XML names. In North America, and most English speaking countries, the most common one that you will encounter is the hyphen. While names including a hyphen are valid in COBOL (for example) they are not valid in RPG. XML-INTO deals with this by allowing us to specify an extension to the case processing option which will cause any hyphens to be converted to underscores. So with case=convert specified, an element name such as Street-Address would be converted to Street_Address as an RPG name.
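As a minimal sketch of the hyphen handling, the program below parses a small XML string rather than a file. The document content, the DS name customer, and the field names are hypothetical and exist purely to illustrate the option.

```rpgle
**free
// Sketch only: the XML literal and all names here are hypothetical.
Dcl-Ds customer Qualified;
  name           Char(40);
  street_address Char(40);         // Receives <Street-Address>
End-Ds;

Dcl-S xmlDoc Varchar(200);

xmlDoc = '<Customer>'
       + '<Name>James Smith</Name>'
       + '<Street-Address>12 Main Street</Street-Address>'
       + '</Customer>';

// doc=string is the default, so the first parameter is the XML itself
xml-into customer %xml(xmlDoc: 'case=convert');

Dsply customer.street_address;
*inlr = *on;
```

Note that case=convert takes the place of case=any here, since it is specified through the same case option.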
Outside of the English speaking world it is also possible to have XML element names that include accented and other characters that are not valid in RPG names, for example é, ö, and á. With the convert option in use, these would all be converted to their unaccented upper case equivalents, i.e., E, O, and A. But in some cases there is no simple upper case equivalent, for example the “double” characters æ and œ. In these cases the characters are replaced by underscores.
It may occur to you at this point, that by the time all these conversions are completed, the resulting name may contain multiple consecutive underscores. That would make them very hard to type accurately since they would tend to appear on the screen as a single long line. XML-INTO deals with this by converting all consecutive underscores to a single one. Thus A___B__C would become A_B_C. In theory this could result in duplicate names but it is highly unlikely to occur in practice.
Wrapping Up . . .
There are a few other XML-INTO options that you may have need of from time to time, but hopefully I have covered all of the ones you are likely to need in your daily work. So that you can “play” with the code I have also included the XML used in the examples. You can download the zip file here.
Jon Paris is one of the world’s foremost experts on programming on the IBM i platform. A frequent author, forum contributor, and speaker at User Groups and technical conferences around the world, he is also an IBM Champion and a partner at Partner400 and System i Developer. He hosts the RPG & DB2 Summit twice per year with partners Susan Gantner and Paul Tuohy.