DB2 For i, Java, And Regular Expressions

January 29, 2014 Michael Sansoterra

In today’s world, IBM i developers often find themselves working with more than the relational data stored inside their trusty DB2 database. Coders can find themselves tasked with parsing many kinds of text data including: SMTP email, file contents of an IFS Folder, Java Properties files, free form comments imported from other systems, OCR data for invoices, resumes, XML, HTML, code fragments (RPG, COBOL, C, Java, SQL, VBA), and application logs.

While there are many tools available to examine plain text, a regular expression engine provides one of the best generic text processing engines available.

Yipes, it’s been over 10 years since my first set of articles on integrating Java regular expressions with SQL user-defined functions was published. That explains the hair loss and color change! It’s probably a good time to revisit the topic and introduce a few additional concepts about how useful regular expressions can be.

In case you need a review on what regular expressions can do, or how to combine Java regular expressions and SQL user-defined scalar functions, you might want to check out a couple of my past articles: Empower SQL with Java User-Defined Functions and Staggering SQL String Handling with Regular Expressions

The download for this tip’s Java source code is here. This tip’s code will simply add to the UDFs.java source file published with the prior tips. Two additional things Iâ€™d like to introduce in this tip are “named capturing groups,” which was available starting in Java 7 on IBM i 7.1 – licensed product 5761JV1 options 14 (32-bit) and 15 (64-bit), and user-defined table functions (available since V5R2). I’ll explain more on these concepts in a bit.

For those RegEx pros who want to skip reading and get right to the heart of the matter, in addition to the functions published in the prior tips, the newly added SQL user-defined functions are:

re_Split_TF–This is a table function that will split text based on a regular expression pattern and return each “split” value as its own row. This is similar to the Java and .NET string “split” methods. This function is a companion to the prior re_Split scalar function, which is somewhat limited because it can only return a single value per invocation.
re_Named_Capture–This scalar function will accept text, a regular expression pattern, and the name of a capturing group and return the first occurrence of the named group’s matching values. This function has three parameters: text data, regular expression pattern, and the name of the capturing group. An overloaded second instance of the function has an extra parameter that will allow the developer to choose which occurrence of the pattern to return.
re_Named_Capture_TF–This table function is like the scalar function except it will return all occurrences of the requested named capture group as individual rows in the result set.

The full instructions on how to compile the java UDFs program and how to issue the SQL CREATE FUNCTION statements are in the source code. Make sure the compiled java class is placed in the following special DB2 for i folder so that it can be accessed by DB2: /qibm/userdata/os400/sqllib/function. If you decide to add a package name (the sample code uses the default package), the EXTERNAL PROGRAM value on the CREATE FUNCTION statement will need to be changed to include the package name.

The re_SPLIT_TF User-Defined Table Function

Table functions are used to return many pieces of information in a tabular format. The original RegEx articles only included scalar functions that only return a single value. However, if you need to return an unknown number of values from a regular expression match or split, then the table function is the better tool.

The re_SPLIT_TF table function is used to split text into rows based on a RE pattern. The two parameters for this function are the text to be processed and the RE pattern. In this example, which only works in free-form SQL tools like System i Navigator’s RunSQL scripts, a line feed character (EBCDIC 0x25) is supplied as the RE pattern to use to break apart the supplied text line by line:

SELECT *
  FROM TABLE(QGPL.re_SPLIT_TF('Line 1
Line 2
Line 3
',X'25')) TEMP;

The query returns:

RowNo

Value

Line 1

Line 2

Line 3

The RowNo column is created by the table function to keep track of the relative position of the text as it is split. Each of the new table functions features this auto numbering column.

Named Capturing Groups

While the concept of RE capturing groups isn’t new, the ability to reference a capturing group by “name” is new to Java in version 7. Loading the Java 7 SDK is a prerequisite to using this code. Named capturing groups provide developers an easy way to extract a piece of information by name instead of ordinal out of plain text. The syntax for naming a capturing group within a RE pattern is as follows:

 (?<name>X) where X is the RE pattern to capture

Say you’re mining text documents for invoice numbers formatted as “INVnnnnn” followed by an order number formatted as “ORDnnnnn” or “CMnnnnn” (CM=credit memo). The sample document text may look like this:

This letter is in regard to invoice INV01432 (from store.com order ORD15423) blah blah blah

You come up with a RegEx pattern to scan for the invoice and order number text as follows:

 (INVd{5}).*?(ORDd{5}|CMd{5})

This pattern has two capturing groups, each enclosed in parenthesis:

INVd{5}
ORDd{5}|CMd{5}

With Java regular expressions, the text identified by these groups can be accessed by ordinal number (with ordinal zero representing the entire pattern). Specifying a capture group of 1 from the sample document text above will return the invoice number (INV01432). Specifying a capture group of 2 from the sample text will return the order number (ORD15423).

With named capturing, names can now be assigned to each group so you don’t have to worry about referencing ordinals (which can change) when capturing text from a particular pattern. Here, the capturing groups are assigned the names “invoice” and “order”:

 (?<invoice>INVd{5}).*?(?<order>ORDd{5}|CMd{5})

Using the new re_Named_Capture function, you can extract text as follows:

SELECT QGPL.re_Named_Capture(
'This letter is in regard to invoice INV01432 (from store.com order
ORD15423) blah blah blah' ,'(?<invoice>INVd{5}).*?
(?ORDd{5}|CMd{5})','invoice') AS Invoice
FROM SYSIBM.SYSDUMMY1

The query returns the value of the invoice number: INV01432. Substituting the third parameter (capturing group name) with ‘order’, the function will return ORD15423. The capturing group names are case sensitive.

If you’re expecting a capturing group to find only one occurrence, or if you’re only concerned with a specific single occurrence, then the scalar function re_Named_Capture should be used within your code.

However, in some cases, a capturing group may capture many values from the input text. In this case, the table function, re_Named_Capture_TF, will allow the capture of many values with each value returned as a row in a result set. For example, let’s say you need to mine plain text for email addresses. You invent your own or search online to find this RE pattern.

^[_A-Za-z0-9-+]+(.[_A-Za-z0-9-]+)*
      @[A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})$;

The first line of the pattern retrieves the email info and the second line retrieves the domain name. Using this pattern (modified just a tad), this next SQL statement uses re_Named_Capture_TF to extract all of the email addresses out of the text:

SELECT * FROM TABLE(re_Named_Capture_TF(
'Please copy the following people:
joe.shmoe@abc.com. john.doe@xyz.com, jane_pain@ghi.net.  Also, CC:  
mike_sanso@sql.tv.  For more info, visit: http://mydata.net/',

'(?<email>([A-Za-z0-9-+_]+(.[A-Za-z0-9-]+)*@
(?<domain>([A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})))))',

'email')) x

This query returns the following result:

joe.shmoe@abc.com
john.doe@xyz.com
jane_pain@ghi.net
mike_sanso@sql.tv

Substituting a different capturing group name, the domain name is returned instead of the full email address:

SELECT * FROM TABLE(re_Named_Capture_TF(
'Please copy the following people:
joe.shmoe@abc.com. john.doe@xyz.com, jane_pain@ghi.net.  Also, CC:  
mike_sanso@sql.tv.  For more info, visit: http://mydata.net/',

'(?<email>([A-Za-z0-9-+_]+(.[A-Za-z0-9-]+)*@
(?<domain>([A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})))))',

'domain')) x

That will return the following list:

abc.com
xyz.com
ghi.net
sql.tv

Back Reference–A Cool Feature

One thing to note about named capturing groups in regular expressions is that you can also reference a prior pattern match within an expression. For example, say you need to do a quick and dirty parse of an HTML snippet, trying to find the ALT attribute of all image (<img>) tags. Using the ibm.com/db2 web page’s HTML as an example, here is a typical image tag with the alternate text (ALT) specified:

With HTML attributes, either single or double quotes are allowed as delimiters. The following RE pattern can be used to get the entire image tag, the type of quote delimiter used with the alt attribute and the alt attribute value itself:

(?<imgtag>(<img.*alt=s*(?<quote>("|'))
(?<altvalue>.*?)(k<quote>).*/>))

Of particular importance is the “quote” named capturing group that determines whether a single or double quote is used. A back reference (k<quote>) is specified to make sure the end of the attribute is terminated with the same quote character used at the start. In other words, the k<quote> capturing group says “find the same value identified in the quote capturing group.”

Many text documents (HTML, XML, Java Property Files, etc.) have special tools that can be used to programmatically extract information from them. However, depending on the task at hand, sometimes it’s a burdensome coding chore to break out the “right tool” for what might be a one-time code execution or a job fit for a “simple” database retrieval task. While regular expressions aren’t specifically designed to do XML/HTML parsing, they can nevertheless be used to aid with quick and simple queries.

Keep in mind that these examples use Java’s RE engine. Many other RE implementations exist (PHP, .NET, C, etc.) but they’re not always 100 percent compatible.

In summary, named capturing groups are handy because you can embed several capturing instructions within a single pattern and pull out the piece you need by name. Further, using table functions will allow queries to pull all values from a RE operation and return them in a result set. Regular expressions can parse and validate all kinds of text including phone numbers, mailing addresses, serial number formats, email addresses, etc., so integrate them in your applications and reap the benefits of this powerful text processing tool.

Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page.

Table of Contents

Content archive

Recent Posts

Subscribe

Pages

Search