Native Regular Expressions In DB2 For i 7.1 And 7.2
May 19, 2015 Michael Sansoterra
Blast it! Another suite of custom code I have written and used over the years has recently been deprecated (or partially deprecated) by IBM. The good news is that regular expressions (abbreviated RegEx) are now a native part of DB2 featuring one new predicate (REGEXP_LIKE) and four new scalar functions: REGEXP_COUNT, REGEXP_INSTR, REGEXP_SUBSTR, and REGEXP_REPLACE. For IBM i 7.1, TR9 must be installed and for IBM i 7.2, TR1 must be installed. The new Regular Expression Functions Require licensed product #39 5770-SS1 International Components for Unicode to be installed on both IBM i 7.1 and 7.2. (This product is free with IBM i.) If you do not have this installed you will receive this error: SQL State: 42704 Vendor Code: -204 Message: [SQL0204] QQQSVREG in QSYS type *SRVPGM not found. Cause . . . . . : QQQSVREG in QSYS type *SRVPGM was not found. In short, regular expressions are used for advanced pattern matching, well beyond what a typical LIKE predicate offers. If you’re unfamiliar with regular expressions and how they can be used in a database environment, check out Staggering SQL String Handling with Regular Expressions. For many parts of this tutorial, I’ll be using the EMP_RESUME table in the CORPDATA sample database. REGEXP_LIKE Predicate Similar to the standard LIKE predicate, REGEXP_LIKE performs pattern testing on a text column using a regular expression pattern, rather than a simple wildcard pattern. This REGEXP_LIKE example shows how the new predicate can be used to find all plain-text resumes that have a correctly formatted birthday of the form: month name day, yyyy SELECT * FROM CORPDATA.EMP_RESUME WHERE RESUME_FORMAT='ascii' AND REGEXP_LIKE(RESUME, 'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST| SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}' ,'i'); The first argument passed to REGEXP_LIKE is the expression to search, in this case the RESUME column. The second argument is the regular expression pattern, and it does take some practice to get used to creating a RegEx pattern. The full list of regular expression control characters (as recognized by DB2 for i) used to build a pattern can be found in the SQL Reference manual. The third argument is one or more flag values that control how to interpret the RegEx processing. Shown in the example above is ‘i’, which means ignore case sensitivity. The possible flag values for this parameter are:
These flag values are worth studying as they can be used with all of the REGEXP scalar functions. As expected, the search criteria can be negated so that the following query returns resumes that do not contain a correctly formatted birthday: SELECT * FROM CORPDATA.EMP_RESUME WHERE RESUME_FORMAT='ascii' AND NOT REGEXP_LIKE(RESUME, 'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST| SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}' ,'i'); Another optional parameter that can be used is the start position to search within the string. For example, to ignore the first 100 characters of the resume include 100 as the start position:
SELECT *
FROM CORPDATA.EMP_RESUME
WHERE RESUME_FORMAT='ascii'
AND REGEXP_LIKE(RESUME,
'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|
SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}'
,100,'i');
New Scalar Functions: REGEXP_COUNT This function is used to count the number of occurrences of a regular expression pattern within a string. For instance, this function counts the folder depth of an IFS file path: VALUES(REGEXP_COUNT('/qibm/ProdData/Java400/bin/JavaDoc','/')) The result is 5. The following function invocation returns the number of occurrences of a simple phone number pattern within a plain text resume: SELECT EMPNO,REGEXP_COUNT(RESUME, '(PHONE:|PH:|MOBILE:|CELL:)s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4}', 'i') FROM CORPDATA.EMP_RESUME WHERE RESUME_FORMAT='ascii'; Similar to REGEXP_LIKE, the function’s parameters (from left to right) are the text to search, the regular expression pattern and, optionally, any flags to use. In this case, the “ignore case” flag is used so that both “MOBILE” and “mobile” will be identified as a pattern match. REGEXP_SUBSTR (a.k.a. REGEXP_EXTRACT) This function will extract a substring from a given column or expression based on a regex pattern. The plain text version of the RESUME column in table EMP_RESUME contains a header that looks like this: Address: 1150 Eglinton Ave Mellonville, Idaho 83725 Phone: (208) 875-9933 Birthdate: September 15, 1925 Sex: Female Marital Status: Married Height: 5'2" Weight: 120 lbs. The phone number line can be extracted from the text with this REGEX pattern: PHONE:s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4} Using the REGEXP_SUBSTR function, the phone number can be extracted from the resume as follows: SELECT EMPNO,REGEXP_SUBSTR(RESUME, 'PHONE:s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4}' ,1,1,'i') FROM CORPDATA.EMP_RESUME WHERE RESUME_FORMAT='ascii'; The extracted phone number look like this: Phone: (208) 875-9933 This result could require a little more formatting such as ignoring extraneous spaces and eliminating the word “Phone”. To make these corrections, parenthesis can be placed around the area code and phone number in the regex pattern to make these elements of the pattern a “capture group” (see bolded red letters below): SELECT REGEXP_SUBSTR(RESUME, 'PHONE:s*(({0,1}d{3}){0,1}s*d{3}(s+|-) d{4})' ,1,1,'i', 1) AS PHONE FROM CORPDATA.EMP_RESUME WHERE RESUME_FORMAT='ascii' The extra parenthesis in the pattern make the phone number itself a capture group. The optional sixth parameter passed to the REGEXP_SUBSTR function tells it to extract capture group number 1 (capture group 0 represents the entire pattern). The improved result is just the phone number without any leading or trailing words, spaces, etc.: (208) 875-9933 REGEXP_REPLACE As Yoda would say: a powerful ally in the Jedi developer’s toolbox, the REGEXP_REPLACE function is. It can make complex text updates a breeze. Say the area code 208 is about to be eliminated with area code 222 taking its place. The resume phone numbers need to be updated accordingly. However, a blanket exchange of literal ‘208’ with ‘222’ using the SQL REPLACE function may inadvertently update something else in the document such as a non-area code portion of a phone number, a zip code, a house number or an apartment number. REGEXP_REPLACE comes in handy here because a regex pattern allows the search to be context specific to ensure the update only applies to an area code of a phone number and not something else. Here is an UPDATE statement that will accomplish just such a feat: UPDATE CORPDATA.EMP_RESUME SET RESUME=REGEXP_REPLACE(RESUME, '(.*?PHONE:s*({0,1})208( ){0,1}s*d{3}(s+|-)d{4}.*?)', '$1222$2',1, 0,'is') WHERE RESUME_FORMAT='ascii' Before I explain this mess of cryptic characters, study again what a plain text resume looks like in the sample data (abridged to show portions of the header): Resume: Delores M. Quintana Personal Information Address: 1150 Eglinton Ave Mellonville, Idaho 83725 Phone: (208) 875-9933 Birthdate: September 15, 1925 Sex: Female Department Information Employee Number: 000130 Manager: Sally Kwan Position: Analyst Phone: (208) 385-4578 Hire Date: 1971-07-28 Notice that phone numbers with the 208 area code appear twice. After the above statement runs, all occurrences of the 208 area code are changed to 222. The important things to note about the use of REGEXP_REPLACE for this purpose are highlighted in red. The first thing to consider is the pattern itself: (.*?PHONE:s*({0,1})208(){0,1}s*d{3}(s+|-)d{4}.*?) Remember, adding parentheses around elements of a pattern turn the elements into a “capturing group.” In the pattern above, the first capturing group identifies everything in a document prior (starting from the beginning of the document or the last pattern match) to the start of an area code: (.*?PHONE:s*({0,1}) The second capturing group identifies everything after the area code until another phone number appears or until the end of the document is reached: (){0,1}s*d{3}(s+|-)d{4}.*?) The next thing to note about the use of REGEXP_REPLACE is the replacement expression as shown in parameter 3: '$1222$2' This is the text we want to use to replace the source text. Since we’re updating the RESUME column, we need to include the entire content of each resume. The $1 and $2 are special identifiers that refer to ordinal based capturing groups within a pattern. The regex pattern was created to identify everything before an area code (capture group 1), the area code itself, and everything after the area code (capture group 2). When the replace is done, the replacement text will be built by concatenating a string consisting of the content of capture group 1 (everything prior to the area code), followed by the replacement area code literal ‘222’, followed by the content of capture group 2 (everything after the area code). Note that $0 refers to the default capturing group which is the entire regular expression pattern. An alternative syntax for $1 and $2 is 1 and 2. The fifth REGEXP_REPLACE parameter represents the occurrence to replace so if three area codes are identified in the resume column, passing a 2 will only update the second match. Passing a zero as shown in this example instructs the REGEXP_REPLACE function to update all occurrences of the matched pattern. The last thing to note is the ‘s’ flag is also specified in the flags parameter:
'is'
This flag specifies that the ‘.’ control character in the pattern matches any character including a line terminator character. This allows the “reluctant quantifier” portion of the pattern element .*? to include text that spans multiple lines. Without this flag, the ‘.’ match would not go beyond a single line. Of course, the REGEXP_LIKE predicate can be added to limit the search to only those that contain phone number patterns within the 208 area code. UPDATE CORPDATA.EMP_RESUME SET RESUME=REGEXP_REPLACE(RESUME, '(.*?PHONE:s*({0,1})208(){0,1}s*d{3}(s+|-)d{4}.*?)', '$1222$2',1,0,'is') WHERE RESUME_FORMAT='ascii' AND REGEXP_LIKE(RESUME,'({0,1}208){0,1}s*d{3}(s+|-)d{4}','i') REGEXP_INSTR This function returns the position of a pattern within a string. The first character position in a string is considered position one so don’t expect it to be zero as in a zero based language such as Java. In this example, REGEXP_INSTR is used to identify the last component of an IFS path (as identified in capture group 2: (w+)):
VALUES(
REGEXP_INSTR('/qibm/ProdData/Java400/bin/JavaDoc',
'(/w+)*/(w+)'
,1,1,0,'', 2))
The result is 28, the position of ‘JavaDoc’ in the given IFS path. Parameters 3 through 7 of REGEXP_INSTR are optional and represent start position, occurrence, return option (more on this in a minute), flags and capture group ordinal. The “2” value passed as parameter 7, instructs the function to return the position of capture group 2 rather than the position of the entire pattern. One other WAY COOL feature of REGEXP_INSTR is parameter 5, the “return option”. This value can be a zero or a one. When set to zero (as shown above), the function returns the start position of the requested pattern. When set to one, the function returns the position after the matched pattern.
VALUES(
REGEXP_INSTR(
'/qibm/ProdData/Java400/bin/JavaDoc',
'(/w+)*/(w+)'
,1,1,1,'',2))
In this case, the function returns 35, the position after the pattern ends. In string processing, it’s common to identify the start position of a search string and then add the length of the same search string to the original start position. The “return option” parameter prevents the need to do this separate addition step. Since regex pattern matches can vary greatly in length, this information (start and after positions) can help you determine the length of the identified pattern. Express Yourself with RegEx Regular expressions are a powerful tool in the dev toolbox and it’s great to have them available with DB2 for i. Be careful to note that not all regular expression engines support the exact same features. For example, it doesn’t look like the DB2 for i includes support for named capture groups. Also, DB2 for i doesn’t offer a RegEx split function, which is extremely useful for parsing delimited lists. Likewise, there may be some subtle differences in the control character patterns supported by various engines. Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page. RELATED STORIES DB2 For i, Java, And Regular Expressions Staggering SQL String Handling with Regular Expressions
|
[…] ITJungle era già uscito un po’ di tempo fa un ottimo articolo di Michael Sansoterra: “Native Regular Expressions In DB2 For i 7.1 And 7.2” che presentata degli ottimi esempi di REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE e […]
I created a flat file using CRTPF. This creates a file where the one field name is the same name as the file name. If I use REPLACE, it will work. If I use REGEX_REPLACE, it won’t work and returns with: Message: [SQL0171] Argument 01 of function REGEXP_REPLACE not valid.
I really want to use REGEXP_REPLACE because of the wild card feature and case insensitive. Any idea why the two behave differently?