Guru: In Pursuit Of Reasonable Data
March 9, 2020 Ted Holt
My awareness of the indifference of computers toward the reasonableness of data dawned when I enrolled in the university. The band director told me to sign up for band and jazz band for zero semester hours. He said the stupid computer wouldn’t know that no one takes a course for zero hours, wouldn’t flag me to obtain permission from the dean to carry an overload, yet would grant me the credit. He was right.
Love ‘em or hate ‘em, we live with them, we learn to get around them when they limit us, and we deal with them when they make mistakes. Except that you and I know that computers don’t make mistakes. As someone has said, “He who blames the computer is making two mistakes, one of which is blaming the computer.”
For a computer to do its job properly, two things are necessary: reliable, dependable, bug-free software, and reasonable data. I wrote about the latter recently, and I wish to continue with that topic today.
Domains
You have undoubtedly heard the general database terms that IBM encourages us to use these days: table instead of physical file, row instead of record, column instead of field. Let me introduce another relational database term: domain. To quote a textbook I found online, “a domain is a set of acceptable values that a column is allowed to contain.” I don’t believe anyone could give a better definition of domain.
Enforcing a set of acceptable values is easy when those acceptable values are few in number. We need only add an appropriate constraint, and the database will not accept invalid data. Notice the ID and Type columns in the following table definition.
create or replace table EmployeeMaster for system name EmpMast ( ID dec(5) primary key check (ID > 0), Name varchar(16), Type char(1) check (Type in ('S', 'H', 'T', 'P')), Salary dec(9,2), Wage dec(5,2) )
The acceptable values for the ID and Type columns can easily be listed. Given enough time, I could list all the positive five-digit whole numbers. I could list the four permitted values of Type much more quickly.
But what about a column for which the number of acceptable values is impractical or impossible to list? What we do about that?
We look for a pattern.
Let’s say that we work for a company that maintains inventory. Each inventory item has an ID number. The first question to ask ourselves is “what is the domain of item numbers?” Or to say it in plain English, “What does an item number look like?”
In our pretend company, an item number is one of the letters A, B, or C, followed by four digits.
Some valid item numbers | Some invalid item numbers |
A2157 | 2A157 |
B3200 | C32 |
C0000 | BR549 |
The following check constraint enforces this pattern:
create or replace table . . . item char(5) check (translate(item, 'AA000000000','BC123456789') = 'A0000'), . . .
The TRANSLATE function returns the first value after the characters in the second value have replaced the corresponding characters in the third value. In this example, all Bs and Cs are converted to As and digits 1 through 9 are converted to zeros. Everything else, including As and zeros, remains unconverted. If the converted value is A0000, the item number is in the domain.
Suppose item numbers have two patterns. Some follow the convention just mentioned. Others consist of two of the letters A, B and C in any combination followed by three digits. Here’s the check constraint:
item char(5) check (translate(item, 'AA000000000','BC123456789') in ('A0000', 'AA000'))
TRANSLATE converts the characters to the corresponding characters and the result can be either of two values.
Let’s consider one more example. Let’s check a name for reasonableness. What’s the domain for a name of a person?
In English, names consist of letters and two punctuation marks: hyphen and apostrophe. Some people have names with an embedded blank. There’s no way to check for all possible names, but we can check for a pattern. Here’s one possibility:
name char(25) check (translate(name, ' ', 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-''') = ' '),
TRANSLATE replaces all uppercase letters, all lowercase letters, hyphens and apostrophes with blanks. Since the second parameter is shorter than the third parameter, the system pads the second parameter with enough blanks to make it as long as the third parameter. If the name has a character that is not found in the third parameter, TRANSLATE will not convert that character to blank and TRANSLATE will return a non-blank value.
This doesn’t completely enforce reasonableness, of course. It would allow a name like JXnD’pt—-k, but the computer can only do so much.
Is the Best Yet to Come?
Some other relational database management systems allow the use of regular expressions in constraints. For example, you might do this to check for the A0000 pattern:
item char(5) check (regexp_like(item,'[A-Z][0-9]{4}'))
While DB2 for i supports regular expressions in other contexts, it doesn’t let us use them in constraints — yet. I suspect that it’s only a matter of time until IBM adds this support. Regular expressions would allow us to be even more specific about pattern matches than TRANSLATE does, but TRANSLATE will have to do for now.
Midrange shops have traditionally enforced the reasonableness of data through validation routines in file maintenance and data entry programs. This is not an effective strategy for keeping an accurate database. If it were, I would not have seen so much bad data over the past decades. We need constraints and triggers in addition to the validation routines.
RELATED STORIES
Guru: Better Check Constraints
Database Design – 2nd Edition – Chapter 7 The Relational Data Model (Adrienne Watt)