A Database Union is Not a Join

January 23, 2008 Hey, Brian

We recently acquired a company that has had its own payroll system for many years. We are going to combine the two payrolls starting next year but in the meantime, we would like to be able to use some of our internal cost analysis programs on the combined data without combining the two files.

We have created an earning master physical file in which we are staging the data that we will use to merge the two systems next year. This file has the exact shape as our current payroll master file. Without creating another file and physically merging them, thus creating another burden for our staff, I would like to “join” all the records using DDS into one file and override the report programs so that they use this logical file for reporting. However, when I try to join them, I do not get all the records. In fact, depending on which file I specify first in the join, I get exactly that many records in my output view. Offhand, do you know what I am doing wrong?

–Aaron

Hi Aaron,

E. F. “Ted” Codd, the designer of relational database, defined several relational operations. The select operation limits the rows (records) in a view. The project operation limits the columns (fields) in a view. The join is also a powerful operation that combines data from matching records in multiple files. However, the join does not have the ability to take two separate record sets and merge them into one file, such that the number of records in the merged file equals the total number of records in both files. The merge operation as defined by Ted Codd is called union.

Though there is no UNION keyword in DDS, there is a way to code for UNION. In the next two examples, you have an old payroll master file and a new master file. In essence this is your situation. Your objective is to marry (logically merge or union) the two files, using fields that are common to both of them.

Unlike a join, the result of a union is a view of similar data in multiple files as if the data were all lumped together, and then sorted by the union key level field. In other words, if there were a three-file union, and each of the three physical files had 33 records, the union view would have 99 records–all sorted together like one big homogenous file view.

Now, contrast union with join. For example, let’s say you joined the same three physical files. Assume, for this instance, that you know that the data is perfect and that the resulting join would produce a logical file that was really a natural, inner, and matching join (three names for the same thing). In other words, there would be a matching record in each of the secondaries (underlying physical files), for each primary record. In other words again, both secondary physical files would each have one data record to supply for each primary record presented by the join logical view.

If this were all true, then after the join, the resulting view would not show 99 records, as in the union example. Instead it would show just 33 records. As a friend of mine would say, “‘combinatorically’ speaking, the join combines the three records into one record.” Therefore, each record would show a piece (one or several fields) from each of the physical files that were joined.

The following is an example of how to code a union for your shop. First we show the two physical files, NEWMSTR and OLDMSTR and then the test data in the files and then the join with some of the fields.

 A*  PAYROLL MASTER   NEWMSTR
 A          R EARNMSTR     ... TEXT('EARNINGS MASTER')
 A            EMPLNAME      25                        
 A            EMPINL         2                        
 A            ACCT#          6  0                     
 A            PAYCOD         1                        
 A            RATE           6S 2                     
 A            EMPNO          6  0                     
 A            LCHDAT          L ... TEXT('ADDED FIELD')
 A          K EMPNO

 A*  PAYROLL MASTER   OLDMSTR
 A          R OLDEARNR  ...     TEXT('OLD EARN MASTER')
 A            EMPLNAME      25                         
 A            EMPINL         2                         
 A            ACCT#          6  0                      
 A            PAYCOD         1                         
 A            RATE           6S 2                      
 A            EMPNO          6  0                      
 A          K EMPNO

Query Against NEWMSTR

 Line   ....+....1....+....2....+....3....+....4....+....5....+....6....+.
        EMPLNAME        EMPINL  ACCT# PAYCOD     RATE    EMPNO  LCHDAT    
 000001 Snidely           JJ   245765   D        3.45        0  2007-12-13
 000002 Norman            CJ   897345   X        5.63       20  2007-12-14
 000003 Binsnon           JX   567345   K        4.53       30  2007-12-15
 ****** ********  End of report  ********

Query Against OLDMSTR

 Line   ....+....1....+....2....+....3....+....4....+....5....+ 
        EMPLNAME        EMPINL  ACCT# PAYCOD     RATE    EMPNO  
 000001 Waldono           GG   567567   H        5.53        5  
 000002 Finstermick       KC   094536   L        4.67       15  
 000003 Walloby           FM   345298   H        6.89       25  
 ****** ********  End of report  ********

DDS LF – UNION – Some Fields

A* LOGICAL FILE (PRUNSOME)
A*                        
A          R PRUNSOMR    ...    PFILE(NEWMSTR OLDMSTR) 
A            EMPLNAME                                         
A            EMPINL                                           
A            RATE                                             
A            EMPNO                                            
A            PAYCOD                                           
           K EMPNO

Query Against PRUNSOME

 Line   ....+....1....+....2....+....3....+....4....+...
        EMPLNAME        EMPINL     RATE    EMPNO  PAYCOD
 000001 Snidely           JJ       3.45        0    D   
 000002 Waldono           GG       5.53        5    H   
 000003 Finstermick       KC       4.67       15    L   
 000004 Norman            CJ       5.63       20    X   
 000005 Walloby           FM       6.89       25    H   
 000006 Binsnon           JX       4.53       30    K   
 ****** ********  End of report  ********

Union takes two almost identically structured files and combines them. A good DDS coding example is that shown above, in which there are new or current masters (NEWMSTR) and old masters (OLDMSTR). The new, combined, view of the data (PRUNSOME logical file) includes only the named (projected) fields from all records in both the old Payroll Earnings Master (OLDMSTR) and the new Payroll Earnings Master (NEWMSTR).

Overall, up to 32 files can be unioned. Though there are two files defined in the PFILE, in this example, without library qualifiers, this is not a join, since there is no Join Record Format (J in column 17). Each record contains data from just one physical file, not both–as in a JOIN. A key field (EMPNO in this case) must be specified in a union. The key field sorts the records in this case by employee number. Note in the PRUNSOME query output that the records from the NEWMSTR and OLDMSTR files are interleaved due to the sort on EMPNO.

The trick is to use non-join file DDS and to specify both files using the PFILE keyword as shown in the example above. It provides the same facility as physically merging the files and when you override your reporting analysis programs to this file, as long as the fields are exposed, both companies’ data will be available for management analysis. Hope this does the trick for you.

Here are a few other examples of union using DDS.

DDS for PRUNALL Logical File

 A*  UNION A FILE WITH ITSELF -- UNION LOGICAL FILE PRUNALL   
 A*                                                           
 A          R EARNMSTR                  PFILE(NEWMSTR NEWMSTR)
 A          K EMPNO

This union is with the fields from all records in both the “old payroll master” and the “new payroll master.” Notice the format name is EARNMSTR. This says that the union will use the format (all the fields) of the new master file. There cannot be any extra fields in NEWMSTR, compared to the second file, whether it is OLDMSTR or another file or another reference to NEWMSTR as in the example above. Since all fields in NEWMSTR also exist in the second reference to NEWMSTR, this is a valid union. If there are any field differences in the two or more files involved in the join, the logical file will not compile and it will produce severe error messages, and thus will not be created. The second copy of NEWMSTR can be replaced in the above union by a different physical file with identical fields as NEWMSTR and this will work the same. However, OLDMSTR cannot be used in this example since it does not have the LCHDAT field as does the NEWMSTR.

Note that the data from NEWMSTR is repeated twice when the new logical file PRUNALL is queried as shown below:

Query of Logical Union FIle PRUNALL

       EMPLNAME        EMPINL  ACCT# PAYCOD     RATE    EMPNO  LCHDAT     
000001 Snidely           JJ   245765   D        3.45        0  2007-12-13 
000002 Norman            CJ   897345   X        5.63       20  2007-12-14
000003 Binsnon           JX   567345   K        4.53       30  2007-12-15
000004 Snidely           JJ   245765   D        3.45        0  2007-12-13
000005 Norman            CJ   897345   X        5.63       20  2007-12-14
000006 Binsnon           JX   567345   K        4.53       30  2007-12-15
****** ********  End of report  ********

Now, let’s try a new wrinkle on UNION. Let’s use a format keyword to reformat the unioned data. In the next example, we assume that we have added a new field to the new earnings file, and we want a unioned view of the fields in the OLDMSTR file with the NEWMSTR file. The OLMMSTR file does not contain the new field, LCHDAT. This may be something you would do to accommodate consolidations, etc., so that you can get the known information from two similar groups of records. Take a look at the DDS below and ask yourself if it has a new format; if so, what format is that?

In this example, you see another union, but this time, assume that the files have different formats. Since we still want the records unioned, you are going to have to select a format that has fields that exist in both files to govern the new unioned view.

DDS LF PRUNF – UNION – Format

 A*  Logical FIle PRUNF        
 A          R OLDEARNR                  PFILE(NEWMSTR OLDMSTR)
 A                                      FORMAT(OLDMSTR)       
 A          K EMPNO

This new logical view of data called PRUNF, uses the format of the old earnings file. The LCHDAT field is no tin the OLDMSTR file. Thus, we cannot use the EARNMSTR format for the union, since there is no LCHDAT field in the OLDEARN file.

This new format shares the OLDEARNR format from the old payroll master file. Fields from the new master, which are not in the old format, are not included in the new unioned view, called PRUNF.

Query of Logical Union File PRUNF

       EMPLNAME        EMPINL  ACCT# PAYCOD     RATE    EMPNO 
000001 Snidely           JJ   245765   D        3.45        0 
000002 Waldono           GG   567567   H        5.53        5 
000003 Finstermick       KC   094536   L        4.67       15 
000004 Norman            CJ   897345   X        5.63       20 
000005 Walloby           FM   345298   H        6.89       25 
000006 Binsnon           JX   567345   K        4.53       30 
****** ********  End of report  ********

                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot