Merge Into the Synchronization Fast Lane with DB2 for i 7.1

September 22, 2010 Michael Sansoterra

One of the tedious chores that many developers face is keeping many different data sources synchronized. It is common for mid- to large-sized companies to have multiple software packages with requirements to keep synchronized copies of the item master, customer master, employee master, and other master files.

Other common synchronization tasks involve keeping aggregate tables of detailed data or even synchronizing DB2 data with a relational or non-relational remote data source, including remote databases, Web services, flat files, etc. Fortunately, to help with all of this, IBM has updated DB2 for i 7.1 to include support for the SQL MERGE statement.

MERGE (part of the SQL:2008 standard) is a statement that combines the functionality of INSERT and UPDATE statements into a single statement, generally for the purpose of synchronizing data. MERGE makes use of a target table (or updateable view) and a source query to do the synchronization.

A Simple Example

Let’s look at a quick example of how we might use MERGE to synchronize exact copies of an EMPLOYEE master file (from the DB2 for i EMPLOYEE sample table), which are placed in two different libraries (a.k.a., schemas). These files have a single column primary key: EMPNO. We’ll also assume this is a unidirectional synchronization, with the source library being merged into the target library. In the past, a complete synchronization of the tables would be done with three SQL statements:

/* Step 1 - Update Target Rows from Source Rows */
UPDATE TARGETLIB.EMPLOYEE AS TARGET
SET ROW=(SELECT * 
           FROM SOURCELIB.EMPLOYEE AS SOURCE
          WHERE TARGET.EMPNO=SOURCE.EMPNO)
WHERE EXISTS
(SELECT * 
   FROM SOURCELIB.EMPLOYEE AS SOURCE
  WHERE TARGET.EMPNO=SOURCE.EMPNO)

/* Step 2 - Remove Target Rows that are not in Source */
DELETE FROM TARGETLIB.EMPLOYEE AS TARGET
WHERE NOT EXISTS
(SELECT *
   FROM SOURCELIB.EMPLOYEE AS SOURCE
  WHERE TARGET.EMPNO=SOURCE.EMPNO)

/* Step 3 - Insert New Rows from Source into Target */
INSERT INTO TARGETLIB.EMPLOYEE
SELECT *
  FROM SOURCELIB.EMPLOYEE AS SOURCE
WHERE NOT EXISTS
(SELECT *
   FROM TARGETLIB.EMPLOYEE AS TARGET
  WHERE TARGET.EMPNO=SOURCE.EMPNO)

Now let’s take a look at a sample MERGE statement that is similar to the above statements:

MERGE INTO TARGETLIB.EMPLOYEE AS EMPLOYEE_TARGET
USING SOURCELIB.EMPLOYEE AS EMPLOYEE_SOURCE

ON (Employee_Target.EmpNo=Employee_Source.EmpNo)

WHEN MATCHED THEN
    UPDATE SET
    Employee_Target.FIRSTNME =Employee_Source.FIRSTNME,
    Employee_Target.MIDINIT  =Employee_Source.MIDINIT,
    Employee_Target.LASTNAME =Employee_Source.LASTNAME,
    Employee_Target.WORKDEPT =Employee_Source.WORKDEPT,
    Employee_Target.PHONENO  =Employee_Source.PHONENO,
    Employee_Target.HIREDATE =Employee_Source.HIREDATE,
    Employee_Target.JOB      =Employee_Source.JOB,
    Employee_Target.EDLEVEL  =Employee_Source.EDLEVEL,
    Employee_Target.SEX      =Employee_Source.SEX,
    Employee_Target.BIRTHDATE=Employee_Source.BIRTHDATE,
    Employee_Target.SALARY   =Employee_Source.SALARY,
    Employee_Target.BONUS    =Employee_Source.BONUS,
    Employee_Target.COMM     =Employee_Source.COMM

WHEN NOT MATCHED THEN
    INSERT (EMPNO,FIRSTNME,MIDINIT,LASTNAME,WORKDEPT,PHONENO,
            HIREDATE,JOB,EDLEVEL,SEX,BIRTHDATE,SALARY,BONUS,COMM)
    VALUES(
    Employee_Source.EMPNO,Employee_Source.FIRSTNME,
    Employee_Source.MIDINIT,Employee_Source.LASTNAME,
    Employee_Source.WORKDEPT,Employee_Source.PHONENO,
    Employee_Source.HIREDATE,Employee_Source.JOB,
    Employee_Source.EDLEVEL,Employee_Source.SEX,
    Employee_Source.BIRTHDATE,Employee_Source.SALARY,
    Employee_Source.BONUS,Employee_Source.COMM)

ELSE IGNORE;

Let’s examine the components of MERGE in a little more detail.

MERGE INTO begins the statement and is followed by an updateable table or view, known as the “target.”

The USING keyword specifies a “table reference, ” which can include just about any query, but does not include a common table expression. This is considered the source data for the merge.

The ON keyword specifies the correlation between the source and the target, usually a unique row identifier. However, you can also specify additional criteria here, such as limiting the MERGE to employees within a certain department, etc.

WHEN MATCHED THEN is used to indicate the operation to be performed when the ON clause condition has been met. In this case, an abbreviated UPDATE statement, using only a SET assignment clause, is used within the MERGE statement to set all the target row columns to be equal to the corresponding source row columns.

WHEN NOT MATCHED THEN indicates the operation to be done when a source row does not have a matching row in the target table. Typically, this is an abridged INSERT/VALUES statement. Please note that the table name is not specified because the target table is automatically used by MERGE. For the instance when the source and the target should always contain an exact number of matching rows, the SIGNAL statement can be coded to indicate an error condition.

While this example was somewhat trivial, please note that multiple WHEN MATCHED and NOT MATCHED clauses can be specified with additional criteria. These clauses will be evaluated in the order specified in the statement. This is similar to the evaluation of a CASE statement; once the first condition is true the specified operation will be evaluated with the remainder being ignored.

ELSE IGNORE specifies that nothing further is to be done when the matching condition for all prior WHEN clauses are false.

Please note when using MERGE that the synchronization is a uni-directional proposition–always from the source to the target. In the example above, you cannot update the source table based on data in the target. If you need two-way synchronization, then you will need a second MERGE statement with the source and target reversed.

Deletion Considerations

I said MERGE is “similar” to the three SQL statements because while DB2 can add new rows to the target and can update existing rows in the target, it cannot DELETE rows in the target for the circumstance where there is no matching row in the source. However, it can delete rows from the target in the case where there is a matching row from the source. I mention this because SQL Server’s implementation of MERGE has the option to delete rows from the target when there is no matching row in the source.

I initially expected DB2’s MERGE to be able to perform a target DELETE when there is no matching source row. However this behavior, I learned, is SQL Server specific and not part of the ANSI SQL standard. Bummer!

A DELETE statement is allowed in the WHEN MATCHED clause and can be used to remove rows from the target table when a match is found. However, a DELETE usually isn’t done unless additional criteria are specified (unless you want to delete all rows in the target when there is a matching source row). For example, if we wanted to remove MERGED employees from the target when the department is T01 (temporary employees), we could code a WHEN MATCHED THEN clause to the above MERGE statement like this:

WHEN MATCHED AND Employee_Source.WORKDEPT='T01' THEN
    DELETE

Because order of evaluation is important, this specific WHEN MATCHED would have to be placed prior to the more generic WHEN MATCHED clause shown earlier.

If temporary employees are not supposed to be in the source data, an error condition can be artificially created using the SIGNAL statement like this:

WHEN MATCHED AND Employee_Source.WORKDEPT='T01' THEN
    SIGNAL SQLSTATE '70001' 
       SET MESSAGE_TEXT='Temporary Employees Found'

Another Example

Let’s take a look at one more example. Consider the following Sales Summary table that summarizes customer sales by year and customer ID:

CREATE TABLE SalesSummaryByYear (
SalesYear      SMALLINT NOT NULL,
CustomerId     INT NOT NULL,
OrderQty       INT,
OrderLineValue DEC(19,4),
NumberLines    INT,
RankBySales    INT,
RankByOrders   INT,
CreatedOn      TIMESTAMP,
LastUpdatedOn  TIMESTAMP,
CONSTRAINT pk_SalesSummaryByYear 
           PRIMARY KEY (SalesYear,CustomerId))

You can use the MERGE statement to INSERT new and UPDATE existing rows in this table:

MERGE INTO SalesSummaryByYear Target
USING (
SELECT YEAR(OrderDate) AS SalesYear, CustomerId,
           COUNT(DISTINCT SOD.SalesOrderId) As NumberOrders,
           SUM(OrderQty) AS OrderQty,
           SUM(UnitPrice*OrderQty*(1-UnitPriceDiscount)) 
               AS OrderLineValue,
           COUNT(*) AS NumberLines,
           RANK() OVER(PARTITION BY CustomerId 
                       ORDER BY SUM(UnitPrice*OrderQty*
                           (1-UnitPriceDiscount)) DESC) 
                    AS RankBySales,
           RANK() OVER(PARTITION BY CustomerId 
                       ORDER BY COUNT(DISTINCT SOD.SalesOrderId) DESC)
                    AS RankByOrders
  FROM SalesOrderDetail SOD
  JOIN SalesOrderHeader SOH ON SOH.SalesOrderId=SOD.SalesOrderId
GROUP BY YEAR(OrderDate),CustomerId
) Source
ON Target.SalesYear=Source.SalesYear
AND Target.CustomerId=Source.CustomerId

WHEN MATCHED THEN
UPDATE SET
SalesYear=Source.SalesYear,
CustomerId=Source.CustomerId,
OrderQty=Source.OrderQty,
OrderLineValue=Source.OrderLineValue,
NumberLines=Source.NumberLines,
RankBySales=Source.RankBySales,
RankByOrders=Source.RankByOrders,
LastUpdatedOn=CURRENT_TIMESTAMP

WHEN NOT MATCHED THEN
INSERT (SalesYear,CustomerId,
OrderQty,OrderLineValue,NumberLines,
RankBySales,RankByOrders,CreatedOn)
VALUES (Source.SalesYear,Source.CustomerId,
Source.OrderQty,Source.OrderLineValue,Source.NumberLines,
Source.RankBySales,Source.RankByOrders,CURRENT_TIMESTAMP)
;

In this example, a complex query is used as MERGE’s source. If the sales order history was extremely large, a WHERE clause could be added to limit the query’s sales to only the past year or two, which should be OK because sales orders over one year old are generally never updated.

Using this technique is beneficial because:

The entire order history doesn’t need to be materialized.
A single statement handles inserts and updates into the target table.

If you need to synchronize with remote data sources, table functions can be used to aid with this task, if the situation is appropriate. For example, if table function TFXML reads customer master data from an XML source (such as a Web service or an integrated file system (IFS) file), then a MERGE statement like this is valid:

MERGE INTO MYDATA.CUSTOMER AS TARGET
USING (SELECT * FROM TABLE(MYPGM.TFXML()) AS SOURCE
ON (SOURCE.CustomerId=TARGET.CustomerId)
...

Other Considerations

One other important thing to know about MERGE concerns how it acts when running within a transaction. By default, the statement will run as an ATOMIC statement, that is, all of the modifications must succeed or the entire statement will fail and no changes are made. However, NOT ATOMIC may also be specified. In this case the statement is processed row by row, and all of the rows successfully processed up to the point of an error will be committed. Further, when NOT ATOMIC is specified, an additional option to continue processing (i.e., ignore the error) or stop processing is available.

Benefits of using MERGE include:

Only one DML statement is required to perform INSERT and UPDATE statement functions (and in some limited cases DELETE).
Complex source query expressions do not have to be repeated in individual SQL statements.
This statement has been implemented by other major database vendors including Microsoft (SQL Server 2008) and Oracle (10g Release 2).
There is a high degree of granularity in how modifications are handled within a transaction by specifying ATOMIC or NOT ATOMIC.
Source code maintainability may be increased because a developer will be forced to examine the implications to both an INSERT and UPDATE when changes are made.
In my opinion, MERGE provides a cleaner syntax when updating a table based on another query or view when compared to using an UPDATE statement with correlated subqueries in the SET and WHERE clauses.

In summary, MERGE is another one of those great new statements that makes the coding of a common task easier and squeezes even more work out of DB2.

Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page.

                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot