Admin Alert: Beyond Replication in an i5/OS High-Availability Environment

June 3, 2009 Joe Hertvik

One company I deal with has two different high-availability setups that mimic production partitions for its main systems. They’ve been running this configuration for two years, and it’s amazing how many high availability issues occur that have little or nothing to do with basic replication. This week, I’ll look at the issues beyond basic replication that come up when building a Capacity BackUp (CBU) system, and at how they can affect failover processing.

The CBU in One Paragraph

A CBU is a specially configured iSeries/System i/Power i machine that communicates with your main production partition to replicate production data and applications by using high availability software installed on both machines. The CBU duplicates a production system; the production system is sometimes referred to as the source or production box or machine, and the CBU is sometimes referred to as the target box or machine. In the event of a disaster, the CBU can be switched over to “impersonate” the production box with very little downtime, servicing users, devices, and companion servers. When the main production machine comes back up, the CBU relinquishes its role and production is switched back to the regular system. See the articles in the Related Stories section for more information about i5/OS CBU boxes.

When Basic Replication Isn’t Working Out as Planned

In a high availability environment, all relevant data must be replicated from the production box to the CBU as it is created, changed, or deleted. However, when replicating objects between a source box and a target CBU, I regularly see two mistakes that can and will bite you whenever you attempt to fail over to your backup box.

The first mistake is taking replication for granted. Don’t assume that new libraries or folders on the source system will automatically be added to the target system. They won’t. Replicating data between systems is an ongoing process that must be reviewed weekly, if not daily. So your first duty beyond basic replication is to set up daily auditing reports that tell you when a library is present on the source system but not on the target system. When you find a new library that should be added to your CBU, start replicating it immediately, so that you don’t get a nasty surprise the next time you fail over. Many popular replication packages offer comparison reports that show which libraries are present on your production system but missing from your target system. Run and review these reports every single day to keep your libraries in sync.
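The daily audit described above boils down to a set comparison. Here is a minimal Python sketch, assuming you have already exported the library names from each system into plain lists (for example, from your replication package's comparison report). The function name, the exclusion list, and the sample library names are all hypothetical and are not part of any replication product.

```python
def missing_on_target(source_libs, target_libs, excluded=()):
    """Return libraries present on the source but absent on the target.

    Names are compared case-insensitively. Libraries listed in
    `excluded` (deliberately not replicated) are ignored.
    """
    def normalize(names):
        return {name.strip().upper() for name in names}

    gap = normalize(source_libs) - normalize(target_libs) - normalize(excluded)
    return sorted(gap)


# Hypothetical sample data for illustration only.
source = ["PRODDATA", "PRODPGM", "NEWAPPLIB", "WORKLIB"]
target = ["PRODDATA", "PRODPGM"]
print(missing_on_target(source, target, excluded=["WORKLIB"]))
```

Keeping an explicit exclusion list means the report stays empty on a good day, so any library that does show up is worth a look.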

The second mistake occurs when administrators don’t make sure that replicated data stays in sync. Before a failover, perform further auditing on your data groups to make sure that someone hasn’t accidentally removed a library from the replication scheme. My shop ran a test last month where we found a critical library was present on both the target and source systems, but its contents hadn’t been replicated in six months. Replication had accidentally been turned off; the programs worked but the data was old. So in addition to making sure that you have the same libraries on both systems, make sure that the data is being kept in sync. Otherwise, you may have replicated the file structure perfectly but your data may not be up to date.
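One way to catch a library whose replication has quietly stopped is to flag any library that has had no changes applied on the target for longer than you would expect. The sketch below assumes you can obtain, for each library, the timestamp of the last change applied on the target; how you get that timestamp depends on your replication software and is not shown. The function name, the one-day threshold, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def stale_libraries(last_applied, now, max_age=timedelta(days=1)):
    """Return (library, age) pairs whose last applied change is older than max_age.

    `last_applied` maps a library name to the datetime of the most recent
    change applied on the target system. Results are sorted oldest first.
    """
    stale = [
        (lib, now - applied)
        for lib, applied in last_applied.items()
        if now - applied > max_age
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Hypothetical sample data for illustration only.
now = datetime(2009, 6, 3, 8, 0)
last_applied = {
    "PRODDATA": datetime(2009, 6, 3, 7, 55),
    "CRITLIB": datetime(2008, 12, 1, 0, 0),
}
for lib, age in stale_libraries(last_applied, now):
    print(f"{lib}: no changes applied for {age.days} days")
```

A library that legitimately changes rarely will also be flagged, so treat the output as a list of things to verify rather than proof that replication is off.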

Are All Your TCP/IP Settings in Sync?

Chances are good that your CBU failover routines already contain provisions to activate the same TCP/IP interfaces on your target CBU system that you already have on your source system. But be careful that your target system also contains these other TCP/IP entries that your source system uses to communicate with other systems.

TCP/IP Host Table entries–i5/OS host table entries create local DNS-like names for sending application data to the IP addresses of any companion servers and machines. If you’re not keeping your CBU host table up to date with the production host table, the CBU may not be able to exchange data with partner machines while it is failed over. You can find the i5/OS host table on the green screen by opening the Configure TCP/IP (GO CFGTCP) menu and selecting option 10, Work with TCP/IP host table entries. To maintain your system host table in iSeries Navigator V5R4M0 (OpsNav), right-click on the Network→TCP/IP Configuration node and select Host Table from the pop-up menu that appears.
TCP/IP routes–Your source machine may be set up with specific TCP/IP routes that direct IP traffic along the best or only path to another system on your network or on the Internet. When you fail over, you may also need to duplicate or recreate the source machine’s TCP/IP routes on your target system. TCP/IP routes can be found by selecting option 2, Work with TCP/IP routes, off the Configure TCP/IP (GO CFGTCP) menu. Routes can be found in OpsNav by opening either the Network→TCP/IP Configuration→IPv4→Routes node or the Network→TCP/IP Configuration→IPv6→Routes node, depending on which IP version you are using.
System Distribution Directory replication–The System Distribution Directory (SDD) is used by many IBM licensed applications and third-party applications to distribute output to other devices or to the Internet. Applications using the directory include the iSeries Access product line, the AnyMail/400 Mail Server Framework (MSF), and email and fax applications. It’s important to synchronize or recreate the production system’s SDD on the target CBU during failover. To learn how to do this, check out my article on How to Recreate/Restore a System Distribution Directory.
SMTP Name Table–The SMTP name table is an old and reliable system construct that associates SNADS user IDs and addresses with specific SMTP email addresses. Some packages use the SMTP name table to send email, so it may need to be replicated to your target system. To view the SMTP name table on your system, type in the Work with Names for SMTP (WRKNAMSMTP) command on a green-screen command line. The SMTP name table is easy to replicate between systems. If you make sure that your high availability software replicates the SYSALIASES member of the QATMSMTPA file in the QUSRSYS library, the contents of the SMTP name table will be replicated between your systems.
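For settings like the host table that are not replicated automatically, a periodic comparison can catch drift before a failover does. The following Python sketch assumes you have captured each system's host table as a simple mapping of host name to IP address (a simplification, since a real host table entry can carry several names per address). The function name, host names, and addresses are hypothetical.

```python
def compare_host_tables(source, target):
    """Compare two host tables given as {host_name: ip_address} dicts.

    Returns (missing, mismatched): names absent on the target, and names
    whose address differs, mapped to (source_ip, target_ip).
    """
    missing = sorted(name for name in source if name not in target)
    mismatched = {
        name: (ip, target[name])
        for name, ip in source.items()
        if name in target and target[name] != ip
    }
    return missing, mismatched


# Hypothetical sample data for illustration only.
prod_hosts = {"EDISERVER": "192.0.2.10", "FAXSERVER": "192.0.2.20", "WMS": "192.0.2.30"}
cbu_hosts = {"EDISERVER": "192.0.2.10", "FAXSERVER": "192.0.2.21"}

missing, mismatched = compare_host_tables(prod_hosts, cbu_hosts)
print("Missing on CBU:", missing)
print("Different address on CBU:", mismatched)
```

The same comparison shape works for TCP/IP routes if you key each route by its destination and compare the next hop.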

Subsystem Descriptions, Job Queues, and Job Descriptions

In order to run your target system as an exact duplicate of your production partition, all of your CBU subsystem descriptions, job queues, and job descriptions must match their production system counterparts. For example, some batch processes may be set up to submit specific jobs to specific job queues that are attached to specific subsystems. If the job queues used on the production system aren’t present on the CBU, the jobs will not be submitted. Similarly, if the job queues on the CBU aren’t associated with the same subsystems as they are on the production system, jobs may be submitted to a job queue on the target CBU system but may not run in their intended subsystem, or may not run at all.

Job descriptions are slightly different but the idea is the same. Submitted jobs rely on job descriptions to retrieve their job priority, output priority, initial library list, and other job parameters. If the job descriptions on the CBU aren’t exactly the same as the job descriptions on the production box, submitted jobs may fail or run with the wrong parameters.

Here are the rules of thumb for making sure that all the correct job descriptions, job queues, and subsystem descriptions are the same on both systems.

If any of these objects reside in a production library that is being replicated (e.g., a third-party application library), they will be correct on the target system unless you have excluded any of these objects from replication.
If any of the objects reside in a library that is not being replicated, it’s up to you to make sure that all the right objects with all the right properties are present on both systems. You can ensure that the job queues, subsystem descriptions, and job descriptions are the same on both systems by either including them in your replication lists or by manually checking the objects on one system against the other system. However, be careful with any objects that reside in IBM ‘Q’ libraries, such as the QBATCH subsystem description that generally resides in QSYS. Your replication software may not be set up to replicate objects in QSYS or other IBM libraries.

IBM Licensed Programs

CBU failover scenarios can become complicated if different IBM licensed programs are loaded on your source system than on your CBU system. Some software packages may need various IBM licensed programs or options to work (such as the Portable Applications Solution Environment, the CCA Cryptographic Service Provider, or the Java Developer Kit). If these licensed programs aren’t available during a failover, critical applications could refuse to run or could run incorrectly.

As you’re setting up your CBU, audit the IBM licensed programs on the source system against the IBM licensed programs on the CBU. You can do this by running the Display Software Resources (DSPSFWRSC) command on both machines and comparing the results. In general, any IBM licensed program or option that is present on the source machine should also be present on the target machine. If one isn’t, make arrangements to load it on the target machine (along with relevant PTFs) and call IBM for license keys for each package, as needed. The success of your CBU failover scenarios may depend on having the correct IBM products loaded.
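Comparing two DSPSFWRSC listings by eye is error-prone, so it can help to script the comparison. The sketch below assumes you have already reduced each listing to a list of (product ID, option) pairs; parsing the command's actual output is not shown. The function name is hypothetical and the product IDs and options are illustrative sample values, so check them against your own listings.

```python
from collections import defaultdict


def missing_licensed_programs(source, target):
    """Group (product_id, option) pairs found on the source but not the target.

    Returns a dict mapping each product ID to the list of options that
    are installed on the source system but absent on the target.
    """
    gaps = defaultdict(list)
    for product_id, option in sorted(set(source) - set(target)):
        gaps[product_id].append(option)
    return dict(gaps)


# Illustrative sample data only; verify against your own systems.
prod_lpps = [("5722SS1", "*BASE"), ("5722SS1", "33"), ("5722SS1", "35"), ("5722JV1", "*BASE")]
cbu_lpps = [("5722SS1", "*BASE"), ("5722SS1", "33")]

for product_id, options in missing_licensed_programs(prod_lpps, cbu_lpps).items():
    print(f"{product_id}: load option(s) {', '.join(options)} on the CBU")
```

Grouping by product ID keeps the report close to how you would order and install the missing pieces.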

Beyond Basic Replication

While the tips in this article won’t solve all your high availability failover problems, they will alert you to some issues that may not have been readily apparent when you first set up your CBU. Remember, you’ll learn something new every time you fail over production processing to a CBU. Use these tips and your own experience to improve your high availability solution, even if you never have to use it in a disaster.