Admin Alert: Five Things that Kill Backups (and What to Do About Them)

March 28, 2007 Joe Hertvik

For system administrators, there isn’t any worse feeling than reaching for a backup tape only to realize that your backup system has failed and you can’t restore a critical file. While you can’t always prevent backup failures, you can monitor for simple things that will alert you to a disruption. To that end, here’s my top five list of i5 backup issues and what you can do about them.

Issue #1: Not changing your backup tapes

With many automated backup routines, you simply put in the backup media and the system does the rest by formatting the media and performing the backup. But what happens when you forget to change the media? In that case, the previous night’s backup can get clobbered, leaving a gap in your backup schedule.

In big shops with dedicated system operators (a vanishing breed), this may not be a big issue as there is usually a schedule to ensure that your media is changed and your drives are ready to go. In smaller shops, however, programmers or network administrators may be in charge of backup in addition to their other duties. If they get busy, backup media changes can get lost in the shuffle and last night’s backup media could be overwritten with tonight’s backup. One way to guard against this human error is to take advantage of auto-loader mechanisms on your backup drives.

If you’re using a backup drive with an auto-loader mechanism, you can load several days’ worth of media in the drive and then issue a simple Check Tape command (CHKTAP) at the end of your backup; the CHKTAP command will unload the last saved tape, triggering the drive to load up the next available tape in the loader. To do this, you would simply alter your backup routine to execute the following CHKTAP command after the backup finishes.

CHKTAP DEV(media_device_name) ENDOPT(*UNLOAD)

CHKTAP is a harmless command that does two things. First, it checks the volume label of the media sitting in your drive; second, it performs the action listed in the End of Tape Option parameter (ENDOPT) when it’s finished checking the media. CHKTAP is designed to check whether a certain media volume is present in the drive, but you can also use it to unload a tape, as shown here (by setting ENDOPT to *UNLOAD). Combined with an auto-loader, CHKTAP is a nice way to pop the existing media out of your drive and load other media without involving your support staff.
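As a rough sketch, the tail of a backup CL program using this technique might look like the following. The device name TAP01 and the library name MYLIB are placeholders for illustration only, and the single SAVLIB stands in for whatever save commands your own routine runs.

```cl
PGM
    /* TAP01 and MYLIB are hypothetical names; use your own */
    SAVLIB LIB(MYLIB) DEV(TAP01) ENDOPT(*REWIND)

    /* Unload the tape so the auto-loader can load the next one */
    CHKTAP DEV(TAP01) ENDOPT(*UNLOAD)
ENDPGM
```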

Issue #2: Not discovering a backup drive failure early enough

If your backup drive fails, you want to be notified as quickly as possible so that you don’t lose too many backup opportunities before the drive is fixed. For some shops without appropriate monitoring, a drive failure may go undetected for perhaps a day or two. What you need is an early warning system that something is wrong.

One of the best investments an i5 shop can make is to buy a system monitoring and paging/email software package such as Bytware’s MessengerPlus software or Help/Systems’ Robot/Alert product. Packages like these monitor messages as they occur. When an error is detected (such as a media device with a status of ‘Failed’ or a backup device which is varied off), the products can be configured to take action or to alert an administrator that something is wrong with the system.

Issue #3: Not answering drive messages

When I’ve worked in smaller shops, every once in a while someone would take out the previous backup tape from the tape drive and forget to put in a new tape. Depending on whether or not my iSeries partitions were monitoring for messages, this scenario could put the tape job in a message waiting state where the job is waiting for an answer before completing the backup. If the operators (or staff sitting in for operators) are particularly sloppy and don’t notice or answer the message, the waiting job could monopolize the tape drive and prevent the next backup from going off.

The solution here (again) is to put some kind of monitoring software on your backup job so that your staff is immediately alerted whenever a backup error occurs.
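If you don’t have a monitoring package in place yet, the backup program itself can do some very basic alerting. The following is only a minimal sketch and not a replacement for real monitoring: it assumes a CL backup program, uses the hypothetical names TAP01 and MYLIB, and relies on the job taking the default reply to inquiry messages rather than waiting for someone to answer. Check the default reply of any message you care about before depending on this behavior.

```cl
PGM
    /* Use default replies so the job does not sit waiting */
    CHGJOB INQMSGRPY(*DFT)

    /* TAP01 and MYLIB are hypothetical names; use your own */
    SAVLIB LIB(MYLIB) DEV(TAP01)
    /* On any CPF escape message from the save, tell the operator */
    MONMSG MSGID(CPF0000) EXEC(DO)
        SNDMSG MSG('Backup of MYLIB failed. Check the job log.') +
               TOUSR(*SYSOPR)
    ENDDO
ENDPGM
```

A message in the system operator queue still has to be noticed by someone, which is why a paging or email product remains the stronger option.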

Issue #4: Not auditing your backup results

Sometimes shops set up backup jobs and never check to see whether the job is actually backing up everything they want. I always recommend that people save the joblogs from their backup jobs and then review those joblogs occasionally, just to ensure that their backup is working correctly.

It’s easy to save any job’s joblog. All you have to do is add the following Display Job Log command (DSPJOBLOG) to the end of your backup job stream:

DSPJOBLOG OUTPUT(*PRINT)

This will output all the commands and related messages from the current job to a spooled file that can be reviewed or printed later. If you prefer to output the job information to an output file that can be reviewed or manipulated by another process, you can send the job information to a file by running DSPJOBLOG like this:

DSPJOBLOG OUTPUT(*OUTFILE) OUTFILE(library_name/file_name)

You don’t have to check the joblog every day, but you may want to review one of your joblogs every month or two to ensure that nothing has changed in your backup procedure.
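How much detail lands in the joblog depends on the job’s message logging settings. As a sketch, assuming your backup runs as a CL program called from a job stream, the following raises the logging level before the backup starts. The program name MYLIB/BACKUPPGM is hypothetical, the logging values shown are one reasonable choice rather than a requirement, and a CL program created with its own logging setting may override the job-level flag.

```cl
/* Log all messages with second-level text, and log CL commands */
CHGJOB LOG(4 00 *SECLVL) LOGCLPGM(*YES)

/* MYLIB/BACKUPPGM is a hypothetical backup program */
CALL PGM(MYLIB/BACKUPPGM)

/* Spool the joblog so it can be reviewed later */
DSPJOBLOG OUTPUT(*PRINT)
```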

Issue #5: Not making sure your tape drive is available after IPLing your system

Unlike other systems that start with a ‘W’, IPLing (rebooting) an i5, iSeries, or AS/400 is an event that doesn’t happen every day, or even every week or every month. As a result, you may be surprised at what does or does not come back up after an IPL. Make sure that your backup drives are up and available after an IPL, because it isn’t a given that your drives will automatically be available and ready to go. In particular, there are two specific situations I’ve run into regarding backup drives and IPLs.

First, your backup device may not be set to automatically vary on after an IPL. This can be checked by looking at the Online at IPL parameter (ONLINE) of your backup media device. If this parameter is set to *YES, the system should automatically vary on the drive after an IPL. If ONLINE is set to *NO, the drive will come up in a varied off state after an IPL and it will not be available until you manually vary it on again. You can check the ONLINE setting of your backup drive by running the following Display Device Description command (DSPDEVD):

DSPDEVD DEVD(drive_name)

If ONLINE is set to *NO, you can change the setting by running the Work with Device Descriptions command (WRKDEVD) as follows:

WRKDEVD DEVD(drive_name)

On the Work with Device Descriptions screen that appears, type option 2 (Change) in front of your drive name and change the ONLINE parameter to *YES. Note that your media drive device may need to be varied off in order to make this change.
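The same change can also be made from the command line. This is a sketch under a few assumptions: TAP01 is a hypothetical device name, the device is a stand-alone tape drive (a tape media library would use CHGDEVMLB instead of CHGDEVTAP), and no job is using the drive when you vary it off.

```cl
/* TAP01 is a hypothetical device name; use your own */
VRYCFG CFGOBJ(TAP01) CFGTYPE(*DEV) STATUS(*OFF)
CHGDEVTAP DEVD(TAP01) ONLINE(*YES)
VRYCFG CFGOBJ(TAP01) CFGTYPE(*DEV) STATUS(*ON)

/* Confirm that the device now shows as varied on */
WRKCFGSTS CFGTYPE(*DEV) CFGD(TAP01)
```

Running the WRKCFGSTS check after each IPL is also a quick way to confirm that the drive actually came back.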

The second situation affecting IPL availability involves media drives that are controlled by an i5, iSeries, or AS/400 input/output processor (IOP) that can be shared and moved between partitions. Sometimes on an IPL, a movable IOP may actually migrate back to the original partition it was attached to before it was moved. If that happens and your tape drive moves with the IOP to another partition, you may need to locate the partition that now holds your media drive IOP and return it to its proper location by using your system’s Hardware Management Console (HMC) software. If you’re not familiar with using the HMC to configure your partitions, you may have to bring in technical support who can modify and save your configuration for you. This situation is a little more unusual, but I have seen it happen on my own systems with movable IOPs.

About Our Testing Environment

All configurations described in this article were tested on an i5 box running i5/OS V5R3. However, most of these commands and features are also available on i5/OS V5R4 and on most earlier OS/400 releases (V4R5 and below) running on AS/400 and iSeries machines.
