Admin Alert: A Checklist For Monitoring Your IBM i Environment

April 3, 2013 Joe Hertvik

IBM i administration has elements of old-time system operations and real-time monitoring. You need to ensure that everything is working correctly and that problems aren’t silently developing that can interfere with customer processing, violate customer service-level agreement (SLA) requirements, or create audit violations. This week, let’s flesh out a checklist of items to help IBM i admins achieve these goals.

The Essential Piece

To catch trouble early, I highly recommend that you set up an IBM i monitoring system that detects developing situations and alerts you via email or text when a problem is occurring. A monitoring system is critical for lights-out operations. Without it, it’s very difficult to catch many of the items I’ll be mentioning here. There are several common IBM i system monitoring products you can use for automated error messaging, including:

Bytware MessengerConsole
CCSS QSystem Monitor
Halcyon Software IBM i (i5/OS, System i, iSeries, AS/400) Monitoring, Scheduling & Automation Software
Help/Systems Robot/ALERT
SEA absMessage

Contact these vendors to determine the best products for your system.

The Checklist

In some of my previous articles, I discussed how to set up an IBM i monitoring system, as well as items you should be automatically monitoring on your system. A list of these articles is included in the Related Stories section at the bottom of this article. In one article, I suggested that your monitoring system should send out alerts when the following seven situations occur.

Long-running batch jobs.
Excessive number of jobs in job queues.
Jobs that should be running, but aren’t.
Critical lines, controllers, or devices that aren’t active.
IP interfaces not active.
Interactive users using a large amount of CPU.
Interactive response time spiking.

Building on this list, I also recommend that you monitor your IBM i partitions for these six additional items I’m reviewing today.

Disk space utilization above 90 percent.
Software problems reported in the Work with Problems display (WRKPRB).
Monitoring QSYSOPR and other message queues for inquiry messages related to application programs.
Ensuring that your daily, weekly, and monthly backup jobs complete normally.
Monitoring and reordering consumable items, including special forms, printer cartridges, and ribbons for critical system printers.
Monitoring for replication errors on your high availability solution.

Some of these items can be automatically monitored with alerts sent out by your system monitoring package. Others you may have to monitor the old-fashioned way: by physically checking each item and keeping a log. Together, these 13 items form a good starter checklist for any shop implementing an IBM i monitoring system.

Let’s look at each of the six new items and see why you should be monitoring for them.

Situation #8: Disk space utilization above 90 percent

IBM i systems traditionally do best when disk space utilization is under 90 percent. Once utilization breaches 90 percent, the system can start behaving erratically. In a worst case scenario, your disk can fill up and crash the system. Passing this threshold may also signal that an interactive or batch job is looping and filling up disk space with excessive file records or spooled files. No matter what the cause, you’ll definitely want to know when this situation is occurring.

By default, IBM i sets a storage threshold value of 90 percent for auxiliary storage pools. When disk utilization passes 90 percent, the following CPF0907 message is sent to the system operator message queue (QSYSOPR).

CPF0907 - Serious storage condition may exist. Press HELP

Depending on how serious the storage overflow condition is, you may also see these messages show up in QSYSOPR.

CPF0908 - Machine ineligible condition threshold reached 
CPF0909 - Ineligible condition threshold reached for pool &1

All of these are serious messages, and I recommend that you set up your monitoring software to look for them. They are defined in the QCPFMSG message file in the QSYS library.

In my shop, we changed the ASP storage threshold from 90 percent to 85 percent. We did this to give us more time to react before a runaway job fills up disk storage and crashes the system. You can change your ASP storage threshold values in the Start System Service Tools (STRSST) menu. To find the process for changing ASP storage thresholds, check out this older article on protecting your system from critical storage errors.
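If your monitoring package lets you script its checks, the two-threshold idea can be sketched roughly as follows. This is only an illustration, not how any particular product works: the function names are hypothetical, and it assumes your tool can hand you the ASP percent-used figure and the message ID as plain values.

```python
# Message IDs named in this article that point to a storage problem.
STORAGE_MESSAGE_IDS = {"CPF0907", "CPF0908", "CPF0909"}


def storage_alert_level(percent_used, early_warning=85.0, system_default=90.0):
    """Classify ASP utilization (hypothetical helper).

    early_warning  -- the lower threshold a shop might choose (85 here)
    system_default -- the default IBM i ASP storage threshold (90)
    """
    if not 0.0 <= percent_used <= 100.0:
        raise ValueError("percent_used must be between 0 and 100")
    if percent_used > system_default:
        return "critical"
    if percent_used > early_warning:
        return "warning"
    return "ok"


def is_storage_message(message_id):
    """Return True if the message ID is one of the storage messages above."""
    return message_id.strip().upper() in STORAGE_MESSAGE_IDS
```

The gap between the two thresholds is the reaction time you are buying yourself before the system reaches its default limit.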

Situation #9: Software problems reported in the Work with Problems (WRKPRB) display

When a system issue occurs, the IBM i operating system will usually send a message with a severity of 80 or above to the QSYSOPR message queue. You should configure your system monitoring software to automatically send out an alert when it sees one of these messages.

But in some situations, a system problem report or resolution can also be written to the system problem log without necessarily sending out an alert. Some items such as an automatic PTF download may be reported in the problem log without a message written to QSYSOPR.

So on a monthly basis, you may want to check whether any items in your IBM i problem log need attention. You can view the problem log by running the Work with Problems (WRKPRB) command from a command line.

Situation #10: Monitoring QSYSOPR and other message queues for inquiry messages related to application programs

You definitely want to catch any inquiry messages requiring a response that are sent to the QSYSOPR message queue. To do this, you can generally set up your monitoring software to look for QSYSOPR inquiry messages with a message severity of 99.

Severity 99 will catch all inquiry messages in QSYSOPR, but you will want to refine the rule to ignore certain classes of severity 99 messages. This includes any severity 99 messages that come from jobs running in the QSPL subsystem, as these are printer messages sent when a printer is out of paper, when forms need to be loaded, etc. Printer messages are not critical messages that need to be sent to a technician monitoring the system. So ignore them.

But printer messages may not be the only messages you’ll want to ignore. As you’re setting up your monitoring system, you’ll quickly discover which severity 99 messages can be safely ignored and which ones need to be tended to.

For programming errors on systems with a lot of RPG programs, you may want to start monitoring for certain classes of inquiry messages with a severity level roughly greater than 50. These message IDs start with the following characters.

RN*
LBE*
RPG*
CBE*

Again, you’ll want to experiment with which messages to monitor for and which messages to ignore for your particular system. But these are valuable inquiry messages to monitor for in a traditional IBM i environment.
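To make the filtering rules in this section concrete, here is a rough sketch of how they might be expressed if your monitoring tool allowed scripted rules. Everything here is an assumption for illustration: the QueueMessage record, its field names, and the should_alert function are hypothetical and do not come from any vendor's product. Only the QSPL exclusion, the severity values, and the four message ID prefixes come from the text above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Message ID prefixes from this article for program-error inquiry messages.
PROGRAM_ERROR_PATTERNS = ("RN*", "LBE*", "RPG*", "CBE*")

# Subsystems whose severity 99 inquiry messages are ignored (printer messages).
IGNORED_SUBSYSTEMS = {"QSPL"}


@dataclass(frozen=True)
class QueueMessage:
    """Hypothetical record for one message read from a message queue."""
    message_id: str
    severity: int
    is_inquiry: bool
    subsystem: str


def should_alert(msg, program_error_severity=50):
    """Decide whether an inquiry message deserves an alert."""
    if not msg.is_inquiry:
        return False
    # Severity 99 inquiries always alert, except printer messages from QSPL.
    if msg.severity == 99:
        return msg.subsystem.upper() not in IGNORED_SUBSYSTEMS
    # Otherwise alert only on program-error prefixes above the chosen severity.
    matches_prefix = any(
        fnmatchcase(msg.message_id.upper(), pattern)
        for pattern in PROGRAM_ERROR_PATTERNS
    )
    return matches_prefix and msg.severity > program_error_severity
```

As you learn which messages are noise on your system, you would extend the ignore set or adjust the severity cutoff rather than rewrite the rule.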

Situation #11: Ensuring that your daily, weekly, and monthly backup jobs complete normally

You’re probably already doing this, but you’ll want to double-check that your backups are completing normally and that all objects are properly backed up. Depending on how your monitoring system is configured, it may send up a flag if an object is skipped because it’s in use.

In certain regulatory and auditing environments, there may be a requirement to document that backups completed normally. So also consider whether you need to monitor and document completed backups.
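If you do need to document backups, one simple approach is to write a one-line record per backup run. The sketch below is only an assumption about what such a record might hold; the BackupResult fields and the audit_line format are hypothetical, and you would fill them from whatever your backup programs or monitoring tool actually report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupResult:
    """Hypothetical summary of one backup run."""
    job_name: str
    ended_normally: bool
    objects_not_saved: int = 0  # e.g. objects skipped because they were in use


def backup_needs_attention(result):
    """A run needs review if it ended abnormally or skipped any objects."""
    return (not result.ended_normally) or result.objects_not_saved > 0


def audit_line(run_date, result):
    """Format one line for a backup audit log."""
    status = "ATTENTION" if backup_needs_attention(result) else "OK"
    return f"{run_date} {result.job_name} {status} not_saved={result.objects_not_saved}"
```

A run that ends normally but skips objects is still flagged, which matches the concern about objects being skipped because they are in use.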

Situation #12: Monitoring and reordering consumable items, including special forms, printer cartridges, and ribbons for critical system printers

Outside of electronic monitoring, you may want to set up a system to ensure that you order consumable items before they run out. Examples might be specially printed forms for invoices and shipping tickets, packing labels, printer cartridges and ribbons, and other items needed for critical processes, such as sending orders to customers. And if you’re still using tape media, don’t forget to inventory your tape library and order more tapes when you get low.

Situation #13: Monitoring for replication errors on your high availability solution

If you’re running certain types of IBM i high availability software, you may have to define which libraries and objects are replicated to your target system. In this case, you will want to audit for replication errors and for IBM i libraries that are not being replicated to your target. Many of these packages offer audit features that allow you to quickly locate replication errors and omissions. If you’re not auditing your replication environment on a regular basis, you may find you are missing key objects when it’s time to switch over to your backup machine.
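If your high availability package doesn’t offer an omissions audit, the basic check is a comparison of two lists. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and it assumes you can export the library names on the source system and the library names defined for replication as plain lists.

```python
def find_unreplicated(source_libraries, replicated_libraries, excluded=()):
    """Return source libraries that are neither replicated nor deliberately excluded.

    Names are compared case-insensitively with surrounding blanks removed.
    """
    def normalize(names):
        return {name.strip().upper() for name in names}

    missing = (
        normalize(source_libraries)
        - normalize(replicated_libraries)
        - normalize(excluded)
    )
    return sorted(missing)
```

Keeping an explicit exclusion list means that every library is either replicated or intentionally left out, so anything the function returns is a genuine omission to investigate.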

More To Come?

Keep in mind that this is a starter monitoring list, and you will need to add items that fit your specific situation. I tried to hit the most common items, but if you find something else that should be added to the list, please feel free to email me with your suggestions.

Reader Request: BRMS Expertise Needed for IFS Incremental Backup

After publishing my recent article on incremental IFS backups, reader Michael Lindley checked in on the joehertvik.com website with the following question about expanding my incremental backup routine to make it usable with IBM’s Backup, Recovery, and Media Services (BRMS) licensed program.

Good article on the IFS incremental backup. Any suggestions on how to incorporate this idea within BRMS? I have looked for this option with the BRMS service within OPS Navigator, but I cannot see anything within my back control groups.

Since my shop uses custom-written backup programs along with the occasional GO SAVE option 21 Full System Backup, I don’t use BRMS. So to help Michael out, I’m throwing this question out to my readers (i.e., you). If anyone knows how to apply the Time period for last change (CHGPERIOD) parameter from the green-screen Save Object (SAV) command to BRMS processing, please email me and I’ll publish any valid replies in a future Admin Alert column.

Follow Me On My Blog, On Twitter, And On LinkedIn

Check out my blog at joehertvik.com, where I focus on computer administration and news (especially IBM i); vendor, marketing, and tech writing news and materials; and whatever else I come across.

You can also follow me on Twitter @JoeHertvik and on LinkedIn.

Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.