Which Job Is Filling Up My System Storage?

August 21, 2013

Hey, Joe

Every so often, our system storage breaches its threshold level, and we have to find the job that is filling up storage before it crashes our system. This is usually a race against time. Do you have any tips on how to quickly find runaway jobs that are gobbling up storage?

–Mike

I feel your pain on this one. Hitting the threshold can be a recurring event in some shops, and it’s always better to find out about the problem sooner rather than later.

Here’s how I handle this issue.

Set Your ASP Threshold At A Reasonable Level

On a newly installed IBM i partition, the default auxiliary storage pool (ASP) threshold is set at 90 percent. When your storage usage goes over 90 percent, a message is sent to the QSYSOPR message queue indicating that the ASP is almost full. But by then it can be too late, as the system becomes more unstable as storage rises above 90 percent. When the system runs out of available disk for processing, your partition will automatically shut down.

So the first step is to check your ASP storage threshold and see if it’s at a comfortable level to give you enough warning if storage starts filling up unexpectedly. On IBM i 6.1, you can set the ASP storage threshold either through the system’s Dedicated Service Tools (DST) or the System Service Tools (SST). Follow these steps to get to and modify your ASP storage threshold in SST.

1. Start SST by typing in the Start System Service Tools (STRSST) command. Sign in with the correct service tools user ID and service tools password for your system.
2. Off the System Service Tools (SST) menu, take the following options: a) 3=Work with disk units; b) 2=Work with Disk Configuration; c) 3=Work with ASP threshold.
3. The Select ASP to Change Threshold screen will appear. Type a 1=Select in front of the ASP threshold value you want to change.
4. Note the current threshold value for the ASP. It may still be at the system default of 90 percent. If you want to change it to a lower value for earlier warnings (say 80 percent or 85 percent), type a different value into the New Threshold field and press Enter.

This process changes the threshold value at which the system sends an alert to the system operator message queue (QSYSOPR). Lowering it will give you more time to look for runaway jobs that are filling up storage.
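
To get a feel for how much extra time a lower threshold buys, here is a rough back-of-the-envelope sketch in Python. The function name, the 2,000 GB ASP size, and the 1 GB per minute growth rate are made-up figures for illustration, not measurements from a real system, and the sketch assumes storage grows at a steady rate.

```python
def minutes_until_full(capacity_gb, threshold_pct, growth_gb_per_min):
    """Estimate minutes between the threshold alert and a 100 percent full ASP.

    Assumes a constant growth rate, which a real runaway job may not follow.
    """
    if growth_gb_per_min <= 0:
        raise ValueError("growth rate must be positive")
    # Storage left between the alert point and a completely full ASP.
    headroom_gb = capacity_gb * (100 - threshold_pct) / 100
    return headroom_gb / growth_gb_per_min


# Hypothetical 2,000 GB system ASP filling at 1 GB per minute.
for pct in (90, 85, 80):
    print(pct, minutes_until_full(2000, pct, 1.0))
```

With these assumed numbers, a 90 percent threshold leaves about 200 minutes before the ASP is full, while an 80 percent threshold leaves about 400.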

Enlist Your System Monitoring Software To Help

Most system monitoring packages have special alerts that can send email or text messages when an unusual system storage issue occurs. In the monitoring package that I use, I can set up the following disk storage alerts to notify my staff when these conditions occur:

A secondary threshold alert that fires whenever fewer than a specific number of unused megabytes are left on my disk storage, or when the amount of available storage has fallen under a certain level (for instance, 10 percent to 15 percent of unused storage left). This is the reverse of the system ASP threshold, which is triggered by how much storage is used rather than by how much is available.
A spike alert for system ASP usage. The package can send an alert if storage usage has increased by a certain percentage or number of megabytes over the last hour, day, week, or month.

By doing this, my monitoring software helps me determine when the system is misbehaving and may be ready to overflow its disk drives. It gives me a better chance to find a growing issue before it becomes a system-threatening problem.
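
The two kinds of alerts described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how any particular monitoring package works; the function name, the default limits, and the sample readings are all assumptions.

```python
def storage_alerts(samples, min_free_pct=15.0, max_rise_pct=5.0):
    """Return alert strings for a list of (label, used_pct) samples, oldest first.

    min_free_pct: alert when unused storage drops below this percentage.
    max_rise_pct: alert when usage rises by more than this many percentage
                  points between the first and last sample in the window.
    """
    if not samples:
        return []
    alerts = []
    first_used = samples[0][1]
    last_label, last_used = samples[-1]

    # Free-space alert: based on storage available, not storage used.
    free_pct = 100.0 - last_used
    if free_pct < min_free_pct:
        alerts.append(f"{last_label}: only {free_pct:.1f}% of storage is free")

    # Spike alert: based on growth across the sampling window.
    rise = last_used - first_used
    if rise > max_rise_pct:
        alerts.append(f"{last_label}: usage rose {rise:.1f} points in the window")
    return alerts


# Hypothetical readings of system ASP usage taken within one hour.
readings = [("09:00", 78.0), ("09:20", 81.5), ("09:40", 86.0)]
for alert in storage_alerts(readings):
    print(alert)
```

With these sample readings, both alerts fire: free storage is down to 14 percent, and usage has climbed 8 points within the window.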

So be sure to check your own system monitoring software for similar capabilities and, if you can, activate disk monitoring to get an early warning when problems develop.

Finding The Rogue Job

Once you get a message that system storage usage is growing at an alarming pace, it’s time to start hunting for the out-of-control job. Many times, the culprit is a user query, a looping job that’s generating too many spooled files or job logs, or an output file that’s getting too big.

The best tool for hunting for out-of-control jobs is the old-fashioned green-screen Work with Active Jobs (WRKACTJOB) command. WRKACTJOB shows all the jobs running on your system, and its sort capability will help you find jobs that may be misbehaving. Here’s my drill for using WRKACTJOB to locate and end jobs that may be running wild.

1. Bring up the WRKACTJOB screen.
2. Press the F14 key (Shift+F2) to include any suspended or disconnected jobs, which WRKACTJOB does not display by default. Sometimes a suspended job can cause a disk overflow situation.
3. Move the cursor under the CPU % column on the WRKACTJOB screen, and press the F16 key (Shift+F4) to sort the display so that the jobs using the highest CPU percentage appear first. Check these jobs to see if they are a) writing excessive records; b) writing excessive spooled files, especially job logs; or c) looping or disconnected from the system.
4. Determine if any of your candidate jobs from step 3 may be filling up the system. If so, go to step 7.
5. If you can’t find the offending job after you sort by CPU %, press the F11 key once to show the Elapsed Data view of WRKACTJOB information. This screen shows the elapsed number of operator interactions for each job since the statistics were last reset; each job’s average response time over that interval; the elapsed number of auxiliary I/O operations for each job; and the elapsed CPU percentage used.
6. Move your cursor under the AuxIO column in the WRKACTJOB Elapsed Data view, and press the F16 key to sort all the jobs by elapsed auxiliary I/O. The jobs at the top of the sorted list have performed the most I/O operations during the elapsed interval. Check out any jobs with excessive I/O that may be filling up storage.
7. End all jobs that look like they might be filling up storage. If the job is filling up storage through excessive spooled files, you may also have to go into the job and delete all the spooled files associated with it. Once you find and end your runaway job, you should start to see your disk usage percentage decrease dramatically. It may take a few minutes, but after you kill the offending job, your system storage should go back to its normal disk usage.
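
The two sorts in this drill amount to ranking the same job list by different columns. The Python sketch below illustrates the idea on hand-typed data; the Job class, the resequence function, and the job names and figures are all made up for the example and are not pulled from a real WRKACTJOB display.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpu_pct: float  # stands in for the CPU % column
    aux_io: int     # stands in for the AuxIO column on the Elapsed Data view


def resequence(jobs, column, top_n=3):
    """Return the top_n jobs sorted by the given column, highest first."""
    return sorted(jobs, key=lambda job: getattr(job, column), reverse=True)[:top_n]


# Hypothetical sample data.
jobs = [
    Job("QZDASOINIT", 3.1, 950_000),
    Job("NIGHTLYRPT", 41.7, 12_000),
    Job("QPADEV0004", 0.4, 300),
    Job("USRQUERY", 12.5, 2_400_000),
]

print([j.name for j in resequence(jobs, "cpu_pct")])
print([j.name for j in resequence(jobs, "aux_io")])
```

In this sample, the CPU sort puts NIGHTLYRPT on top, while the auxiliary I/O sort surfaces USRQUERY, which is why it is worth checking both views before deciding which job to end.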

These techniques should help you get an early warning when your disk drives fill up, and then identify the jobs that are causing problems.

HTH

Follow Joe Hertvik on His Blog, on Twitter, and on LinkedIn

Check out Joe’s blog at joehertvik.com, where he focuses on computer administration and news (especially IBM i); vendor, marketing, and tech writing news and materials; and whatever else he comes across.

You can also follow Joe on Twitter @JoeHertvik and on LinkedIn.

Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.
