Admin Alert: Alternate Ways to Ensure a Subsystem Ends
February 4, 2004 Joe Hertvik
In “Three Common Mistakes in CL Administration,” I presented a technique for ensuring that a subsystem ends before processing other commands. However, several readers quickly spotted some errors in that technique and wrote in with corrections and suggestions on how to make that routine better.
To start with, I originally provided the following piece of CL code to end a subsystem, to copy and clear a file that is locked by that subsystem, and then to restart the subsystem:
    ENDSBS     SBS(SUB1)
    MONMSG     MSGID(CPF0000)
    DLYJOB     DLY(300)
    CPYF       FROMFILE(LIB/CRITICAL) +
                 TOFILE(LIB/CRITICALBK) MBROPT(*REPLACE) +
                 CRTFILE(*YES)
    MONMSG     MSGID(CPF0000)
    CLRPFM     FILE(LIB/CRITICAL)
    STRSBS     SBSD(SUB1)
    MONMSG     MSGID(CPF0000)
With this code, I relied on the Delay Job (DLYJOB) command to give the subsystem enough time to end before performing critical post-subsystem processing. However, readers thought that using DLYJOB to ensure the subsystem had ended was an inefficient way to perform this function, and they offered some alternatives that worked better. Here are some of their e-mails, along with my comments on their solutions.
Reader Tom Parkinson wrote: “To me, your solution just sidesteps the issue. What you really need is a way to be notified by the subsystem that it is down. Your solution works, but it’s like a Band-Aid: The cut is still there, and the Band-Aid protects it, so the cut can heal naturally.”
Reader Elvis Budimilic added: “While using DLYJOB in this situation works, it may not work reliably. There are times when a subsystem may take a very long time to end, or it may never end due to some very obstinate job. A stubborn job could nullify the intended function of DLYJOB, then you’re back to the initial problem you were trying to solve.”
Tom and Elvis are correct. The DLYJOB code is effective only if all of the jobs in the subsystem end before the job delay ends. If they don’t, we’re back to the original problem: How do you ensure that a subsystem ends before executing special processing that is dependent on the subsystem being down?
Fortunately, our readers are good souls who, in addition to spotting a problem, also offer solutions for the gaps they see. In this case, they offered several different sets of pseudo-code and working code that help guarantee that a subsystem is down before continuing processing. Here are the solutions they provided. I tested and refined each solution using an OS/400 V5R1 machine, and they work much better than the DLYJOB solution I covered above.
Solution 1: Issue multiple ENDSBS commands until OS/400 tells you your subsystem has finished.
This one came from Elvis Budimilic, and the concept is simple. Issue the ENDSBS command, wait a predetermined amount of time (60 seconds, in this case), then issue the command again. Repeat as needed. On one of these repeated ENDSBS commands, OS/400 will tell you that the subsystem has ended by issuing one of the following severity 40 messages:
    CPF1003 – Subsystem &1 not active
    CPF1054 – No subsystem &1 active
Where &1 is an OS/400 variable containing the name of the subsystem you’re checking.
Using this idea, you could run the following code to ensure your subsystem is down before performing post-subsystem processing and restarting the subsystem:
    PGM
    RETRY:    ENDSBS     SBS(SUB1)
              MONMSG     MSGID(CPF1003 CPF1054) EXEC(DO)
                CPYF       FROMFILE(LIB/CRITICAL) +
                             TOFILE(LIB/CRITICALBK) MBROPT(*REPLACE) +
                             CRTFILE(*YES)
                MONMSG     MSGID(CPF0000)
                CLRPFM     FILE(LIB/CRITICAL)
                STRSBS     SBSD(SUB1)
                MONMSG     MSGID(CPF0000)
                GOTO       CMDLBL(SUCCESS)
              ENDDO
              DLYJOB     DLY(60)
              GOTO       CMDLBL(RETRY)
    SUCCESS:  ENDPGM
In this example, all the post-subsystem-end code is contained in the DO statement that is kicked off by the CPF1003 or CPF1054 message. If those messages aren’t found, the program will continually loop until OS/400 tells it that the subsystem is finally ended.
Elvis also suggested modifying this code to add a retry counter. With a retry counter, you can have the code perform a certain action, such as notifying an operator or paging a technician, if the subsystem doesn’t end within a specified period of time.
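Elvis's retry counter idea might look something like the following. This is only a sketch and not code that Elvis submitted: the &RETRIES variable, the limit of 10 attempts, the ENDED and DONE labels, and the choice of SNDMSG to the system operator are all assumptions made for illustration.

```cl
PGM
             /* &RETRIES and the limit of 10 are assumed values */
             DCL        VAR(&RETRIES) TYPE(*DEC) LEN(3 0) VALUE(0)

 RETRY:      ENDSBS     SBS(SUB1)
             /* Either message means the subsystem is down */
             MONMSG     MSGID(CPF1003 CPF1054) EXEC(GOTO +
                          CMDLBL(ENDED))

             /* Subsystem is still active: count this attempt */
             CHGVAR     VAR(&RETRIES) VALUE(&RETRIES + 1)
             IF         COND(&RETRIES *GE 10) THEN(DO)
                /* Give up and notify the operator (assumed action) */
                SNDMSG     MSG('Subsystem SUB1 did not end after +
                             10 attempts.') TOUSR(*SYSOPR)
                GOTO       CMDLBL(DONE)
             ENDDO

             DLYJOB     DLY(60)
             GOTO       CMDLBL(RETRY)

             /* Post-subsystem processing, as in the code above */
 ENDED:      CPYF       FROMFILE(LIB/CRITICAL) +
                          TOFILE(LIB/CRITICALBK) MBROPT(*REPLACE) +
                          CRTFILE(*YES)
             MONMSG     MSGID(CPF0000)
             CLRPFM     FILE(LIB/CRITICAL)
             STRSBS     SBSD(SUB1)
             MONMSG     MSGID(CPF0000)

 DONE:       ENDPGM
```

With a 60-second delay and a limit of 10, this version gives up after roughly 10 minutes and skips the copy, clear, and restart steps entirely, leaving the decision about the stuck subsystem to the operator.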
Solution 2: Allocate the subsystem description object to ensure the subsystem has ended.
Not to be outdone, readers Dan Stephens and Ben Hines offered variations on the same theme. Their solution for ensuring that a subsystem has ended before performing post-subsystem processing is to attempt to allocate the subsystem description exclusively. Once you can obtain an exclusive lock on the subsystem description, you know the subsystem has ended. Here are two pieces of code they submitted. I modified and tested each program to perform this task.
Dan’s code performed a simple loop to check when the program is able to exclusively allocate the subsystem object. When the program cannot allocate the object, it issues one of the following two messages and then loops back to try the allocation again:
    CPF1002 – Cannot allocate object &1
    CPF1085 – Objects not allocated
Once the subsystem description is exclusively allocated, the code performs the post-subsystem-ending processing, deallocates the subsystem description, and restarts the subsystem.
    PGM
              ENDSBS     SBS(SUB1) OPTION(*CNTRLD) +
                           ENDSBSOPT(*NOJOBLOG)
              MONMSG     MSGID(CPF1003 CPF1054)
    LOOP:     ALCOBJ     OBJ((LIB/SUB1 *SBSD *EXCL)) WAIT(60)
              MONMSG     MSGID(CPF1002 CPF1085) EXEC(GOTO +
                           CMDLBL(LOOP))
              CPYF       FROMFILE(LIB/CRITICAL) +
                           TOFILE(LIB/CRITICALBK) MBROPT(*REPLACE) +
                           CRTFILE(*YES)
              MONMSG     MSGID(CPF0000)
              CLRPFM     FILE(LIB/CRITICAL)
              DLCOBJ     OBJ((LIB/SUB1 *SBSD *EXCL))
              STRSBS     SBSD(SUB1)
    SUCCESS:  ENDPGM
Ben’s code is a little more complicated but offers greater flexibility and reporting. In Ben’s version, the subsystem name and library are passed into the program as variables, so that you can easily use this program to shut down and restart different subsystems. In this scenario, the program checks for message CPF1002 (“cannot allocate object &1”) when trying to exclusively allocate the subsystem description. If it cannot allocate the description, it sends a program message to the user and loops around to try the allocation again. If it can allocate the description, it sends a different program message to the user, deallocates the subsystem description, and restarts the subsystem after the post-subsystem processing completes.
    PGM        PARM(&SBSD &SBSLIB)
              DCL        VAR(&SBSD) TYPE(*CHAR) LEN(10)
              DCL        VAR(&SBSLIB) TYPE(*CHAR) LEN(10)
              ENDSBS     SBS(&SBSD) OPTION(*CNTRLD) +
                           ENDSBSOPT(*NOJOBLOG)
              MONMSG     MSGID(CPF1003 CPF1054)
    LOOP:     ALCOBJ     OBJ((&SBSLIB/&SBSD *SBSD *EXCL))
              MONMSG     MSGID(CPF1002) EXEC(DO)
                SNDPGMMSG  MSGID(CPF9898) MSGF(QCPFMSG) +
                             MSGDTA('SUBSYSTEM' *BCAT &SBSD *BCAT +
                             'IS ACTIVE')
                GOTO       CMDLBL(LOOP)
              ENDDO
              SNDPGMMSG  MSGID(CPF9898) MSGF(QCPFMSG) +
                           MSGDTA('SUBSYSTEM' *BCAT &SBSD *BCAT +
                           'IS NOT ACTIVE')
              CPYF       FROMFILE(LIB/CRITICAL) +
                           TOFILE(LIB/CRITICALBK) MBROPT(*REPLACE) +
                           CRTFILE(*YES)
              MONMSG     MSGID(CPF0000)
              CLRPFM     FILE(LIB/CRITICAL)
              DLCOBJ     OBJ((&SBSLIB/&SBSD *SBSD *EXCL))
              STRSBS     SBSD(&SBSLIB/&SBSD)
              ENDPGM
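Because Ben's program takes the subsystem name and library as parameters, one compiled copy can serve several subsystems. As a rough illustration, assuming the program were compiled under the hypothetical name MYLIB/RESTARTSBS, a call for the SUB1 subsystem description in library LIB might look like this:

```cl
/* MYLIB/RESTARTSBS is a hypothetical program name. */
/* Parameter order matches the PGM statement:       */
/* subsystem name first, then its library.          */
CALL       PGM(MYLIB/RESTARTSBS) PARM('SUB1' 'LIB')
```

Note that the file names in the CPYF and CLRPFM commands are still hard-coded in Ben's version, so only the subsystem being ended and restarted changes from call to call.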
As you can see, there’s more than one way to stop a subsystem. All of these techniques work; it’s just a matter of choosing the one that works best for you.