Controlling System i Shutdown Activities Using an Intelligent Power-Handling Program, Part I
October 10, 2007 Hey Brian
It happened. We had our system about a year and we had a power crash over the weekend. Yes, we have a UPS, but that does not mean that when the system was starved for power it did not come crashing down, and things got worse after that. When we got the UPS, the boss decided that we would get the version that kept the system up for eight minutes instead of getting the extra battery for 37 minutes. We set the delay timer (QUPSDLYTIM) at two minutes. How is it that we crashed so hard?

From this encounter, we got damaged objects all over the place with the following message:

Message ID: CPF8111. Message: Partial damage on member XXXX.

And there was more to this message. Can you imagine how this scared us? I called IBM and they were very helpful. In fact, the recovery to be able to use these objects again was right in the message above and thankfully it worked. The good old CPYF was a godsend. We recreated those that we could using CPYF and we restored those that we could not from the backup tape. We know we lost data but we are not sure what at this point. We will be rechecking yesterday’s work to see what was fully processed and what was not. At least we are back up and running and the folks are not standing around pointing at us anymore.

I thought we were covered by putting in the UPS and setting the values. What happened?

–Elle

Hi Elle,

The short answer is that your battery died before your system powered down. That has just about the same effect as somebody pulling the system’s plug out of the wall when there is no UPS, or pulling it out of the UPS when there is one in place.

Fortunately, there is an intelligent power warning feature on the system that is programmable. A user-written program can read information from the UPS, quiesce jobs (wind down the work running on the system), and call for an orderly shutdown based on how much time is available on the batteries.
It requires a special wire to be connected in addition to the power cord, and a program needs to be pirated and modified or built from scratch. IBM even provides sample code. But even without this special code, there are a few system values that, if set correctly and if the UPS is communicating properly with your AS/400 or System i5, can effect a semi-orderly shutdown.

Unfortunately, in your scenario, no amount of programming and no amount of intelligent system value setting would have helped. You simply ran out of power. You were wise in setting the value QUPSDLYTIM low (at two minutes), but the battery did not hold enough energy to sustain the two minutes of operations plus the time needed to power down.

The first suggestion I would make is to order an extended battery pack and speak to the UPS vendor to make certain you understand how much time the upgraded UPS will give you. In the meantime, you can perform the required tests on your existing UPS to get a better idea of how much time you actually have at your disposal. Note that you must wait until your UPS charges fully before these tests will tell you anything useful.

I would expect you will find that, over time, the UPS batteries lose a significant part of their capacity. This happens very often, far more often, in fact, than it should. Just as even a new car battery eventually needs to be replaced, your UPS batteries need the same treatment when they no longer hold their charge. You may have had a number of minor power issues lasting a few seconds that this UPS handled for you, but once power went out completely, it was a different story.

Until you have verified your UPS capabilities, set the QUPSDLYTIM system value that governs all this to no more than 10 seconds. That way, you can continue to ride out these brief power issues, and the battery should still have enough charge to let the system come down in a more orderly fashion than an abrasive plug pull.
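The timing constraint described above (the battery has to cover the delay timer plus the power-down itself) comes down to simple arithmetic. The sketch below is a hypothetical Python illustration, not part of any IBM or Powerware tooling. The function names, the four-minute power-down time, the one-minute safety margin, and the six-minute runtime for an aged battery are all assumptions; substitute the figures you measure on your own system.

```python
def shutdown_budget(battery_runtime_s, ups_delay_s, powerdown_s, margin_s=60):
    """Seconds of battery left after the delay timer, the power-down, and a safety margin."""
    return battery_runtime_s - (ups_delay_s + powerdown_s + margin_s)


def fits_on_battery(battery_runtime_s, ups_delay_s, powerdown_s, margin_s=60):
    """True when the whole shutdown sequence finishes before the battery is exhausted."""
    return shutdown_budget(battery_runtime_s, ups_delay_s, powerdown_s, margin_s) >= 0


if __name__ == "__main__":
    POWERDOWN_S = 4 * 60  # assumed time for the power-down to complete

    # Nominal eight-minute battery with a two-minute delay timer.
    print(fits_on_battery(8 * 60, 120, POWERDOWN_S))
    # Assumed aged battery that now lasts six minutes, same two-minute delay.
    print(fits_on_battery(6 * 60, 120, POWERDOWN_S))
    # Same aged battery with the delay cut to 10 seconds.
    print(fits_on_battery(6 * 60, 10, POWERDOWN_S))
```

With these assumed numbers, only the aged battery combined with the two-minute delay fails the check, which is the situation a short delay setting is meant to avoid.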
For the long term, however, as Scotty from Star Trek would say, you simply need more power. Next time, I’m sure you won’t let management supply inadequate UPS battery capabilities, and you’ll set your QUPSDLYTIM value so that you will be able to sustain that delay without your system crashing. I would also recommend that you look into writing an intelligent power monitoring program so that if you have a power hit that your UPS cannot withstand, you can more ably shut things down in the system to help ensure a better startup. Those damaged objects sure are scary.

Follow-Up Note

Hey Again Brian:

So, the program you are talking about is over and above the QUPS system values? I had thought that UPS monitoring was a standard feature of the operating system. We do have the cable and we have it plugged in and I think that it works, but I do not know for sure. How can I set the machine up so that this is less likely to happen again? Can you give me an idea of how to write this power handling program?

–Elle

Hi Elle,

Yes, the software I mention in my first reply is above and beyond the system values. However, the system values do offer you some protection as long as you set them correctly. But again, you must set the system values so that enough battery time remains after a power hit for the system to shut down without crashing, and preferably in a very intelligent fashion.

After making sure that the power warning cable is installed properly and functioning, you can write or modify a CL program to work with the uninterruptible power supply. It can be tailored to your specific system requirements. For example, additional recovery can be added to monitor error conditions specific to your system, software, and data. Other programs can be called from this program. For instance, a second user-written program can perform the steps necessary to prepare for a normal system shutdown. These steps may include holding job queues, sending messages, and ending subsystems.
The program should also restart normal operations should the power outage end before the system is powered down. An intelligent power handling program provides a better way of managing the power: you can end the functions that need to be terminated, rather than having the system fail before a critical job finishes or having that job end abruptly, as happened in your case. Because the batteries could not sustain the load any longer, you ended up with damaged objects.

Let’s first see how you can solve this problem with the system values and then we’ll pursue how you can tackle it more intelligently with a power monitoring program.

The first system value that we will examine is called QUPSDLYTIM, for “uninterruptible power supply (UPS) delay time.” With no programming, once you are warned that power is out, the system will continue to operate normally for the period of time you specified as the delay in this system value, usually anywhere from two to 20 minutes. When a power warning is communicated to the system via the special cable, a little alarm clock is set for the system to wake up again after the amount of time specified in QUPSDLYTIM (the value itself is expressed in seconds). When the time expires, the alarm clock goes off and the system will do an automatic PWRDWNSYS. Case closed. If there is enough time for the PWRDWNSYS to complete, you should not have damaged objects.

Based on the setting of another value, QPWRRSTIPL (“automatic IPL after power restored”), the system will automatically IPL if power is restored. This may not be such a good idea. If power is fluctuating, you may be resuscitating a system with a dead UPS, so power experts suggest you let sleeping dogs lie, so to speak. If your system conks out because of power, do not permit it to immediately try to come back up if power is restored. Set QPWRRSTIPL to “0” so your system does not automatically IPL without you being able to take some action during that IPL or shortly after.
More importantly, by doing it this way, you get to make certain that power really is restored and your battery is charged rather than trusting that all is well. The last thing you need is to crash a system that just went down to avoid a crash. One thing is for sure: your UPS will not have had any time to charge, so the second crash would be more destructive than the first.

It seems to me that when you used up the two minutes, your eight-minute UPS did not have enough time to finish the power down cleanly and the system failed crudely. Depending on the load you were taking from the UPS during the power incident, if the UPS matches the load, you should get your eight minutes (or close to it). Since you did not, consult your power manuals and learn how to check the batteries. Also, take a look to see if anything else is mistakenly plugged into the UPS, like perhaps a Mr. Coffee. I kid you not, even the smartest among us have oversights that are comical in retrospect.

That is the short answer. Now, I have a long answer for you, and I actually built some code that can get you started in tackling this nightmare avoidance project. I am going to give you a general idea of how the program can handle your needs both here and in Part II of this article, which will appear next week. I’ll demonstrate some sample code that you can use to get your system prepared to run with an effective power-monitoring, power-handling program. I’ll show you a nifty test program to check all this without having to turn power off, and I’ll show you an even better test program that demonstrates how you can set the testing environment to meet your needs. We’ll also touch on hardware since it is important, but let’s first discuss the notion of a power-handling program.

Power-Handling Program

A power-handling program, when used with power protection devices, can minimize interruption during a power loss situation.
Power protection devices such as a UPS help provide energy to the AS/400 or System i5 when utility power is temporarily interrupted. The energy that is provided helps prevent system functions from ending abruptly thereby creating an abnormal system termination and the possibility of lost or damaged data. The controlled shutdown mechanism is tuned through programming to help the system power down as smoothly as possible, minimizing adverse impacts on startup. The following steps need to be taken to get a power handling program functional on your system:
Getting the Hardware Right

In the information that Elle provided, she noted that her UPS is a Powerware 9910-P15 model. According to Powerware’s site, this model comes with an expected battery life of eight minutes. Unfortunately, Elle already found this did not work. After making certain the UPS is working properly by speaking with the Powerware support group, the next to-do is to purchase a battery extender to increase your battery life to at least 37 minutes. When upgrading your UPS, follow the directions to install as given below and in your Powerware documentation to be sure everything works from the get-go. Visit Powerware’s UPS offering Web page for some useful information.

For those not using Powerware (which is IBM’s default UPS), I suggest you contact your UPS vendor for the specifics on your model. Powerware is so popular with System i that the information in this article will offer a great deal of help to the novice implementer.

New Powerware UPS units come from the plant unprepared to talk to the System i5. When your business partner orders the equipment correctly, your package does include a necessary accessories kit, but you must use it to complete the installation. This kit contains a separate logic card that must replace the communications card in the standard Powerware UPS. You must replace this card even before you plug in your UPS. This is necessary for the power warning connection process. The System i and the UPS communicate over the cable that connects to this logic card.

The kit parts are well marked but sometimes the cables are not. When you open the kit, be sure that you mark the cables to indicate their source. If things do not appear right, you may have to call IBM and/or Powerware to find out if you have the correct cables. One thing is certain: no amount of battery power will make up for not having the right power warning cable connection.
Depending on the UPS model that you are getting from Powerware, you will want to consult the IBM manuals that pertain to UPS and System i. The Powerware (9910-P15 or 9910-P33) Installation Guide for IBM Applications is the best manual for understanding how to install the Powerware UPS used in most i5s.

If you have a new machine, the UPS verification process will be easier since you won’t have to worry about users and the possibility of bringing the machine down unexpectedly. If the machine is not new, however, pick a time when you can be alone on the system before you finish the hardware part of this installation.

Before you plug the System i side of the cable into the System i, make sure that all units are powered off. Before you begin moving cables around, if your System i is already plugged into the UPS, power it down gracefully to start this process. Also, power down the UPS if it is not coming right from the box.

Because your UPS and most other Powerware boxes are factory installed with a “single-port card,” which is the correct communication card for most other types of machines, you more than likely have some work to do before you can begin. The goal is to make this unit communicate with your System i. Though it may be theoretically possible to accomplish this via the power cord, this is not ideal. Also, be sure to unplug the System i when it is powered down.

As you read the installation guide, you will find that you must change the single-port card to the “relay serial card,” which should be bundled with the cable kit that you received. None of this work should be done while the UPS or the system is plugged in. After you change the card in the UPS and connect the cables (one supplied in the accessories kit and the other by IBM), you are on your way. The IBM cable is about 1 foot long.
It is called the “AS/400 interface cable.” The other cable is about 6 feet long and it is called the “UPS interface cable.” It would be nice if just one cable did the trick, but in new 5XX machines, for example, we’re left wanting. The two cables must be screwed together in the middle. The IBM Powerware manual shows diagrams of the process so you won’t think that you are doing something silly.

When all of the cabling is ready and no units are plugged into any power source, begin your installation. If you are adding one or several battery units, install the batteries first. The Powerware UPS manual shows you how to connect these units so that you will be operating on full battery power. If you do this incorrectly, then all on your own, without a bottle and without a genie, you can magically turn a 37-minute UPS into an eight-minute UPS. So, read the directions carefully.

When the UPS is assembled, connect the already coupled power warning cable to the System i and to the relay serial card (or equivalent) in the UPS. Then, plug your system into the UPS, and plug the UPS into the wall. Check the charging level of the UPS. Your Powerware manual may provide diagnostic tests you can perform while observing the lights as directed by the manual. If something seems awry, contact Powerware before you trust your System i to the UPS. In my experience, I have found the Powerware support people to be as helpful as the IBM support personnel.

If the UPS is just out of the box, or if you just changed or added batteries, or if you just are not sure, power up your AS/400 for operations and come back another day. By the time you come back, the batteries in the UPS should be charged well enough that you can test the connection to make sure it is working properly.

Once you have given the UPS enough time to charge, you are ready to test the installation. Before you do anything else, get out your Powerware manual and make sure the lights indicate that the unit is fully charged.
The first step is to verify that you have made the connections properly and that the UPS is installed correctly. You may feel strange about doing this, but the best way to find out if the UPS works is not to simply wait for a power blip. Instead, you need to shut off the power to your UPS. Pulling hot plugs out is really not the best way to test. While in most cases this works fine, to be safe, you should find the circuit breaker (your System i should be on its own) and turn off the breaker. So, when I say, “pull the plug to test the UPS,” please know that I mean to turn off the circuit breaker.

When you have executed this physical test, the messages in the logs proving that the connection worked are your best assurance that you are OK from a UPS to System i communications perspective. The messages you should see in the system operator’s message queue are:

System utility power failed at Timestamp

These messages will more than likely be in the queue. In the unlikely event that they are not, you are not communicating, and that too can explain a system crash. It may be cables, it may be the UPS communications card, or it may be that the system was plugged in but down when you connected the cables. Go through the instructions one more time, making sure that you start again from an unplugged state, before you call support. If it still does not work and you have already verified that the UPS is operating correctly with the Powerware or other UPS support team, then it’s time to give the IBM support folks a call. They will help you make sure that you are connected to the correct IBM port and that the cables are working properly.

When you are sure your hardware is communicating, there is only one true way of determining how long your UPS can sustain power to your system. We are assuming now that you have done the basics and the coffee pot is not on the UPS lines.
Set your QUPSDLYTIM system value to a reasonable number (at least five minutes less than your expected amount of battery time), bring the System i to a restricted state (ENDSYS), and let the QUPSDLYTIM timer bring down your system. Voila!

Testing capacity

Because testing capacity is an overlooked aspect of UPS installation, I have expanded the instructions for this section. Please read carefully.

Testing a UPS is a pretty simple task. Really! Just throw the circuit breaker and clock its drain time. Nobody will say this is the safest way to test, but it is almost fool-proof. Make sure that the Powerware units (the UPS and extra battery) have been allowed to charge for the period of time recommended by their manufacturer and then some. Before you set the value for the intelligent program, it would be good to see how long the unit will last at a time when you are not going to hurt anything (and, again, when nobody is on the system).

Change the QUPSDLYTIM system value to 45 minutes x 60 seconds = 2,700 seconds so that the timer outlasts the expected battery life and the test really does exercise the battery. Start a stop watch or check the time. Place the system in a restricted state and throw the circuit breaker. Theoretically, you will get 37 minutes, and the system will power down from not having enough juice right at 37 minutes.

When I performed this test, I ran in restricted state and the system lasted through the timer, and then this System i Model 520, with four 70 GB drives, took four minutes to actually power down. In this case it had more than the 37 minutes of capacity, but to be safe, don’t count on that much time for the long haul or when your disks are really clanging.

If you get your 45 minutes, the QUPSDLYTIM alarm clock will fire and it will bring the system down. That is a good situation. Because we are in a restricted state, most changed pages will already have been written from memory to disk, and this is a reasonably safe test.

Most experts suggest that it is also a good idea to periodically test a UPS and its failure modes.
A good time to do this would be right after a backup. This is even safer than just being in the restricted state. Nobody is logged in and you’ve got a good current backup of the machine. Throw the circuit breaker with the UPS on it to simulate an outage and see how the transition goes.

As noted above, a number of UPS vendors suggest that testing a UPS by pulling the plug from the wall is not a good idea. Most UPS units like to have a good idea of what ground looks like. It is likely that unplugging just about any UPS for a short amount of time would not be too dangerous (don’t take my word for it, though), but in all cases, throwing a circuit breaker would be a better thing to do.

Without this test, you may be depending on a base UPS without knowing what it can deliver, and any power program you write will fail if the UPS does not have a reasonable amount of power. You actually do need to know what to expect in terms of power sustainability in order to use the power properly.

When you have all of this working as you would expect it to work, you can read Part II in next week’s Four Hundred Guru to see if an intelligent power handling program is your best bet.
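The timer arithmetic used in the tests above can be sketched in a few lines. This is a hypothetical Python illustration only; the function names are invented for the example, and the five-minute reserve is simply the rule of thumb given earlier in the article. It does not touch the system; it only converts minutes to the seconds that QUPSDLYTIM expects.

```python
def minutes_to_seconds(minutes):
    """Convert a duration in minutes to seconds."""
    return minutes * 60


def max_delay_seconds(expected_runtime_min, reserve_min=5):
    """Largest delay that still leaves `reserve_min` minutes of battery for the power-down.

    Returns 0 when the battery cannot even cover the reserve.
    """
    usable_min = max(expected_runtime_min - reserve_min, 0)
    return minutes_to_seconds(usable_min)


if __name__ == "__main__":
    # Capacity test: a 45-minute timer that outlasts the expected 37-minute battery.
    print(minutes_to_seconds(45))
    # Everyday setting for a 37-minute battery, keeping five minutes in reserve.
    print(max_delay_seconds(37))
    # The same rule applied to a base eight-minute battery.
    print(max_delay_seconds(8))
```

Once the capacity test has told you how long the battery really lasts, feed that measured figure into the calculation rather than the number printed on the box.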