Lessons from Southwest’s IT Debacle
January 11, 2023 Alex Woodie
During the historic Christmas Blizzard of 2022, air traffic came nearly to a standstill across the United States. Most major airlines rebounded before the ice fully melted, but Southwest Airlines was unable to get its Boeing 737s back up in the air. The main reason cited may have a familiar ring to it for IBM i shops: An overreliance on outdated, legacy technology.
Not all the details are known about what went on at Southwest Airlines during the week that followed the winter storm that decimated the eastern half of the country for several days, starting on Christmas Eve. But enough information has slipped out that a rough outline can be drawn of the slow-rolling disaster.
The trouble started when Winter Storm Elliott hit the United States on December 24, causing predictable travel delays. As temperatures plummeted and snow piled up, it wreaked havoc on airplanes, airports, and the people who make air travel possible. Jet fuel gummed up, wings turned frosty, and jet bridges froze together.
As temperatures gradually warmed over the Christmas holiday, most airlines slowly got rolling again, and passengers and crew made their way home. But Southwest was unable to recover from that initial and sudden halt, and ultimately was forced to cancel nearly 17,000 flights, stranding hundreds of thousands of passengers.
One of the main culprits in the breakdown was an outdated crew scheduling system. Called Sky Solver, the application was designed to match airline crews with flights in the Southwest network. The application reportedly was designed to handle up to 300 scheduling changes per day. However, during crunch time in late December, the volume of transactions was much higher than that, and the application essentially crashed, forcing the airline to manually build flight schedules.
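To make the capacity problem concrete, here is a minimal sketch of how a scheduling intake could defer work once it passes its design limit instead of failing outright. It does not describe how Sky Solver actually works. The `ChangeIntake` class, its method names, and the deferral behavior are all hypothetical, and only the 300-changes-per-day figure comes from the reporting above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ChangeIntake:
    """Hypothetical intake for crew scheduling changes with a daily capacity."""

    daily_capacity: int = 300  # design limit reported for Sky Solver
    accepted: list = field(default_factory=list)
    deferred: deque = field(default_factory=deque)

    def submit(self, change: str) -> bool:
        """Accept a change if capacity remains; otherwise hold it for later."""
        if len(self.accepted) < self.daily_capacity:
            self.accepted.append(change)
            return True
        # Over the limit: keep the request instead of failing outright.
        self.deferred.append(change)
        return False

    def utilization(self) -> float:
        """Fraction of the daily capacity used so far."""
        return len(self.accepted) / self.daily_capacity

    def start_new_day(self) -> None:
        """Reset the day and replay deferred changes in arrival order."""
        backlog = list(self.deferred)
        self.accepted.clear()
        self.deferred.clear()
        for change in backlog:
            self.submit(change)
```

In a sketch like this, the `utilization()` figure is something operators could watch, which would make a surge visible as a growing backlog before it turns into an outage.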
Southwest, now the largest airline in the United States, gradually got back on track during the week between Christmas and New Year’s, and today it’s operating normally. However, the experience has exposed a major weakness in the company’s operations, one that is expected to lead to the public company taking an estimated 3 percent to 5 percent hit on fourth quarter earnings. The total cost – including compensation for stranded travelers – will be $725 million to $825 million, the company said last week in a regulatory filing. However, the damage done to Southwest’s reputation can’t be measured.
The airline is sure to take a hard look at Sky Solver. The application was originally developed by a company called Caleb Technology Corp. (since bought by GE Aviation), but Southwest heavily modified it over the years. Employees with the airline had expressed concerns about the aging application before Winter Storm Elliott hit. For example, members of the airline’s pilots union say it would send them on circuitous routes that made no sense, flying as Southwest passengers to other airports to pick up assignments, a practice called “deadheading.”
“We call on SWA to recode Sky Solver to respect and avoid known fatigue scenarios,” Jon Weaks, president of the Southwest Airlines Pilots Association, wrote back in 2017.
The need to rebuild or replace Sky Solver was evident, but management was too focused on minimizing cost and didn’t prioritize a replacement, members of the airline’s pilots union have said. The company did, however, invest $500 million in implementing a new reservation system in recent years, according to the December 28 Wall Street Journal article “How Southwest Airlines Melted Down.” Bob Jordan, who took the helm of Southwest in February 2022, has been more supportive of legacy modernization than his predecessor, members of the pilots union have said. But he was clearly unable to make the changes in time to avoid the disastrous holiday season.
If there’s a silver lining in this storm cloud, it’s that Southwest’s pain with Sky Solver can be a lesson to other companies facing similar situations with aging technology.
Miten Marfatia, an IT modernization expert and the CEO of California-based application modernization software provider EvolveWare, says Southwest’s troubles can be informative for companies that run older software, as many IBM i shops do.
“Legacy systems have generally been in production for 10-plus years, and more often than not, for over 30-40 years,” Marfatia says. “At the time of development, the volume of data that required processing [at Southwest] was significantly smaller than what it is today, especially in a fast-growing digital world. Software that worked for the volume of data and environment for which it was written will begin to experience systemic failures as these parameters change. It seems that these failures had begun to surface quite some time ago, but apparently the management chose not to pay the attention it deserved.”
Another takeaway is this: As tedious as modernization initiatives tend to be, they are necessary exercises to prepare the company for the real-world conditions it will inevitably face.
“[T]hese initiatives have become imperative and if action is not taken, the cost of ‘kicking the can down the road’ can be devastating,” Marfatia continues. “For Southwest, there is no option but to review this software with the highest priority and take whatever corrective action is necessary. It would not be surprising if Southwest decided to analyze their entire software application portfolio in 2023. This would not only avoid increased federal reviews of its IT policies, but would also better serve their customers and staff.”
Eric Kimberling, the CEO of Third Stage Consulting, a Colorado-based firm that advises companies on digital transformation, recently shared four takeaways from Southwest’s IT debacle.
The first item on Kimberling’s list is to pay attention to the risks that come with operating substandard, outdated technology.
“Southwest has a culture of low cost. They are a low-cost, low-fare airline,” Kimberling says in a video posted to Twitter. “That is their business model, to minimize cost, so it’s understandable that they’re probably not going to invest heavily in anything. But in this case, I think what the company is going to find is that they’re going to spend a lot more money fixing this problem and dealing with the fallout of having bad systems . . . than if they would have simply upgraded their technology.”
Southwest’s strong growth was another factor in the episode. The airline has more than doubled its revenues over the past 10 years, and profits have been particularly strong since COVID-19 restrictions were lifted. However, the company’s underlying systems appear unable to keep up with that growth, Kimberling says.
“This was partially a technology problem. The technology was limited, heavily customized, a lot of changes with the system itself. But there were also operational issues that created this problem,” he said. “They had limitations with their current processes, and they didn’t seem to recognize that, or didn’t seem to quantify what those risks were. But what Southwest seems to have failed to do is really define what their future state operating model needs to be for the growth that they have achieved and the growth that they continue to experience.”
Another factor is Southwest’s freewheeling culture. If you’ve ever flown with the airline, you’ve likely heard the captain cracking jokes on the public address system (the flight attendants are also prone to saying things like “Welcome to beautiful Jacksonville, Florida” after landing in Oakland, California).
While the airline can keep the fun, easy-going culture and the growth at the same time, the fun part can’t come at the expense of ensuring operational integrity, Kimberling says.
“Southwest is a sizable company now where they need to be focusing more on adding to the recipe of their business,” he says. “So in other words, not abandoning that freewheeling culture and that entrepreneurial spirit, but starting to inject more structure and efficiency and scale into the organization. That was a big miss in my opinion, that cultural shift, and that cultural transition that Southwest should have made by now, and they probably will have to make that transition now given the magnitude of the problem they just experienced.”
While Southwest needs to take a close look at the Sky Solver system, the airline should not feel a need to rip out and replace all of its technology just to mollify its critics, Kimberling says.
“A lot of pundits and industry analysts and certainly software vendors in the industry will use this as an opportunity to say Southwest should just totally overhaul all their operations, all their technologies, and replace it all. Start from scratch, put in brand new technology, put in a big ERP system or whatever the case may be,” he says. “I don’t think Southwest necessarily needs to do a massive overhaul of all their systems right now.”
A better approach is to take a close look at the most pressing needs and react accordingly, he says.
“They can look at ways that they can make strategic investment in their technology,” Kimberling says. “In other words, they don’t need to go through a massive digital transformation that’s going to cost them hundreds of millions of dollars and impose a huge amount of additional risk to the organization, which just experienced the risk they just did. But what they can do is say let’s really prioritize our technological needs and where the pain points are, and start to attack those now and get some quick wins.”
In addition to a real-world lesson on the need for application modernization, Southwest Airlines also just got a crash course in crisis management, but that is a story for a different publication.
Alex, you have some reference to Southwest’s unique scheduling practice but make it sound like it’s caused by legacy software. It is not. Software implements Southwest’s business practice. The news covers that well. This technical article does not.
It is questionable whether any software written even yesterday in whatever script kiddies wouldn’t call legacy would do anything different. The scheduling practice pushes flights and crews from location to location versus the rest of the industry’s practice of spoke and hub scheduling. The only way Southwest could recover was to stop everything and start over. Software can’t make up for a business practice that does not calculate in nationwide days long shutdowns to traffic.
Even news articles managed not to hand wave at legacy software for this problem, explaining the underlying business practice involved in Southwest scheduling. This is yet another time I have to address blaming legacy software here on ITJungle. Most recently it was the blame on legacy software for unemployment payments during Covid layoffs that idiot politicians were making even when the problem in Florida was “modern” web software. The blame on COBOL in New Jersey and elsewhere was just as ignorant and unfounded.
It would be nice to get more correct, nuanced coverage of software problems involving legacy systems here despite universal hatred of legacy systems that run our entire business and government infrastructure, much better and more cost effective than whatever is being touted and billions spent trying, unsuccessfully, to replace with it.
regards – a grumpy legacy programmer whose IBM i RPG software is doing very well, thank you
I would love to know what the NOTAM system is and what it is written in and what it is running on…. No one seems to know how to ask such questions. I would have liked more detail on whatever.
Bad software is bad software. There is never any excuse for that, and there are corner cases in all software that happen every blue moon. But lack of resilience is an issue in both cases, and that, as far as I am concerned, is what the lesson is. How could both have been avoided? In the case of Southwest, it looks like the whole way of scheduling does not work on that corner case.