Should Spark In-Memory Run Natively On IBM i?
November 6, 2017 Alex Woodie
There’s a revolution happening in the field of data analytics, and an open source computing framework called Apache Spark is right smack in the middle of it. Spark is such a powerful tool that IBM elected to create a distribution of it that runs natively on its System z mainframe. Will it do the same for its baby mainframe, the IBM i?
So, what is Apache Spark, and why should you care? Great questions! Let’s introduce you to Spark.
Spark came out of UC Berkeley’s AMPLab, where it began as a research project in 2009, to provide a faster and easier-to-use alternative to MapReduce, which at that point was the primary computational engine for running big data processing jobs on Apache Hadoop. While Spark has a learning curve of its own, the Scala-based framework has not only replaced Java-based MapReduce, but also eclipsed Hadoop in importance in the emerging big data ecosystem.
Spark is useful for developing and running all sorts of data-intensive applications, including familiar programs like ETL jobs and SQL analytics, as well as more advanced approaches like real-time stream processing, machine learning, and graph analytics. This versatility, along with its familiar DataFrame construct and well-documented APIs for developers working in the Java, Scala, Python, and R languages, has fueled Spark’s meteoric rise in the emerging field of big data analytics.
IBM took notice of Spark several years ago, and has since worked on several fronts to help accelerate the maturation of Spark on the one hand, and to embed Spark within its various products on the other, including:
- ML for z/OS, which executes Watson machine learning functions in a Spark runtime on the mainframe’s System z Integrated Information Processor (zIIP) specialty engine.
- Integrated Analytics System, which combines Spark, Db2 Warehouse, and its Data Science Experience, a Jupyter-based data science “notebook” for data scientists to quickly iterate with Spark scripts.
- Project DataWorks, which brings Spark and Watson analytics together on the Bluemix cloud.
- Open Data Analytics for z/OS, a runtime that combines Spark, Python, and the Anaconda package of (mostly) Python-based data science libraries.
- And Spark running directly on its Bluemix cloud.
And considering that IBM opened a Spark Technology Center in 2015, it’s safe to say that IBM is quite bullish on Spark. (That’s a major understatement, actually.) But perhaps the most interesting data point for this discussion came in 2016, when Big Blue launched its z/OS Platform for Apache Spark, which is a native distribution of Spark for the System z mainframe.
Native Spark On The Mainframe
IBM received kudos for the work from various industry insiders who participated in a video on the z/OS Platform for Apache Spark webpage. Among those singing IBM’s praises was Bryan Smith, the former CTO and VP of R&D at Rocket Software.
“IBM did a really good job in porting Apache Spark to z/OS,” Smith says. “They could have just done a very simple port. But they didn’t. They didn’t cut any corners. They really exploited the underlying hardware architecture. They’re using specialty engines. They’re using the hardware compression facilities. They’re able to leverage the 10 TB of memory that you have on a z13 machine and the . . . processors, so you can actually run those Apache Spark clusters on z/OS.”
Another software vendor that appreciates having Spark running natively on z/OS is Jack Henry & Associates, the Missouri banking software developer that also has a fairly big IBM i business.
“The pain point to us is getting the data out to our customers,” Todd Hill, Jack Henry’s director of card processing, says in the video. “Currently we have data on the mainframe. We have a distributed stack for across many types of applications. What Apache Spark does for us is to keep your data centralized in the one location. So instead of moving all that data off from multiple platforms into other applications, I can run Apache Spark directly on the mainframe, at low cost, and get it built out, and get the data to the people that need it.”
Mike Rohrbaugh, zSystem lead for Accenture, says having Spark on the mainframe helps by automating the generation of intelligence and reducing the complexity. “It’s just so simple to bring the analytics engine back to the data to do intelligent automation,” he says in the video.
IBM i Versus The Mainframe
So how does this relate to the question in the headline of this story? For starters, let’s compare the similarities and differences between the IBM i and the z/OS mainframe platforms.
First, the similarities. Both the IBM i server and the z/OS mainframe are relied upon to run transactional applications that are core to the businesses that use them. Both of them are used to store structured data that’s arguably the most critical data for the businesses that use them. They both store data in the EBCDIC format, and are heralded for best-in-class reliability and security. They also both run proprietary operating systems as well as open OSes like Linux, mostly utilize older languages (RPG and Cobol, respectively), and sport text-based interfaces that use the 5250 and 3270 datastreams, respectively.
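To make the EBCDIC point concrete, here is a small Python sketch. It assumes code page 037 (Python’s built-in `cp037` codec), which is one common EBCDIC variant; the article does not say which code pages a given shop uses, and the sample text is made up for illustration. The same characters map to different bytes than they do in ASCII, which is one reason data typically has to be converted when it moves from either platform to a separate analytical system.

```python
# Illustration only: cp037 is one common EBCDIC code page, assumed here.
text = "IBM i"  # made-up sample value

ebcdic_bytes = text.encode("cp037")  # EBCDIC, code page 037
ascii_bytes = text.encode("ascii")   # ASCII, for comparison

# Show both byte sequences as hex so the difference is visible.
print("EBCDIC:", ebcdic_bytes.hex(" "))
print("ASCII: ", ascii_bytes.hex(" "))

# Decoding with the matching codec recovers the original text.
round_trip = ebcdic_bytes.decode("cp037")
print("Round trip:", round_trip)
```

Reading EBCDIC bytes with the wrong codec would yield garbage characters, so any pipeline that moves this data elsewhere has to know the source code page.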
Now, the differences. Mainframes have their own processor type, while IBM i runs on the more popular Power processor. The mainframe stores data in many different data stores (Db2 for z, copy books, etc.), while most IBM i data is stored in Db2 or IFS. Demographically, mainframe customers tend to be the largest companies in the world, whereas IBM i has a bigger installed base among small and midsized businesses. There’s also a large concentration of mainframes in banking, insurance, and healthcare, whereas IBM i has a stronger foothold in manufacturing, distribution, and retail.
IBM i and mainframes are strong transactional systems, and are less known for their analytical prowess. However, data analytics are becoming increasingly important in this day and age, especially as part of a company’s digital transformation strategy. The pundits often say that all companies will need data analytics strategies to effectively compete in the coming decades. That’s probably a bit of an exaggeration, but only for the timing.
The question, then, becomes where this analytical processing is going to take place. Today, most mainframe and IBM i shops offload it to another system. It’s fairly common for users of both mainframes and IBM i servers to set up elaborate workflows to move data from the “big iron” transactional systems to dedicated analytical systems, including massively parallel processing (MPP) column-oriented systems like Teradata, Netezza, or Vertica. With the advent of Apache Hadoop clusters running on commodity X86 processors, many companies started experimenting with Hadoop computing, which invariably introduced them to the in-memory Spark framework.
IBM wants to keep those analytic workloads on the mainframe if at all possible, which is why it made Spark run natively. This not only keeps costs down for its customers, but it also makes the mainframe more “sticky” and lessens the urgency to migrate data and workloads off its biggest cash cow.
The question, then, is whether IBM sees similar dynamics at play for the average IBM i user. Mainframe customers, owing to their size and tendency to be in financial services, are early adopters of new technologies, like Spark. They’re arguably closer to the cutting edge than the average IBM i shop, and the dollars at stake for each mainframe client are much larger.
It’s safe to say that IBM i members of the Large User Group (LUG) probably more closely resemble their mainframe brethren, and could benefit from having a powerful, cutting-edge tool like Spark running natively on the IBM i. They’re more apt to have a bigger investment in separate analytical environments, be it a Teradata machine or a Hadoop cluster. They’re also more likely to have some data science Skunk Works project running somewhere in their shop, and are more likely to already be running Spark in Linux, which is where it was originally developed to run.
Spark On IBM i
While Spark may not be on the radar of the average IBM i shop yet, folks within IBM are starting to ask questions about whether Spark will impact the IBM i installed base, and if it’s going to be important to them, how it ought to be introduced. If the company is planning to support Spark natively on IBM i, the company isn’t saying publicly, which is not surprising.
What we do know, however, is that IBM executives are at least talking about the prospect of bringing Spark to IBM i in some way, shape, or form. “It’s part of some discussions,” IBM’s product development manager for Db2 Web Query Robert Bestgen recently told IT Jungle.
There are two general options for bringing Spark to the platform: porting Spark to run natively on IBM i, or running it in a Linux partition on Power Systems. Spark was written in Scala, and therefore can run within a Java virtual machine (JVM), which the IBM i platform obviously runs. It may not be a stretch to get it running there, but there could be other factors that come into play, such as IBM i’s single-level storage architecture, and how that maps to the way Spark tries to keep everything in RAM (but will spill out to disk if needed).
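The keep-it-in-RAM-but-spill-to-disk behavior mentioned above can be pictured with a toy Python sketch. This is not Spark’s code or API: the `SpillingStore` class, its `max_in_memory` limit, the “partition” key names, and the pickle-to-temp-file approach are all assumptions made up for illustration. It only shows the general idea of holding data in memory up to a budget and writing the overflow to disk.

```python
import os
import pickle
import shutil
import tempfile


class SpillingStore:
    """Toy store: keeps items in RAM up to a limit, then spills to disk.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory  # assumed budget, counted in items
        self.memory = {}                    # key -> value held in RAM
        self.on_disk = {}                   # key -> path of the spilled file
        self.spill_dir = tempfile.mkdtemp(prefix="spill_")
        self._file_counter = 0

    def _discard(self, key):
        # Remove any existing copy so each key lives in exactly one place.
        self.memory.pop(key, None)
        path = self.on_disk.pop(key, None)
        if path is not None:
            os.remove(path)

    def put(self, key, value):
        self._discard(key)
        if len(self.memory) < self.max_in_memory:
            self.memory[key] = value        # still room in RAM
        else:
            # Memory budget exhausted: serialize the value to a temp file.
            path = os.path.join(self.spill_dir, f"{self._file_counter}.pkl")
            self._file_counter += 1
            with open(path, "wb") as f:
                pickle.dump(value, f)
            self.on_disk[key] = path

    def get(self, key):
        if key in self.memory:
            return self.memory[key]         # fast path: served from RAM
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)           # slow path: read back from disk

    def close(self):
        # Delete the temporary spill directory and everything in it.
        shutil.rmtree(self.spill_dir, ignore_errors=True)


store = SpillingStore(max_in_memory=2)
for n in range(4):
    store.put(f"partition-{n}", list(range(n * 3, n * 3 + 3)))

in_memory_keys = sorted(store.memory)
spilled_keys = sorted(store.on_disk)
print("In RAM: ", in_memory_keys)
print("On disk:", spilled_keys)
```

Callers of `get` never need to know where a value lives, which is the point of the design. How a scheme like this would interact with a single-level storage architecture is exactly the sort of open question the paragraph above raises.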
Should the Spark port be native? “Depends on who you talk to,” Bestgen said.
The widely held thinking within IBM is that the Linux route makes more practical sense – if Spark is to come to IBM i at all (which, as far as we know, hasn’t been decided). “If you back up [and look at it] from an IBM i perspective, IBM would say that IBM i is part of the Power Systems portfolio, or what we call Cognitive Systems now,” Bestgen says. “For Power Systems, those platforms [like Hadoop and Spark] tend to run best on a… Linux kind of environment. That’s what folks think about it.”
Few IBM i shops today are even running Linux partitions. According to HelpSystems’ 2017 IBM i Marketplace study, fewer than 8 percent of organizations are running Linux next to IBM i on a Power Systems box, while about 9 percent are running Linux on other Power boxes. AIX’s penetration is about 50 percent higher, for what it’s worth.
There’s a case to be made that IBM i shops are lousy at figuring out how to leverage the wealth of available tools for Linux, even after IBM went through the trouble of supporting little endian, X86-style Linux to go along with its existing support for big endian Linux within Power. “One of the areas that IBM could do a better job selling is saying, you seem to be willing to run Linux on a different platform. Why not run it on the platform that you have in your system now?” Bestgen says.
At the end of the day, there are a lot of unanswered questions, including whether the IBM i installed base needs or wants such a powerful tool as Spark, let alone how it should run. So the answer to the question in the headline is no. “I don’t think we’re there yet in terms of running those things natively on i,” Bestgen says.