Unwinding Python’s Data Science Potential On IBM i
February 13, 2017 Alex Woodie
There’s a revolution occurring in the world of big data analytics and data science at the moment, and Python is playing a starring role. Python is one of the languages that IBM has brought to the IBM i platform, so that’s clearly a good thing for midrange shops. But is enough being done to grow Python’s potential on IBM i? The answer to that question is not clear.
Python was originally conceived by Guido van Rossum as a “hobby” programming language to keep him busy over the 1989 Christmas break. “My office. . .would be closed, but I had a home computer, and not much else on my hands,” the Dutchman wrote, in a passage quoted in the Wikipedia entry on Python. “I decided to write an interpreter for the new scripting language I had been thinking about lately: a descendant of ABC that would appeal to Unix/C hackers.”
He was a big fan of Monty Python’s Flying Circus, so naturally, he called it Python. The interpreted language languished in relative anonymity for years, just one of a growing number of languages developers had to choose from. As the Web took off in the early 2000s, Python’s popularity grew, and it became a regular on TIOBE’s top 10 list of most popular languages. In 2007 and 2010, Python was named TIOBE’s language of the year.
Today, Python sits at number 5 on that popular list, behind high-level stalwarts like Java, C, C++, and C#, and just ahead of other scripting languages like PHP, JavaScript, Perl, and Ruby. But while many of the languages surrounding Python in the standings are used for Web development, Python’s primary strength now comes from a very different source: applied statistical analysis, which is often called data science or big data analytics.
Today, much of the data science work is done in one of two primary languages: Python and R. If you went to college and studied particle physics before sidestepping into the corporate data science world, you’re likely a user of R, which is powerful yet difficult to learn. If you started your data science career as a computer science major, you’re going to know Python, which is also powerful but much easier to learn than R.
Python Proliferates
Today, the center of gravity of the Python data science world is an open source project called Anaconda. Backed by an Austin, Texas-based company called Continuum Analytics, Anaconda was created to unify the large array of Python libraries that were proliferating in the scientific computing community.
Anaconda is the brainchild of Continuum co-founder and chief data scientist Travis Oliphant, who studied and received his PhD in advanced biomedical imaging at the Mayo Clinic in Rochester, Minnesota and later taught computer science at Brigham Young University. Oliphant, who co-founded Continuum with chief technology officer Peter Wang, was the creator of the NumPy package and a founding contributor of the SciPy package, both of which are used by scientists around the world today to manipulate data using Python.
“When I came to Python one of the first things I loved was how quickly I could iterate, because it’s interpreted,” Oliphant said during his keynote address at the inaugural AnacondaCON conference last week in Austin. “I could quickly iterate. I didn’t have to compile, wait for a while, lose my train of thought.”
Oliphant and his Continuum co-founder Peter Wang had the foresight to realize that, despite the power of Python for enabling rapid iteration on data, the proliferation of Python packages threatened to hurt its long-term growth. Other Python-based packages like Pandas, Scikit-learn, and Theano offered solutions for particular parts of the analytic spectrum, but they didn’t play well with others.
That was a huge impediment to developer productivity, says Michelle Chambers, Continuum’s EVP for Anaconda and chief marketing officer. “Let’s say you’re going to install the Scikit-learn package,” she says. “First thing, it says, ‘Thanks for trying, but I’m dependent on 10 different other packages.’ And you have to get the right version and you have to work together and you have to do the build.
“We solved that problem,” Chambers continues. “Until we could get that problem solved, it was always going to be a niche thing.”
Chambers previously worked at Revolution Analytics, which focused on making R easier to use and able to run on distributed systems, and which was acquired by Microsoft about two years ago. Revolution Analytics was never able to solve this problem in the same way Continuum did with Anaconda.
“The [R] packages were all in CRAN already, but the problem is you have multiple versions. You have cross platform compatibility issues. And you have dependency issues,” Chambers says. “We didn’t solve that problem. We couldn’t figure it out.”
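The dependency problem Chambers describes can be illustrated with a minimal Python sketch. It assumes a small, hypothetical dependency graph (the package names are real projects, but the edges are made up for illustration) and works out an install order in which every package comes after the packages it needs. It is not how Anaconda itself resolves packages, and it deliberately ignores versions and platforms, which are the harder parts of the problem she mentions.

```python
# Hypothetical dependency graph: package -> packages that must be installed first.
# The names are real projects, but these edges are illustrative only.
DEPENDENCIES = {
    "scikit-learn": ["numpy", "scipy"],
    "scipy": ["numpy"],
    "pandas": ["numpy"],
    "numpy": [],
}


def install_order(package, graph):
    """Return a list in which each package appears after its dependencies."""
    order = []
    visiting = set()  # packages currently being resolved, to detect cycles

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"circular dependency at {name}")
        visiting.add(name)
        for dep in graph[name]:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(package)
    return order


print(install_order("scikit-learn", DEPENDENCIES))
```

Even in this toy form, asking for one package pulls in others that have to be handled first. A real installer also has to pick mutually compatible versions for every entry in the list.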
Anaconda Rising
Today, the free and open source Anaconda distribution includes more than 720 data science packages for things like interactive data visualizations, machine learning, and deep learning. Data scientists working in fields like financial services, retail, and healthcare use Anaconda products to explore big data sets, develop models that describe behavior, and use algorithms to automatically score live data to automate decision-making in the real world.
Putting all these libraries in one place has resonated with the data science community–in a big way. At the end of 2015, Anaconda had been downloaded 3 million times. By the end of 2016, that number was 11 million. And in just the first 40 days of 2017, that number has shot up to 13 million.
There’s clearly demand for a general purpose, open source, data science platform, says Wang. “What we want is the data scientists to be able to write their scripts, write their notebooks, and move that between all these different platforms,” he says. “That’s our vested interest in this. That’s why we want the Intel distribution to be based on Anaconda, the Microsoft stuff to be based on Anaconda, and the IBM stuff to be based on Anaconda.”
Continuum has already done the work to integrate Anaconda with big data sources like the Hadoop Distributed File System (HDFS) and Amazon Web Services S3 object storage. It works with Hadoop distributors like Cloudera to ensure that Python-based models can be deployed on Hadoop and run in a distributed manner on big Hadoop clusters. (All this is free, by the way. Continuum charges only for enterprise features like security, management, and monitoring that open source hackers aren’t interested in.)
Now Anaconda is moving into the mainframe space. Last week at the AnacondaCON conference, IBM and Continuum announced that they’re working with Rocket Software to get Anaconda running on z/OS.
Anaconda On Big Iron
It’s a continuation of the same type of work that Big Blue is doing with Apache Spark, which is an open source platform designed to make it easier to do machine learning, graph analytics, and stream processing on distributed X86 clusters.
“In the same way that Apache Spark opened the door for big data analytics on the mainframe for Scala and Java programmers, the Anaconda stack of technologies gives that same access to Python programmers,” IBM z Systems Analytics Technology & Architecture Lead Mythili K Venkatakrishnan says in a blog post announcing the work IBM is planning to do with Rocket and Continuum.
“This partnership will help clients running z/OS to get even more out of their transaction data,” Venkatakrishnan says. “Not only can they now analyze data at the place of origin and without the need to know COBOL, but it also opens the mainframe to new workloads [Python, Java, Scala, R] and a new generation of users who before may not have had the skills needed to analyze z/OS data.”
Replace the word “z/OS” with “IBM i” and “COBOL” with “RPG” and then re-read that sentence. Does it make any sense? Do IBM i shops have a need for this kind of data analytic software?
Python already runs natively on IBM i. First there was the iSeries Python project, which we wrote about back in 2012, and then IBM enhanced support for Python 2.7 in IBM i 7.1 TR11 and IBM i 7.2 TR3 in October 2015. (Python 3 is the current standard.)
Just getting native support for Python on IBM i is a good first step to getting Anaconda on the platform. However, things developed for X86 platforms can almost never be moved to IBM’s proprietary, EBCDIC-speaking, big-endian, Big Iron platforms without a little tweaking, massaging, and handholding.
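To make the EBCDIC and big-endian differences concrete, here is a small Python sketch using only the standard library. It assumes code page `cp037` as a stand-in for EBCDIC (the code page actually in use on a given system depends on its configuration) and shows the same string and the same integer laid out the way an X86 box and a big-endian EBCDIC machine would each store them.

```python
import struct

text = "IBM i"

# The same characters map to different byte values in ASCII and EBCDIC.
# cp037 is one common EBCDIC code page; it is an assumption here.
ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

# The same 32-bit integer is stored with opposite byte order.
value = 1
little_endian = struct.pack("<i", value)  # X86 layout
big_endian = struct.pack(">i", value)     # big-endian layout

print("ASCII :", ascii_bytes.hex())
print("EBCDIC:", ebcdic_bytes.hex())
print("little:", little_endian.hex())
print("big   :", big_endian.hex())
```

Code that reads or writes raw bytes while assuming one encoding or one byte order is the kind of thing that needs the tweaking described above. Code that sticks to text strings and explicit conversions like these tends to move more easily.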
The IBM i server is rarely used as an analytic box. Its specialty is transaction processing, and it’s damn good at that, too. However, in the emerging world order, the lines between analytics and transactions are starting to blur. The analytics are running both in front of the transactions (identifying possible new customers, filtering out the fraud) and trailing the transactions (building reports, conducting audits).
The open data science world is exploding, and Python is at the center of it. The mainframe and Anaconda are coming together to enable Python-based analytics on data sitting under z/OS. Should IBM work to make that a possibility for the IBM i too?