Analytics Moves To The Cloud, And IBM i Data Goes With It
March 11, 2020 Alex Woodie
The cloud is changing the face of IT, much to the chagrin of IBM i traditionalists who are accustomed to having full control over their applications and data. Change is always hard, but the good news is that, with a little discipline, the cloud presents a number of new and exciting analytical options for your important IBM i data.
As a transaction processing powerhouse, the IBM i server is accustomed to hosting the most important data a business ever touches, including data about customers and their purchases. On-prem servers still run the lion’s share of online transaction processing (OLTP) workloads, as companies in traditional industries (manufacturing, distribution, retail, banking, insurance – just about everything except Web and cloud-based businesses) are still wary of giving up control of their crown jewels.
But when it comes to analyzing this data – what’s traditionally called online analytical processing (OLAP) – the workloads have shifted dramatically. While there are some IBM i shops that run OLAP workloads on their Power Systems boxes, most companies have offloaded them to separate servers. Workload isolation ensures that OLTP processing continues unimpeded, and it also allows companies to benefit from databases designed specifically for OLAP workloads.
Traditionally, companies moved all of their OLTP data into dedicated data warehouses, where OLAP workloads could find the digital needles in the binary haystacks. Just 10 years ago, it was commonplace for businesses to employ extract, transform, and load (ETL) tools to move data from all their OLTP systems (IBM i data included) into data warehouses, such as those developed by Teradata, HPE Vertica, Oracle Exadata, IBM Netezza, and Microsoft SQL Data Warehouse.
Well, the market for data warehouses and OLAP tools has shifted dramatically in the past 10 years. First, the rise of open source Apache Hadoop spurred a rush to build clusters of standards-based x86 servers, which became data lakes capable of holding hundreds of terabytes or even petabytes of data. Based initially on technology defined by Google, including the Google File System and MapReduce, the Hadoop ecosystem flourished, driven in large part by the open sourcing of innovative technology from Web giants such as Yahoo (Apache Hadoop), Facebook (Apache Hive, Presto), LinkedIn (Apache Kafka), and Twitter (Apache Storm).
While Hadoop still has its place, the high level of technical difficulty in cobbling together dozens of disparate open source products to build a functional data lake has put a serious damper on the commercial Hadoop market. That has forced companies to look for other ways to analyze their data.
What’s happened over the past five years is that the cloud has simply taken over large-scale data analytics workloads. The three major public cloud vendors – Amazon Web Services, Google Cloud, and Microsoft Azure – have built their own data lake services, which have gobbled up many Hadoop workloads from unhappy former Hadoopers. Large companies that spent millions to stand up their own Hadoop infrastructure are still getting value out of those systems, which in some cases offered substantial savings over their previous data warehouses. But the writing on the wall is clear: the bulk of new analytics workloads are headed to the cloud.
So what does this mean for IBM i shops? In many ways, IBM i shops are no different from other businesses. Yes, these companies may run a proprietary midrange server that bears little resemblance to the Linux and Windows systems that are far and away more popular for transaction processing. And IBM i and System z mainframe customers may store data internally in EBCDIC format, as opposed to ASCII format. But from a practical perspective, that poses practically no barrier for IBM i (or System z) shops looking to take advantage of the vast new analytical options available to them.
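To see why the encoding difference is so easy to bridge, consider that modern languages ship with the EBCDIC code pages built in. Below is a minimal Python sketch of a round trip from EBCDIC to UTF-8; the sample bytes and code page 37 are illustrative assumptions, since your files may use a different CCSID.

    # Minimal sketch: decode EBCDIC (code page 37) bytes to a Python
    # string, then re-encode as UTF-8 for cloud-side tools. The sample
    # bytes and code page are assumptions; check your files' CCSID.
    ebcdic_bytes = b"\xc8\x85\x93\x93\x96"   # "Hello" encoded in CP037
    text = ebcdic_bytes.decode("cp037")      # EBCDIC bytes -> str
    utf8_bytes = text.encode("utf-8")        # str -> UTF-8 bytes
    print(text)                              # prints: Hello

In practice, the ODBC and JDBC drivers that front Db2 for i handle this translation automatically, which is why the EBCDIC-versus-ASCII split rarely surfaces as a real obstacle.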
AWS, Google Cloud, and Azure all have a rich and expanding set of data storage and analytics options that can work with practically any business data, whether it originated in a Db2 for i database or any number of other databases. All three public clouds store data in an object storage format, including AWS’s Simple Storage Service (S3), Google Cloud Storage (which is accessible via an S3-compatible API), and Microsoft’s Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage (which do not use the S3 API). They all also offer modern, column-oriented databases designed for large-scale SQL analytics – Amazon Redshift, Google BigQuery, and Azure Synapse Analytics, respectively – as well as a bevy of tools for building machine learning models to assist customers with predictive analytics programs and other data science endeavors.
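To give a sense of how low the bar to entry is, here is a hedged sketch of landing an exported file in S3 with AWS’s boto3 SDK. The bucket name, object key, and local file path are placeholders, and credentials are assumed to be configured in the environment.

    # Sketch: upload an exported data file to an S3 bucket with boto3.
    # Bucket, key, and local path are hypothetical placeholders;
    # credentials are assumed to come from the environment.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        "orders_export.csv",                # local file to upload
        "my-data-lake",                     # target bucket (placeholder)
        "ibmi/orders/orders_export.csv",    # object key (placeholder)
    )

Once the file sits in the object store, any of the cloud warehouses or analytics engines can read it in place.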
Once you have your data in a cloud object store, there are additional options available to you. You may have heard about Snowflake, which offers a data warehousing service in the cloud. Available on all three public clouds, Snowflake has garnered a lot of attention recently for the simplicity it offers customers. The company touts the ease of use of its data warehouse and says customers do not need advanced skills to manage data, as was common in the days of on-prem data warehouses. Its pay-per-use pricing also resonates with customers who are concerned about the potentially high cost of processing large amounts of data in the cloud (a cost the public cloud vendors may not tell you about).
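For the curious, querying Snowflake from Python looks much like querying any other SQL database. The sketch below uses the snowflake-connector-python package; the account identifier, credentials, warehouse, and ORDERS table are all hypothetical placeholders.

    # Sketch: run a SQL query against Snowflake from Python using the
    # snowflake-connector-python package. All connection values and
    # the ORDERS table are hypothetical.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myorg-myaccount",   # placeholder account identifier
        user="analyst",
        password="...",              # use a secrets manager in practice
        warehouse="ANALYTICS_WH",
        database="SALES",
        schema="PUBLIC",
    )
    cur = conn.cursor()
    cur.execute("SELECT REGION, SUM(AMOUNT) FROM ORDERS GROUP BY REGION")
    for region, total in cur:
        print(region, total)
    conn.close()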
Another cloud-based analytics company worth mentioning here is Databricks. Founded by the creators of the open source Apache Spark data processing framework, Databricks has taken Spark to new levels on its cloud (which in turn runs on your choice of AWS, Google Cloud, or Azure). While the in-memory Spark framework is powerful in its own right, Databricks has surrounded Spark with an array of cloud services designed to take the pain out of large-scale data analytics tasks, in particular the all-important process of ensuring good data quality. Dubbed the Unified Analytics Platform, the Databricks offering has received high marks for both traditional SQL processing as well as newer machine learning-based workloads.
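To give a flavor of what Spark-based analytics looks like, here is a minimal PySpark sketch that aggregates order data sitting in cloud object storage. The s3a:// path and the column names are assumptions for illustration.

    # Sketch: aggregate Parquet data in object storage with PySpark.
    # The s3a:// path and the REGION/AMOUNT columns are hypothetical,
    # and the s3a connector is assumed to be configured on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ibmi-order-analytics").getOrCreate()
    orders = spark.read.parquet("s3a://my-data-lake/ibmi/orders/")
    (orders.groupBy("REGION")
           .sum("AMOUNT")
           .show())

On Databricks, the same code runs in a notebook against a managed cluster, with the platform handling the provisioning and scaling.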
There are many other companies building and selling analytics and data science software and services on the cloud. It’s an extremely active and exciting segment of the market, and it’s driving real benefits for companies that want to move quickly with big data analytics but don’t want the hassle of dealing with all that physical infrastructure (just mind those cloud invoices!).
All of these options may at first seem bewildering to the traditional IBM i shop. The good news is that some of the core tools that open your IBM i data up to this exciting world have not changed appreciably. Getting the data up into the cloud typically requires the services of an ETL tool. And interacting with those cloud-scale SQL data warehouses can typically be done with the same familiar business intelligence tools you likely already use.
On the ETL front, companies like Attunity (now owned by Qlik), Informatica, Talend, Syncsort, Information Builders, and even IBM will all be competing for the right to move your OLTP data from IBM i servers into cloud object stores. On the BI front, you can use mainstream Windows and Web client interfaces from Qlik, Tableau, MicroStrategy, and Microsoft to interact with the cloud data warehouses.
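Those commercial tools add change data capture, scheduling, and error handling, but the underlying mechanics are not mysterious. As a hedged, bare-bones sketch – the DSN, credentials, library, table, and bucket names are all placeholders – a simple Db2 for i-to-S3 extract might look like this in Python:

    # Bare-bones sketch of an IBM i -> S3 extract: read rows from Db2
    # for i over ODBC, serialize them to CSV, and upload the result to
    # object storage. DSN, credentials, library/table, and bucket are
    # placeholders; a real ETL tool adds CDC, scheduling, and retries.
    import csv
    import io

    import boto3
    import pyodbc

    conn = pyodbc.connect("DSN=IBMI;UID=etluser;PWD=...")
    cursor = conn.cursor()
    cursor.execute("SELECT ORDER_ID, REGION, AMOUNT FROM SALESLIB.ORDERS")

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())

    boto3.client("s3").put_object(
        Bucket="my-data-lake",
        Key="ibmi/orders/orders.csv",
        Body=buf.getvalue().encode("utf-8"),
    )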
It’s a brave and exciting new world in the cloud. And while there are options for moving your entire IBM i workload into the public cloud, you don’t have to take that step. In fact, with the right mix of tools, you can move your OLTP data from IBM i into the cloud, and analyze it there.
RELATED STORIES
Syncsort’s Pitney Bowes Deal: All About Good, Clean Data
Hadoop and IBM i: Not As Far Apart As One Might Think
Unwinding Python’s Data Science Potential On IBM i