Unleashing Exploration on Enterprise Data



Enterprise customers have huge investments in transactional data systems, yet they struggle to provide their users with flexible and timely exploratory access to this data. One solution to this problem is to empower these users to use Jupyter Notebooks and Apache Spark running natively on z/OS to federate analytics across business-critical data as well as external data, while leaving all data on its system of origin.

This blog article examines customer requirements and technical challenges associated with the validation of a solution architecture that incorporates open source technologies and transactional data systems on z/OS. By combining several open source solutions, such as Apache Spark and Project Jupyter, we can present an aggregate view of transactional data to power users within the enterprise who desire an iterative data exploration experience.

Customer Challenges

The IBM zSystems Spark Demo Video examines how Apache Spark could be used to provide a robust data processing platform for enterprise analytics. While much work remains in gathering the customer requirements that influence this approach, we can assume the following:

  1. Enterprise customers with investments in System z applications desire to provide alternative means for the development, deployment and management of data processing oriented tasks (reports, jobs, analysis) that run against their transactional systems.
  2. While the data in these customer environments tends to be large, the data processing oriented tasks most often yield a smaller subset of data for downstream processing and analysis.
  3. Data processing oriented tasks have the potential to span multiple disparate data sources (internal and/or external), address time-consuming data munging activities, and may need to be run on some scheduled frequency.
  4. Data explorers may agree on the format and frequency of the data they require but often prefer to have freedom of choice on the tools and languages used to tackle their exploratory activities.
  5. As skills in the workplace transition away from deep z/OS expertise, Data Wranglers need to be able to port, mature and extend batch jobs for gathering and organizing data across disparate data sets using modern data processing techniques.

Technical Challenges

Several technical factors influence the approach to a solution. For starters, Apache Spark is a “black box” for data processing jobs, which are defined, packaged and submitted based on Java concepts. By itself, Apache Spark currently lacks an easy way for users to load and manage data and programs or to run queries, and these capabilities are not easily exposed to the power users who tend to perform interactive data exploration in languages like R and Python. While Project Jupyter is language neutral and is accompanied by an ecosystem of language kernels, the application itself has deep Python dependencies, which is a challenge on z Systems where Python is not a first-class language. Finally, most of the transactional systems (e.g., IMS, VSAM, DB2) running on z/OS lack support for a variety of language bindings, thereby limiting access to Java and Scala. Yet the power users tend to associate themselves with the two largest data analysis ecosystems, namely Python and R.
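
To make the first point concrete, even a trivial Spark batch job must be written against the JVM API, compiled, packaged as a JAR and handed to the cluster with a tool such as spark-submit. The following minimal sketch is purely illustrative; the object and file names are not part of any product:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal Spark batch job: count the non-empty lines in an input file.
// It must be compiled, packaged into a JAR and submitted to the cluster
// (for example with spark-submit) before it can run.
object SampleLineCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SampleLineCount"))

    val lines    = sc.textFile("/data/sample/input.txt") // illustrative path
    val nonEmpty = lines.filter(_.trim.nonEmpty).count()

    println(s"Non-empty lines: $nonEmpty")
    sc.stop()
  }
}
```

Submitting it looks roughly like spark-submit --class SampleLineCount sample-line-count.jar, a workflow that is natural for Java and Scala developers but opaque to the R- and Python-oriented power users described above.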

Observations

Our initial research in the integration of Project Jupyter and Spark on z Systems has revealed several key observations:

  • A tight coupling between a Jupyter Notebook client and the Apache Spark server platform libraries does not provide a viable solution architecture.
  • Scala exposes deep roots in Java, which ignores the programming language preferences associated with Data Scientists.
  • Apache Spark is not intended to be accessible to a broad audience; it should be positioned as a data processing engine that serves the needs of a data-driven community.
  • Few enterprise users would be granted the credentials and privileges to transact directly with backend transactional systems.

Solution Concept

Personas: Before we propose a solution architecture that incorporates Project Jupyter, Apache Spark and transactional data systems on z/OS, we need to describe the requirements of the personas believed to be the users of the solution.

Data Wrangler: Data experts within the enterprise who have a deep understanding of the schemas and relationships associated with the various data sources owned or used by a company. These individuals are responsible for gathering requirements from users across the company and creating programs that produce new data derivatives. These users tend to use a specialized set of tools that allow them to develop, publish, schedule and manage data processing programs for the business.

Information Management Specialist: Data experts within the enterprise who share the same deep data management skills as Data Wranglers but do not share the same scope of responsibilities. These users tend to have situational needs for running data queries where the results are small and low latency is not a requirement. The most distinguishing factor about these users is that they would be granted direct access to enterprise data sources so that they can perform their independent data exploration tasks. They may share some of the same tooling as a Data Wrangler.

Data Scientist: As experts in data investigation, these users apply their data and analytical abilities to find and interpret rich data sources, build mathematical models using the data, and present and communicate the resulting insights and findings. They are often expected to produce answers in days rather than months, to work by exploratory analysis and rapid iteration, and to present results with interactive dashboards rather than reports. These users apply data analysis techniques such as statistics, data mining and predictive analytics. The keys for these users are tooling flexibility and ease of data access.

Architecture Considerations

1. Managed Data Processing Environment: Apache Spark provides distributed task dispatching and scheduling. It can be used to allow data processing tasks to be submitted as batch jobs on some predefined frequency. Few users need to interact with such a managed environment for data processing jobs. Only Data Wranglers need a deep understanding of Spark and the tools necessary to integrate with it.

2. Data Lake: As data processing tasks are completed, the results of those jobs can be stored in a central location that is more easily accessible by a broader user community. New data sets that are produced by the Spark jobs can be refreshed or purged as desired by the system administrators or user community.

3. Content Format: Given the various programming languages used by Data Scientists, the Data Lake should embrace a data storage format, such as JSON, that is commonly supported across programming languages and data stores.
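
Pulling these three considerations together, a hedged sketch of such a batch job might read a DB2 for z/OS table over Spark's JDBC data source, reduce it to a small summary and publish the result as JSON for the Data Lake. The connection URL, credentials, table, columns and output path below are all hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical batch job: extract a large transactional table, reduce it to
// a small per-region summary and publish the result as JSON for the Data Lake.
object DailyOrderSummaryJob {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("DailyOrderSummaryJob"))
    val sql = new SQLContext(sc)

    // Load a (hypothetical) DB2 for z/OS table through the JDBC data source.
    val orders = sql.read.format("jdbc").options(Map(
      "url"      -> "jdbc:db2://zoshost:446/DB2LOC",  // hypothetical host and location
      "driver"   -> "com.ibm.db2.jcc.DB2Driver",
      "dbtable"  -> "SALES.ORDERS",                   // hypothetical table
      "user"     -> sys.env("DB2_USER"),
      "password" -> sys.env("DB2_PASS")
    )).load()

    // Millions of rows in, a few hundred summary rows out.
    val summary = orders.groupBy("REGION", "ORDER_DATE").count()

    // One JSON document per row, written to a shared (hypothetical) location.
    summary.write.json("/datalake/daily_order_summary")

    sc.stop()
  }
}
```

A job like this can be packaged once and then run by a scheduler or job server at whatever frequency the business requires, with only its small JSON output exposed to the broader user community.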

Solution Topology

One size does not fit all! Technologies like Project Jupyter and Apache Spark should be positioned for use where they excel.

Persona | Tool | Description
Data Wrangler | Spark Job Manager | Tooling used to develop, publish, schedule and manage the Spark data processing jobs that feed the Data Lake.
Information Management Specialist | Scala Workbench | Tightly-coupled Jupyter+Spark workbench that allows Scala users with direct access to transactional systems to query, analyze and visualize enterprise data.
Data Scientist | Jupyter Notebook | Open source tool for Python, R and Scala users to access a Data Lake of JSON files for analysis and insight generation (see the snippet below the table).
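
For the Data Scientist row above, consuming the published JSON stays entirely within the Data Lake. A minimal Scala snippet, reusing the hypothetical output path from the earlier sketch, might look like the following (the equivalent in Python or R is just as short):

```scala
import org.apache.spark.sql.SQLContext

// Assuming an SQLContext is available in the notebook session, load the
// JSON documents published by the batch job and explore them interactively.
def exploreSummary(sql: SQLContext): Unit = {
  val summary = sql.read.json("/datalake/daily_order_summary") // hypothetical path

  summary.printSchema()                      // inspect the inferred schema
  summary.filter("REGION = 'EMEA'").show(10) // drill into one slice of the data
}
```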

Solution Components

Component | Purpose
Data Lake | Cloudant NoSQL database that stores the results of Spark jobs in JSON format. In this solution JSON is positioned as the common point of exchange for all languages to load and manipulate.
Spark Job Server | Provides a REST API for submitting, running and monitoring Spark jobs. Also enables the results of jobs to be converted to JSON format (see the job sketch below the table).
Spark Job Server Client | Java client API for developing GUI tools that allow developers to easily manage Spark jobs.
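
For the Spark Job Server row above, a job that it can launch is typically a Scala object implementing the spark-jobserver SparkJob trait. The sketch below is illustrative only; the exact package and method signatures vary across spark-jobserver releases, and the job body is a placeholder:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// A job the Spark Job Server can launch on demand through its REST API.
object LineCountJob extends SparkJob {

  // Called before runJob: reject the request if the required input is missing.
  def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.path")) SparkJobValid
    else SparkJobInvalid("input.path is required")

  // The job body; its return value is serialized into the REST response.
  def runJob(sc: SparkContext, config: Config): Any =
    sc.textFile(config.getString("input.path")).count()
}
```

Once the packaged JAR has been uploaded, clients such as the Spark Job Server Client can start and monitor jobs over HTTP rather than interacting with Spark internals directly.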

Next Steps

While this article has proposed a solution architecture, the next step is to validate these concepts with interested customers. If your company is willing to evaluate this approach to enterprise analytics using Apache Spark on z Systems, please contact info@ibmjstart.com.

Dan Gisolfi
As CTO for Trusted Identity, Dan is focused on the development and execution of a trusted identity strategy for both citizen and corporate identity interactions using blockchain technologies. This endeavor includes the development of a formal IBM Mobile Identity offering, the definition and development of a trusted identity reference architecture, and the creation of devops tools that streamline the delivery of trusted identity solutions for clients.
