Case Study: Delivering Transportation Insights using Jupyter Notebooks, Interactive Dashboards, and Apache Spark
Our IBM Cloud Emerging Technologies team recently worked with Executive Transportation Group (ETG) to analyze executive car service trips in New York City. ETG wanted to simulate potential changes to its driver dispatch algorithm, and assess the impact of those changes on its operations. The goal was to identify changes that might increase efficiency and have a positive impact on rider experience.
Jupyter Notebooks provide a flexible environment for interactive data analysis. Our team started using Jupyter Notebooks several years ago to help train IBM Watson. Since that time, the Jupyter ecosystem has grown, and new technologies such as Apache Spark and several incubator projects under Project Jupyter have emerged to help Jupyter Notebooks become a robust environment for tackling data-centric problems.
We have found that frequent collaboration between data practitioners and domain experts is integral to deriving insights from data. Our team has therefore been leading the development of the dashboards and declarativewidgets Jupyter Notebook extensions, which make it easy to turn notebooks into interactive web-based dashboards.
Our team explored and analyzed historical ETG data in notebooks, and leveraged interactive dashboards to work closely with ETG domain experts. We created dashboards from notebooks, and deployed the dashboards as web applications. Both our team and ETG experts then used the web applications to visualize and interact with the data, and our team iterated on the analysis based on feedback, questions, and discoveries that resulted from such interactions. We repeated this cycle over the course of the engagement as depicted in the figure below.
Executive Transportation Group
Executive Transportation Group serves over two million corporate passengers each year across North America. While ETG operates in many cities across the continent, a large percentage of its client base is concentrated in the New York City metro area. On average, ETG dispatches over 1,000 professional drivers to several thousand pickups in and around the city each day.
ETG shared a subset of the data it collects from its day-to-day operations with our team. The data, which was anonymized to protect the privacy of its customers, included:
- Pickup metadata, including pickup coordinates, the time the passenger made the pickup request, the driver that ETG dispatched to the pickup, the time ETG dispatched the driver, the time the driver arrived at the pickup location, etc.
- GPS coordinates of every driver, at regular intervals.
- Information about when drivers are available for pickups, as indicated by drivers.
During the engagement, our team used Jupyter Notebooks for all of our data exploration and analysis. Within a few hours of receiving historical pickup data from ETG, we created a notebook that produced enough summary statistics and plots to give us a high-level understanding of ETG’s pickup operations. We were able to determine how many pickups ETG serviced, how many drivers ETG dispatched, and when and where the number of pickups peaked each day. From there, we created over a dozen notebooks to:
- cleanse and merge data from different data sets
- calculate the distances that drivers traveled to historical pickup locations, as well as the time it took to travel to those locations
- determine the location of every driver at the time of each historical pickup request
- simulate the times it would have taken other drivers to travel to the historical pickup locations
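As a rough illustration of the distance and travel-time steps listed above, the sketch below sums great-circle (haversine) distances along an ordered GPS trace and converts a distance to a travel time at a constant speed. This is a minimal sketch under stated assumptions, not the method used in the engagement: the function names, the straight-line distance model (which ignores the road network), and the 20 km/h default speed are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trace_distance_km(points):
    """Sum of segment distances along an ordered list of (lat, lon) GPS fixes."""
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))


def estimated_minutes(distance_km, speed_kmh=20.0):
    """Travel time in minutes at a constant speed (assumed value, not from ETG data)."""
    return 60.0 * distance_km / speed_kmh


# Hypothetical trace of three GPS fixes in Manhattan.
trace = [(40.7580, -73.9855), (40.7614, -73.9776), (40.7681, -73.9819)]
distance = trace_distance_km(trace)
print(f"{distance:.2f} km, about {estimated_minutes(distance):.1f} minutes")
```

Straight-line distance understates real driving distance in a city grid, so a fuller analysis would more likely rely on the recorded GPS traces or a routing service.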
To simulate the travel times of drivers to pickup locations, we had to perform several calculations on the historical pickup data. For example, it was necessary to determine the location of every available driver at the time of each pickup request. This doesn’t sound too bad, until you do a little math.
On a typical weekday, ETG has over 1,000 active drivers at any given time of day. In order to determine the location of every driver for, say, 50,000 pickups, we needed to perform over 50 million lookups against historical GPS data.
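The arithmetic, and one way such a lookup could work, can be sketched as follows. The code assumes each GPS fix is a hypothetical (driver_id, timestamp, lat, lon) tuple and treats a driver's location at request time as the most recent fix at or before that time. The function names and data layout are illustrative assumptions, not ETG's actual schema or our notebook code.

```python
from bisect import bisect_right

# Scale of the problem: one lookup per driver per pickup request.
drivers, pickups = 1_000, 50_000
total_lookups = drivers * pickups  # 50,000,000


def build_index(gps_fixes):
    """Group (driver_id, timestamp, lat, lon) fixes by driver, sorted by timestamp."""
    index = {}
    for driver_id, ts, lat, lon in sorted(gps_fixes, key=lambda f: (f[0], f[1])):
        times, coords = index.setdefault(driver_id, ([], []))
        times.append(ts)
        coords.append((lat, lon))
    return index


def location_at(index, driver_id, request_ts):
    """Most recent (lat, lon) for the driver at or before request_ts, else None."""
    times, coords = index.get(driver_id, ([], []))
    i = bisect_right(times, request_ts)  # number of fixes with timestamp <= request_ts
    return coords[i - 1] if i else None


def locations_for_pickup(index, request_ts):
    """Location of every known driver at the time of one pickup request."""
    return {driver_id: location_at(index, driver_id, request_ts) for driver_id in index}


# Hypothetical fixes; timestamps are seconds since the start of the day.
fixes = [
    ("d1", 100, 40.70, -74.00),
    ("d1", 200, 40.71, -74.01),
    ("d2", 150, 40.75, -73.98),
]
index = build_index(fixes)
print(locations_for_pickup(index, 160))
```

Each lookup is a binary search over one driver's sorted timestamps, and every pickup request is independent of the others, which is what makes this kind of workload straightforward to spread across a cluster.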
While some would debate the classification of this data as “big”, the actual size of the data is irrelevant. What really matters is the time it takes to extract useful information from it. In this case, performing the calculations on a single server was time-consuming, so our team turned to Apache Spark.
There is no doubt that Spark has garnered a lot of attention when it comes to processing large data sets. As Spark has matured, its API has added support for multiple programming languages, including Python, Scala, and R. The Spark community itself provides the Spark Python API, and IBM recently contributed the Spark kernel (now an Apache Incubator project called Apache Toree) to open source. As a result, Jupyter Notebook users have multiple kernel options to choose from if they want to leverage Spark from within a notebook.
The Jupyter community also provides several Jupyter Docker stacks that include everything necessary to run Jupyter Notebook kernels that can leverage Spark. After our team deployed a few virtual machines on IBM SoftLayer to host a Spark cluster, we pulled one of the publicly available pre-built Spark-enabled notebook images, and were up and running Spark jobs within an hour.
Overall, our team made liberal use of Spark for both data preparation and analysis tasks. With our minimal cluster, we increased the speed of most calculations by 10X over what we achieved using a single server.
Our team leveraged the declarativewidgets Jupyter Notebook extension to build widgets that allowed us to interact with the ETG data within notebooks. This extension builds upon Jupyter’s interactive widgets project, and adds support for defining language-agnostic widgets declaratively, using HTML markup.
In a nutshell, the declarativewidgets extension makes it easy to create “widgets”, or web-based UI elements (e.g., sliders, input boxes, buttons, tables, maps, etc.), within a notebook, and to bind widget inputs and outputs to functions, dataframes, or plain old variables in the notebook.
The coolest feature that declarativewidgets provides is the ability to bind multiple widgets to the same data, which means notebook users can create widgets that respond to interactions with other widgets. We leveraged this capability to build a map dashboard to visualize and explore historical pickups. A portion of the dashboard is shown in the image below. It contains a table widget, a Google map widget, and a list widget, all of which were created with just a few lines of HTML. Click the image to see an animation, and note that interactions in both the table and list widgets drive the map widget.
Our team used the dashboards Jupyter Notebook extension to share and publish notebooks containing results and interactive widgets. The extension allows notebook users to select and lay out notebook cells as a “dashboard”, and to deploy the dashboard content as a stand-alone web application, in just a few clicks.
In the animated image below, we use the dashboards “Layout” view to select and position only the notebook cells we want to expose. We then preview what the dashboard will look like when we deploy it.
We then deploy the notebook dashboard as a stand-alone web application hosted by our Jupyter Notebook server. To share the dashboards with ETG, our team simply sent the URL of the deployed web application.
The dashboards extension provides a much faster and easier alternative to traditional methods of creating dashboards from notebook content. For example, one alternative would have been to convert notebooks to web applications by hand, which would have meant developing the applications to reproduce the notebook logic and results, testing the applications, and deploying and managing them. This requires a lot of effort, but we’ve found that such effort is often wasted because stakeholders use the applications for a short time, get the insights they need, and then forget about them.
The ability to easily create interactive dashboards to explore the ETG data was invaluable. It helped our team obtain a deep understanding of ETG’s operations, and it gave ETG domain experts hands-on access to the data so they could explore it on their own in a new way. Most importantly, it fostered communication and collaboration between our team and ETG domain experts throughout the engagement. We used dashboards in frequent discussions to:
- identify an appropriate sample data set for use in the study
- eliminate certain types of pickups from the data set, to avoid skewing results
- invalidate incorrect assumptions about driver behavior
- identify the need for additional data (this happened numerous times)
- communicate key findings and results
In one example, we instantly learned that some of our original assumptions about driver behavior were incorrect after we looked at the data on a map. Because we were able to create the map widget so easily, we were able to correct our assumptions early on. Using dashboards led to numerous moments of discovery like this one.
“It’s really good stuff…I love data and computers, they tell me a lot. When they back up a gut feel or a hunch, they are nice. When they give us new insights (or an aha! moment), they are better.” -Mark Heminway, VP of Operations and Business Development, Executive Transportation Group
New technologies are helping Jupyter Notebooks become a flexible and robust environment for interactive data analysis. We were able to work closely with ETG to derive insights about its business using notebooks and associated technologies.
Several Jupyter Notebook kernels provide support for Apache Spark, which means that notebook users can leverage Spark to help speed up calculations and work with larger data sets. We used Spark to speed up many of our calculations by up to 10X. And the availability of pre-built images that include Spark-enabled notebook environments makes it much easier to get up and running.
The dashboards and declarativewidgets Jupyter Notebook extensions provide an easy way to share and publish notebook content as stand-alone, interactive web dashboards. This helped foster collaboration during our engagement with ETG, and enabled both our team and ETG domain experts to explore and interact with data in new ways. We found interactive dashboards to be extremely useful in helping derive insights from the ETG data.