Visualizing Big Data with Spark and Scala
While exploring data analytics with Apache Spark, the team realized that there are many Python examples but comparatively few resources for Scala. In particular, there are few data visualization examples in Scala. Python’s predominant visualization library is Matplotlib, but we struggled to find a Scala library that offered the same breadth of functionality and granularity of control.
In our search, however, we did come across the Brunel project. Brunel is its own language, built specifically for data visualization: a highly succinct and novel language for describing interactive visualizations based on tabular data.
By adding a jar file to your Scala notebook, you can unlock all of the language’s capabilities. It takes a single line:
%AddJar -magic https://brunelvis.org/jar/spark-kernel-brunel-all-1.2.jar
This jar defines “brunel” as a magic command, which lets you run Brunel code within a Scala notebook. Best of all, you can refer to your Spark DataFrames from within your Brunel code cells.
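As a rough sketch of what that can look like, the Scala cell below builds a small DataFrame that a later Brunel cell could refer to by its variable name. The session setup, the column names, and the rainfall values are all made up for illustration. The `%%brunel` cell-magic spelling in the closing comment is an assumption on our part, so check it against the kernel you are running.

```scala
import org.apache.spark.sql.SparkSession

// Most Spark notebooks already provide a session; getOrCreate() reuses it.
// master("local[*]") is an assumption so the sketch also runs standalone.
val spark = SparkSession.builder()
  .appName("brunel-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up illustrative values, not data from our notebooks.
val df = Seq(
  (2010, 31.2),
  (2011, 28.7),
  (2012, 35.4),
  (2013, 30.1)
).toDF("year", "rainfall")

// A following notebook cell could then reference the variable by name
// (assumed magic spelling):
//   %%brunel data('df') x(year) y(rainfall)
```

The point of the sketch is only that the name inside `data('...')` matches the Scala variable holding the DataFrame.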
To create more Scala resources, the team tried to replicate some of the existing Python notebook examples. Overall, Brunel worked great! The visualizations were sleek and the code to create them was minimal, in most cases significantly shorter than the Matplotlib equivalent. One drawback, however, is that Brunel creates SVGs that are rendered in the browser. While these look nice, they can be hard on the browser when a plot contains many elements. Brunel also struggled with a few scatterplot visualizations, although, to be fair, we were trying to plot roughly 700,000 data points. Below is a side-by-side of the Python visualization (left) and the corresponding Brunel visualization (right). Note that we ended up using a subset of the data for the Brunel visualization.
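One way to take such a subset is to sample the DataFrame in Spark before handing it to Brunel. The helper below is a minimal sketch under that assumption and not necessarily how our notebooks did it: `sampleForPlot`, the default seed, and the 20,000-point cap in the usage comment are hypothetical, while `Dataset.sample` is the standard Spark call.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Brunel or Spark): cap the number of rows
// sent to the browser by drawing a random sample without replacement.
def sampleForPlot(df: DataFrame, maxPoints: Long, seed: Long = 42L): DataFrame = {
  val total = df.count()
  if (total <= maxPoints) df // already small enough, plot everything
  else {
    // sample() keeps each row with this probability, so the resulting
    // row count is approximately (not exactly) maxPoints.
    val fraction = maxPoints.toDouble / total
    df.sample(withReplacement = false, fraction = fraction, seed = seed)
  }
}

// Example (illustrative cap): plot `plotDf` instead of `df` in the Brunel cell.
// val plotDf = sampleForPlot(df, maxPoints = 20000L)
```

A fixed seed keeps the plot reproducible between runs. A random sample can thin out sparse regions of a scatter plot, so it is worth comparing the result against the full-data Python version.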
Brunel offers a fairly extensive set of plots and graphs, and it is easy to use. It relies on a list of simple commands to generate its plots. In Scala, it can be as simple as naming a DataFrame and two columns, like the code below.
data('df') x(column1) y(column2)
This code takes the data from “column1” and “column2” of the DataFrame “df” and generates a scatter plot. It is as simple as that! Below are just a few more visualization samples generated with Brunel from our notebooks.
Two notebooks in this GitHub repository illustrate the Brunel visualizations discussed above. The “Precipatation_Analysis.ipynb” notebook explores and analyzes historical annual precipitation data, while the “NYPD_Motor_Vehicle_Accidents_Scala.ipynb” notebook draws insights from car accident reports.
Our team works on many projects that require data visualization. We welcome suggestions for alternative libraries or technologies that you have come across for visualizing data from within Jupyter Notebooks.