Kafka and Spark Streaming for IoT at the Vacation Resort


In the Internet of Things at the Vacation Resort Experience blog post we described the Jabil/IBM jStart Proof of Concept (POC) high-level architecture and business objectives. This post delves more deeply into the solution architecture and reviews a few of the lessons learned in our first experience with Kafka and Spark Streaming technologies.


System Component Diagram

Training Phase

As part of the implementation of this POC, 26 Reco beacons were placed on a floor in one of Jabil’s St. Petersburg locations. A beacon is a Bluetooth low-energy (BLE) wireless device that sends out iBeacon-compatible broadcast messages (iBeacon is a protocol/standard developed by Apple Inc.) to receivers, which measure the received signal strength and thereby calculate proximity to the beacon.

iPhone and RedBear mini devices were used as receivers to collect the beacon broadcasts and calculate the received signal strength. An iOS training app was developed, along with a sketch on the RedBear devices, to capture beacon broadcast signals for each zone. A training set of received signal strengths for each room, by device, was stored in a Bluemix Cloudant service.

The training set collected in this phase was used to train a machine-learning (ML) model generated by IBM SPSS Modeler. This model determines the proximity (zone) of a receiver device based on the signal strengths it collects in real time. Once the ML model was developed and tested, it was deployed to the real-time Bluemix Predictive Modeling service to score (i.e., determine the location of) the beacon data captured by each receiver device.

Run-Time Phase

In the run-time phase, an iPhone app was developed, along with a sketch loaded on the RedBear device, to capture beacon/temperature/humidity readings at a regular interval and publish the data to a Bluemix IoT Foundation service topic. A Bluemix Node-RED runtime flow then subscribed to the IoT Foundation topic and orchestrated the data as follows:

  1. Persisted the data to a Bluemix Cloudant Service for retrospective analysis
  2. Sent the data to the Bluemix Predictive Modeling services for scoring resulting in the identification of the zone location of the device
  3. Published the zone-derived temperature/humidity data to a Softlayer-hosted Kafka service topic

Once a message is published to the Kafka service topic, two solution components subscribe to it: 1) an HDFS Object Store application, which stores the messages in an HDFS object store for future Jabil analytics use; and 2) a Spark Streaming service, which uses the location/temperature/humidity data to calculate dwell times and temperature averages, then publishes the collected and calculated data to an additional Kafka topic.
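As a rough sketch, the Spark Streaming service’s subscription to the Kafka topic can be set up with the direct API of that era (Spark 1.x); the broker address, topic name, and batch interval below are illustrative, not the POC’s actual values:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical broker/topic names; the POC used a Softlayer-hosted Kafka service.
val conf   = new SparkConf().setAppName("resort-iot-streaming")
val ssc    = new StreamingContext(conf, Seconds(10))
val params = Map("metadata.broker.list" -> "kafka.example.com:9092")

// Each message carries zone-tagged temperature/humidity readings.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, params, Set("iot-input"))

stream.map(_._2).print()  // message values; the real logic computes dwell times and averages
ssc.start()
ssc.awaitTermination()
```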

Finally, the data published by Spark Streaming is consumed by a Jabil-developed web application through a WebSocket (Kafka .NET client) implementation.

Lessons Learned

The implementation produced numerous lessons learned. Below we have selected a few of the key observations from this POC.


It’s important to keep the servers’ time synchronized.

We installed the environment in a few VMs in Softlayer, which is an IBM IaaS cloud offering. We were testing the environment by running some simple jobs in standalone Spark and Spark managed by YARN. One day the jobs suddenly failed to start. Upon examining the logs we discovered that the date/time on the servers was way off, and some subsystem logged an error about an expired ticket.

We installed NTP daemons on all the servers and configured them so that the master server synchronizes with the outside world and the workers get their time from the master over the local subnet. Perhaps due to the well-known NTP amplification security vulnerability, or for some other reason, the Softlayer environment blocks outgoing NTP traffic but provides an internal server for time synchronization.
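Concretely, the setup looks roughly like this (the internal time server hostname and addresses are illustrative; check your provider’s documentation for the actual internal NTP endpoint):

```
# /etc/ntp.conf on the master: sync against the provider's internal time server
server servertime.service.softlayer.com iburst

# /etc/ntp.conf on each worker: sync against the master's private address
server 10.0.0.10 iburst
```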

Making Kafka accessible to both workers and clients.

When creating the cluster, we set up a management/master node to be on both public and private VLANs, and the workers to be only on the private VLAN. Since the PoC was relatively small, we set up only one Kafka node on the management node.

Kafka was the primary integration point: messages were delivered into an “input” Kafka queue, where they were picked up by the Spark workers and processed, and the results were placed into an “output” Kafka queue to be picked up by the clients downstream.

It turns out Kafka provides the broker host name early in its protocol, so neither the private nor the public IP address worked: the workers can’t see the public address, and the outside clients can’t see the private one. We ended up binding the public address to a publicly resolvable host name in a domain we had control over, overriding its IP address in /etc/hosts on the workers to point to the internal address, and then configuring Kafka to advertise that host name (advertised.host.name).
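The resulting configuration, with an illustrative domain and private address, looks like this:

```
# /etc/hosts on each worker: point the public name at the private address
10.0.0.10   kafka.example.com

# Kafka server.properties: advertise the resolvable name, not an IP
advertised.host.name=kafka.example.com
```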

Other Kafka settings that may be useful for testing and development.

During test and development you may create topics and messages that you stop caring about within an hour. Kafka works hard to keep your messages, so if you want to be able to actually delete topics you don’t need (rather than keep seeing them in the list with a “(deleted)” marker), or you want your test messages deleted earlier than the default 7 days, you may want to set delete.topic.enable=true and lower log.retention.hours (global settings). You can also use the kafka-topics.sh utility to set an individual retention interval per topic.
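For example (ZooKeeper address and topic name are illustrative):

```
# server.properties (global settings)
delete.topic.enable=true
log.retention.hours=1

# per-topic retention via the topics utility (one hour, in milliseconds)
bin/kafka-topics.sh --zookeeper zk.example.com:2181 --alter \
  --topic test-topic --config retention.ms=3600000
```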

Smoothing out input signal fluctuations

Part of the project’s use case was to calculate a “dwell time”: for how long a device was detected to be in a certain location. The logic that determined the location was probabilistic and therefore not 100% accurate or reliable; sometimes it produced a neighboring location before going back to the correct one. We tried to address this in the Spark Streaming logic. Since we received data in batches, we could tell whether a value was a one-off by looking at the values that followed it, unless it was the last value in the batch. We deemed the time and effort to implement proper handling of such data to be outside the scope of the project.
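One simple way to suppress such one-off misreads within a batch is a sliding-window majority vote. This is a sketch of the idea, not the POC’s actual logic, and it shares the same limitation noted above: a stray value at the end of a batch cannot be corrected.

```scala
// Replace each zone reading with the majority value of a small window around it,
// so an isolated wrong zone between correct readings gets voted out.
def smooth(zones: Seq[String], window: Int = 3): Seq[String] =
  zones.indices.map { i =>
    val lo    = math.max(0, i - window / 2)
    val hi    = math.min(zones.length, i + window / 2 + 1)
    val slice = zones.slice(lo, hi)
    slice.groupBy(identity).maxBy(_._2.size)._1
  }

// smooth(Seq("lobby", "lobby", "gym", "lobby", "lobby")) votes the stray "gym" out.
```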

Spark Streaming API Persistence

Calculating average temperature and humidity values and a guest’s dwell time in different areas was a project requirement. Such calculations require a persistence layer in the application. The Spark Streaming API provides persistence; however, the checkpoint mechanism needs to be enabled. An HDFS file system was used as the checkpoint store.
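Enabling checkpointing looks roughly like this (the HDFS path and app name are illustrative); stateful operators such as updateStateByKey, which can maintain running sums for averages, require it:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode.example.com:9000/checkpoints/resort-iot"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("resort-iot"), Seconds(10))
  ssc.checkpoint(checkpointDir)  // required for stateful ops like updateStateByKey

  // A running (sum, count) per zone, from which averages are derived;
  // `readings` would be a DStream[(String, Double)] of zone -> temperature:
  // readings.updateStateByKey[(Double, Long)] { (values, state) =>
  //   val (sum, count) = state.getOrElse((0.0, 0L))
  //   Some((sum + values.sum, count + values.size))
  // }
  ssc
}

// Recover from the checkpoint if present, otherwise create a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```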

Spark Streaming Checkpoint Cleanup

To avoid the old code base being invoked, the Spark Streaming checkpoint must be cleared when deploying a new script version. Before this observation, a lot of time was spent trying to determine why changes were not being deployed.
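In practice this means clearing the HDFS checkpoint directory (path illustrative) before resubmitting the new job version:

```
# Remove the stale checkpoint so the new job version is actually picked up,
# then resubmit the streaming job.
hdfs dfs -rm -r /checkpoints/resort-iot
```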

Publishing to Kafka

The Spark Kafka direct API only allows you to read data from a Kafka topic, not to write data back to Kafka. The team resolved this using the following project: https://github.com/cloudera/spark-kafka-writer
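The common workaround, which spark-kafka-writer essentially packages up, is to create a Kafka producer on the executors inside foreachRDD/foreachPartition. A sketch with illustrative broker and topic names, assuming `results` is a DStream[String] of computed messages:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

results.foreachRDD { rdd =>
  rdd.foreachPartition { messages =>
    // One producer per partition, created on the executor
    // (producers are not serializable, so they can't be built on the driver).
    val props = new Properties()
    props.put("bootstrap.servers", "kafka.example.com:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    messages.foreach(m => producer.send(new ProducerRecord[String, String]("iot-output", m)))
    producer.close()
  }
}
```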

Scala JSON Library

The Scala language was used for the Spark analytics implementation. The team prefers to work with JSON using the Json4s library. It really makes life easier: you need to write just a couple of lines of code to parse JSON into a model and vice versa.
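For example, parsing an incoming reading into a case class and serializing it back takes only a few lines with Json4s (the field names here are illustrative, not the POC’s actual schema):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write

case class Reading(deviceId: String, zone: String, temperature: Double, humidity: Double)

implicit val formats: Formats = DefaultFormats

val json    = """{"deviceId":"redbear-01","zone":"lobby","temperature":22.5,"humidity":41.0}"""
val reading = parse(json).extract[Reading]  // JSON -> model
val back    = write(reading)                // model -> JSON
```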

Spark Analytics Testing Enablement

Apart from developing the Spark analytics application, we also needed to create a router application and simulator applications for the analytics app.
We decided to use the Bluemix Node-RED service as the framework for the router and simulator implementations, and we think we made a good choice here.
The Node-RED service lets you create a complex application using a web UI. As a result, this is how one of the simulator applications appears:

Below is the exported JSON of the above Node-RED flow. You can take the provided JSON and import it into a Node-RED editor for deeper inspection.

Node-Red Exported JSON Flow



