Apache Spark – Processing Access Point Wi-Fi Data


A Closer Look

Previously, in Utilizing Access Point Wi-Fi Data, we discussed several use cases a business manager could solve by leveraging WiFi data:

  • Determining dwell times of unique customers in a store setting.
  • Determining peak venue traffic hours.
  • Determining missed customer engagement opportunities, e.g., someone just passing by a location/short dwell times.
  • Determining what kinds of devices your clients are using, e.g., Apple/non-Apple.
  • Determining when your most frequent customer(s) are on site.
  • Determining where customers are coming from and going to.
  • Determining customers’ paths.

In this post, we will look more closely at how we can actually do some of these things, and at specific cases to be mindful of before moving forward.

Dirty Data

Unfortunately, WiFi data is not inherently clean, so most of these use cases require that we clean up our data first.

Here are some of the different kinds of noise you can expect to encounter when dealing with WiFi data.

Phantom Movement (Jitter)

Phantom movement occurs when a device is present in multiple access points’ wireless networks at the same time. In a situation like this you might see data resembling the following.

How are we to interpret this? Naïvely we could assume that device A moved from location 1 to location 2, then 3, then back to 1, but given the signal strength and time information that’s probably not the case. More likely, the device was within range of the access points at locations 1, 2, and 3 at the same time, resulting in readings from each of them. Based on the signal strength we can assume that the device was actually in location 1; thus, we can ignore the data from locations 2 and 3. However, we need to remain mindful of the signal strength in case the device actually does move into location 2 or 3.

Put generally, this is an example of a device being present in two or more access points’ wireless networks at the same time. In these cases, it would seem that a device is moving back and forth between locations very quickly. Combine this type of noise with passerby or stationary noise and you’ve got yourself some real problems.
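One possible way to damp this kind of jitter is to keep only the strongest reading per device within a short time window. The sketch below is a minimal illustration and not the method used in this post: the `Ping` record, the 10-second window, and the sample readings are all made up, and signal strength is assumed to be an RSSI value where a number closer to zero means a stronger signal.

```python
from collections import namedtuple

# Hypothetical ping record; the field names are illustrative only.
Ping = namedtuple("Ping", ["device", "location", "rssi", "ts"])

def suppress_jitter(pings, window_s=10):
    """Keep, for each device and fixed time window, only the strongest reading."""
    best = {}
    for p in pings:
        key = (p.device, p.ts // window_s)  # fixed, non-overlapping windows
        if key not in best or p.rssi > best[key].rssi:
            best[key] = p
    return sorted(best.values(), key=lambda p: (p.device, p.ts))

# Made-up readings: device A is heard by three access points within seconds.
pings = [
    Ping("A", 1, -45, 100),
    Ping("A", 2, -80, 101),
    Ping("A", 3, -85, 102),
    Ping("A", 1, -47, 103),
]
cleaned = suppress_jitter(pings)  # only the location 1 reading at ts=100 survives
```

Fixed windows are a simplification: two readings a second apart can still land in different windows, so a sliding window or a signal-strength threshold may suit real data better.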

Passerby Devices

A second type of noise could be ‘passerby’ devices. This is only considered noise in some use cases, but it’s helpful to understand how and why this can be considered noise.

A passerby device is a device which is only in the wireless network of an access point for a very short period of time, e.g., a car driving by a restaurant or a person walking by a small shop. How much noise this contributes really depends on the device traffic around the access point. This data can be useful if you’re trying to track where a device is moving and build a path based on that, or to determine missed client engagements. In most applications, though, e.g., determining throughput of a restaurant or average dwell time of all clients, this data would be considered noise.

Stationary Devices

A stationary device can be defined as an always-on, always-present device. Think wireless point-of-sale devices, wireless cameras, wireless TVs, other access points, etc. These devices manage to contribute a lot of data over time. We’ve found that stationary devices are generally responsible for 20% or more of device data. This is noise because stationary devices don’t normally provide any interesting or valuable data. Generally speaking, a store doesn’t need to know, or even care about, how many TVs or POS devices it has running.

Additionally, employees will contribute partially to this stationary device paradigm, due to their extended stay at an establishment. However, you may want to keep employee information, so it is important to find a way to distinguish between employees and full-time stationary devices.
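As a rough illustration of how that distinction might be drawn, the sketch below labels a device from the number of seconds it was seen on each day. The function name, the thresholds (20 hours for "always on", 4 hours for a shift, 3 long days), and the sample data are all assumptions chosen for the example, not values from this post.

```python
def classify_device(daily_seconds, stationary_hours=20, shift_hours=4, min_days=3):
    """Label a device from its per-day presence totals (in seconds)."""
    if not daily_seconds:
        return "unknown"
    hours = [s / 3600 for s in daily_seconds]
    # Present nearly around the clock on average: treat as a fixed device.
    if sum(hours) / len(hours) >= stationary_hours:
        return "stationary"
    # Present for shift-length stretches on several days: likely an employee.
    long_days = sum(1 for h in hours if h >= shift_hours)
    if long_days >= min_days:
        return "employee"
    return "customer"

# Made-up presence totals for three hypothetical devices.
presence = {
    "pos-terminal": [86400, 86400, 86000],
    "staff-phone": [28800, 30000, 27000, 0],
    "shopper": [1200, 0, 900],
}
labels = {mac: classify_device(days) for mac, days in presence.items()}
```

The right thresholds depend on opening hours and staffing patterns, so they would need tuning per venue.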

Building Dwell Times

Once you’ve got WiFi data, you are going to want to do two things: make the data easier for humans to consume, and get rid of noise. In most cases WiFi data will come in the form of many pings, rows of data containing: the access point’s MAC address (location), the MAC address of a device within the wireless network, a signal strength, and a timestamp. Individual pings can be misleading and difficult to understand because of the quantity of data, noise, and their discrete nature. One technique to help eliminate noise and give you better insights into your WiFi data is to build dwell times. A dwell time is simply a compression of consecutive pings into a single interval. The algorithm to convert these individual pings can be a bit tricky, partially because it’s not possible to completely parallelize it.

Below I’ve written some pseudocode in Python to build dwell times. Before this code comes into play, though, the data must first be organized into rows of tuples, where the first item in each row is a tuple (device MAC address, day) and the second item is an array of tuples (location, signal strength, timestamp in seconds) sorted by timestamp.

Here is a before and after, in tabular form, of what the data might look like (and in Python here):
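As a rough illustration of that reshaping, the sketch below groups a handful of made-up pings in plain Python. The function name, the sample MAC-like labels, and the choice to derive the day as whole days since the Unix epoch (UTC) are assumptions for the example. On a Spark RDD, a similar shape could be reached by mapping each ping to a keyed pair and then using `groupByKey` and `mapValues`.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

def group_pings(pings):
    """Turn (ap_mac, device_mac, rssi, ts) pings into
    ((device_mac, day), [(ap_mac, rssi, ts), ...]) rows sorted by timestamp."""
    grouped = defaultdict(list)
    for ap_mac, device_mac, rssi, ts in pings:
        day = ts // SECONDS_PER_DAY  # assumption: day = days since epoch, UTC
        grouped[(device_mac, day)].append((ap_mac, rssi, ts))
    return [
        (key, sorted(readings, key=lambda r: r[2]))
        for key, readings in sorted(grouped.items())
    ]

# "Before": flat, unordered pings (made-up values).
raw = [
    ("ap-1", "dev-A", -50, 86400 + 30),
    ("ap-1", "dev-A", -48, 86400 + 10),
    ("ap-2", "dev-B", -70, 86400 + 20),
    ("ap-1", "dev-A", -52, 2 * 86400 + 5),
]

# "After": one row per (device, day), readings in chronological order.
rows = group_pings(raw)
```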

The main things to note are how dwells are being created in a map-reduce paradigm and the choices we make to define what a dwell is.

This algorithm works in map-reduce because of how the data is organized. It keeps some of the efficiency of map-reduce by processing each device-day combination in parallel, but groups all of each device’s data for that day together in one set so that a chronological loop can be established.

To define a dwell, we set a maximum and minimum dwell inactivity. This is where the art comes in. It’s hard to know exactly when to consider a dwell a single time slice or two separate ones; this is something a developer has to determine based on how their access points collect data and where the data is being collected geographically.
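To make the chronological loop concrete, here is a minimal sketch that walks one device-day row and merges consecutive pings into dwells. It is a simplified stand-in rather than the original pseudocode: it uses a single maximum-inactivity threshold (`max_gap_s`, an assumed 300 seconds) and starts a new dwell whenever the location changes or the gap is exceeded. The row contents are made up.

```python
def build_dwells(row, max_gap_s=300):
    """Collapse one ((device, day), sorted readings) row into
    (device, location, start_ts, end_ts) dwells."""
    (device, _day), readings = row
    dwells = []
    current = None  # [location, start_ts, end_ts] of the dwell being built
    for location, _rssi, ts in readings:
        same_place = current is not None and current[0] == location
        if same_place and ts - current[2] <= max_gap_s:
            current[2] = ts  # extend the open dwell
        else:
            if current is not None:
                dwells.append((device, *current))  # close the previous dwell
            current = [location, ts, ts]
    if current is not None:
        dwells.append((device, *current))
    return dwells

# Made-up row: two close pings, a long silence, then a move to another AP.
row = (("dev-A", 1), [
    ("ap-1", -48, 100),
    ("ap-1", -50, 160),
    ("ap-1", -49, 1000),
    ("ap-2", -60, 1030),
])
dwells = build_dwells(row)
```

Because each row is independent, a function like this could be applied to every device-day row in parallel, while the loop inside it stays sequential.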

Using Dwell Times

Once dwell times have been established, things get a little easier to understand and distinguish. Now that the data can be viewed as visits by a person, we can filter out noise like 24-hour dwells and 1-second dwells.
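A filter of this kind might look like the sketch below, which sorts dwells into three buckets by duration instead of discarding anything, since short dwells can still matter for missed-engagement analysis. The dwell layout (device, location, start, end), the function name, and both cutoffs (60 seconds and 16 hours) are assumptions for illustration.

```python
def split_dwells(dwells, min_s=60, max_s=16 * 3600):
    """Partition (device, location, start_ts, end_ts) dwells by duration."""
    passerby, visits, stationary = [], [], []
    for dwell in dwells:
        duration = dwell[3] - dwell[2]
        if duration < min_s:
            passerby.append(dwell)    # too short to count as a visit
        elif duration > max_s:
            stationary.append(dwell)  # implausibly long for a customer
        else:
            visits.append(dwell)
    return passerby, visits, stationary

# Made-up dwells: a 1-second blip, a 30-minute visit, and a 24-hour dwell.
sample = [
    ("dev-A", "ap-1", 1000, 1001),
    ("dev-B", "ap-1", 2000, 3800),
    ("dev-C", "ap-1", 0, 86400),
]
passerby, visits, stationary = split_dwells(sample)
```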

Alex Lewitt
Alex is currently a software developer co-op on the IBM jStart Team and is pursuing a BS Degree in Computer Science at the University of Florida. Alex's main focus for the past few months has been on Apache Spark, Apache Spark Machine Learning, and IBM BlueMix.
Nathan Hernandez
Nathan is a developer and co-op on IBM's jStart team. Nathan is currently pursuing his MS in Computer Science at Appalachian State University and holds a BS in Computer Science from Appalachian State University.