Apache Spark – Utilizing Access Point Wi-Fi Data



Intro

Did you know that Wi-Fi routers in your home or in public places can collect a wealth of information about your mobile devices, even if you never sign in and connect to the Internet? Wi-Fi is ubiquitous in today’s world, and cell phones and other mobile devices are almost always probing for Wi-Fi networks, either passively or actively. When a mobile device probes for these networks, it passes along a small piece of identifying information: its MAC address. Using this MAC address, along with the time and duration that the device “pinged” the router, it is possible to track which locations a device (and thus its owner) visits, how long it dwells at each location, and what type of device it is (e.g., iOS vs. Android).
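As a rough illustration of how a device type can be guessed from a MAC address: the first three bytes of a MAC address form the vendor prefix (OUI), which can be looked up in a vendor table. The sketch below is only an assumption about how such a lookup might be written. The function names are hypothetical, and the prefixes in the table are made-up placeholders, not real vendor assignments; a real lookup would load the IEEE OUI registry.

```python
# Hypothetical vendor table. These prefixes are PLACEHOLDERS for illustration,
# not real OUI assignments.
HYPOTHETICAL_OUI_VENDORS = {
    "AA11BB": "Apple",
    "CC22DD": "non-Apple",
}

HEX_DIGITS = set("0123456789ABCDEF")


def normalize_mac(mac):
    """Return the MAC as 12 uppercase hex digits, dropping ':', '-' and '.' separators."""
    digits = "".join(ch for ch in mac if ch not in ":-.").upper()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError("not a valid MAC address: %r" % (mac,))
    return digits


def oui(mac):
    """Return the vendor prefix: the first three bytes (six hex digits) of the MAC."""
    return normalize_mac(mac)[:6]


def vendor_for(mac, table=HYPOTHETICAL_OUI_VENDORS):
    """Look up the vendor for a MAC, or 'unknown' if the prefix is not in the table."""
    return table.get(oui(mac), "unknown")
```

For example, `vendor_for("aa:11:bb:00:00:01")` returns the placeholder label stored under the `AA11BB` prefix, and any prefix missing from the table falls back to `"unknown"`.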

This Wi-Fi data has a number of fascinating uses, especially if you have more than one router in your network. For example, if you are a business owner with several Wi-Fi access points within one store, or with access points spread over a larger geographical area, you can analyze patterns across locations and build business cases for increasing revenue or cutting costs.

Exploring the Data

What can a business manager do with location-based log data collected from Wi-Fi access points?  There are several business-oriented use cases that come to mind:

  • Determining dwell times of unique customers in a store setting.
  • Determining peak venue traffic hours.
  • Determining missed customer engagement opportunities, e.g., someone just passing by a location/short dwell times.
  • Determining what kinds of devices your clients are using, e.g., Apple/non-Apple.
  • Determining when your most frequent customer(s) are on site.
  • Determining where customers are coming from and going to.
  • Determining customers’ paths.
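A couple of the use cases above can be sketched in a few lines of plain Python. This is a minimal sketch, not the approach used in this article: the record layout (device MAC, access point id, ISO timestamp), the sample values, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical probe records: (device MAC, access point id, ISO timestamp).
PROBES = [
    ("AA:11:BB:00:00:01", "ap-entrance", "2015-08-27T09:05:00"),
    ("AA:11:BB:00:00:01", "ap-checkout", "2015-08-27T09:40:00"),
    ("CC:22:DD:00:00:02", "ap-entrance", "2015-08-27T09:15:00"),
    ("CC:22:DD:00:00:02", "ap-entrance", "2015-08-27T10:02:00"),
    ("AA:11:BB:00:00:01", "ap-entrance", "2015-08-28T12:30:00"),
]


def unique_devices_per_hour(probes):
    """Count distinct devices seen in each hour of the day (0-23)."""
    devices_by_hour = defaultdict(set)
    for mac, _ap, ts in probes:
        devices_by_hour[datetime.fromisoformat(ts).hour].add(mac)
    return {hour: len(macs) for hour, macs in devices_by_hour.items()}


def peak_hour(probes):
    """Return the hour with the most distinct devices; ties go to the earlier hour."""
    counts = unique_devices_per_hour(probes)
    return min(counts, key=lambda hour: (-counts[hour], hour))


def visit_days(probes):
    """Count the number of distinct calendar days on which each device was seen."""
    days_by_device = defaultdict(set)
    for mac, _ap, ts in probes:
        days_by_device[mac].add(datetime.fromisoformat(ts).date())
    return {mac: len(days) for mac, days in days_by_device.items()}
```

With the sample records, `peak_hour(PROBES)` picks the 9 o'clock hour, when both devices were seen, and `visit_days(PROBES)` shows the first device appearing on two different days, which is one simple way to spot a frequent customer.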

Here is a sample of what the data from several access points might look like:

And here we see some of the kinds of charts we could build with a sufficient collection of data:

[Figures: three screenshots of example charts]

Data Challenges

Collecting and analyzing Wi-Fi access point data can present several data processing challenges. The more devices probing the access points (APs), the more data records you will collect, and a large network of access points and locations will generate an enormous quantity of device data records. This form of data can get very large very quickly, especially if there are stationary devices constantly probing an access point’s network. As discussed below, Apache Spark is the tool we used to greatly reduce processing time for our big data analysis.

Processing time is not the only challenge. These data records do not paint a clean picture, due to the extra ‘noise’ from stationary and passerby devices. This means a significant amount of work and data massaging must be done before any trends can be analyzed. Also, dwell times are not immediately clear when every device can probe at its own rate, and data can be difficult to interpret when the networks of multiple access points overlap one another.
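One possible way to handle irregular probe rates and noisy devices is to group each device's probes into visits and then label the visits by length. The sketch below assumes the same hypothetical record layout of (device MAC, access point id, ISO timestamp). The gap and dwell thresholds are arbitrary placeholders, not values from our analysis, and it does not try to resolve overlapping access point networks.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def dwell_sessions(probes, max_gap=timedelta(minutes=5)):
    """Group each device's probes at one access point into visits.

    A new visit starts whenever the gap since the previous probe exceeds
    max_gap (an assumed threshold). Returns {(mac, ap): [(start, end), ...]}.
    """
    stamps_by_key = defaultdict(list)
    for mac, ap, ts in probes:
        stamps_by_key[(mac, ap)].append(datetime.fromisoformat(ts))

    sessions = {}
    for key, stamps in stamps_by_key.items():
        stamps.sort()
        visits = [[stamps[0], stamps[0]]]
        for t in stamps[1:]:
            if t - visits[-1][1] > max_gap:
                visits.append([t, t])   # gap too long: start a new visit
            else:
                visits[-1][1] = t       # extend the current visit
        sessions[key] = [(start, end) for start, end in visits]
    return sessions


def classify(start, end,
             passerby_max=timedelta(minutes=2),
             stationary_min=timedelta(hours=8)):
    """Label a visit by its dwell time, using assumed thresholds."""
    dwell = end - start
    if dwell < passerby_max:
        return "passerby"
    if dwell > stationary_min:
        return "stationary"
    return "visitor"
```

Visits labeled `"passerby"` or `"stationary"` could then be dropped before computing dwell-time statistics. In practice the thresholds would need tuning for each venue.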

There are always edge cases to deal with when processing big data; our objective is to minimize the number of edge cases without significantly complicating the code.

Spark Solutions

We used the Apache Spark Starter on IBM’s Bluemix to process the large amount of Wi-Fi data we had. Note that it is also possible to set up and run Apache Spark locally if your dataset is smaller. In our next article we’ll investigate Spark further and walk through how we approached some of the aforementioned use cases.

Alex Lewitt
Alex is currently a software developer co-op on the IBM jStart Team and is pursuing a BS degree in Computer Science at the University of Florida. Alex's main focus for the past few months has been on Apache Spark, Apache Spark Machine Learning, and IBM Bluemix.
Nathan Hernandez
Nathan is a developer and co-op on IBM's jStart team. Nathan is currently pursuing his MS in Computer Science at Appalachian State University and holds a BS in Computer Science from Appalachian State University.