Apache Spark – Utilizing Access Point Wi-Fi Data
Did you know that Wi-Fi routers, whether in your home or out in public, can collect a wealth of information about your mobile devices even if you never sign in and connect to the Internet? Wi-Fi is ubiquitous in today’s world, and cell phones and other mobile devices are almost always probing for Wi-Fi networks, either passively or actively. When mobile devices probe for these networks, they pass along a small bit of identifying information: their MAC address. Using this MAC address, along with the time and duration that the device “pinged” the router, it is possible to track mobile devices (and thus the people who own them): which locations they visit, how long they dwell at a particular location, and what type of mobile device they use (e.g., iOS vs Android).
This Wi-Fi data has a number of fascinating uses, especially if you have more than one router in your network. For example, if you are a business owner with several Wi-Fi access points within one store, or with access points spread across a larger geographical area, you can analyze patterns across each location and identify opportunities to increase revenue or cut costs.
Exploring the Data
What can a business manager do with location-based log data collected from Wi-Fi access points? There are several business-oriented use cases that come to mind:
- Determining dwell times of unique customers in a store setting.
- Determining peak venue traffic hours.
- Determining missed customer engagement opportunities, e.g., someone just passing by a location/short dwell times.
- Determining what kinds of devices your clients are using, e.g., Apple/non-Apple.
- Determining when your most frequent customer(s) are on site.
- Determining where customers are coming from and going to.
- Determining customers’ paths.
Here is a sample of what the data from several access points might look like:
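As one illustration of the second item, peak traffic hours can be approximated by counting the distinct client MAC addresses seen in each clock hour. The sketch below is a minimal, plain-Python version of that idea; the function name `unique_devices_per_hour`, the `(client_mac, timestamp)` input shape, and the MAC addresses and timestamps are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime


def unique_devices_per_hour(probes):
    """Count distinct client MACs per clock hour.

    `probes` is an iterable of (client_mac, timestamp) pairs.
    """
    seen = defaultdict(set)
    for client_mac, seen_at in probes:
        # Truncate the timestamp to the start of its hour.
        hour = seen_at.replace(minute=0, second=0, microsecond=0)
        seen[hour].add(client_mac)
    return {hour: len(macs) for hour, macs in seen.items()}


# Hypothetical probe records, not taken from a real network.
probes = [
    ("aa:aa:aa:00:00:01", datetime(2015, 4, 12, 19, 5)),
    ("aa:aa:aa:00:00:01", datetime(2015, 4, 12, 19, 20)),  # same device again
    ("aa:aa:aa:00:00:02", datetime(2015, 4, 12, 19, 40)),
    ("aa:aa:aa:00:00:01", datetime(2015, 4, 12, 20, 10)),
]

counts = unique_devices_per_hour(probes)
peak_hour = max(counts, key=counts.get)
print(peak_hour, counts[peak_hour])
```

Counting distinct devices rather than raw probe records keeps one chatty device from inflating an hour's traffic.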
SIGNAL, CLIENT_MAC_ADDR, DATE_TIME, ACCESS_POINT_MAC_ADDR
11, 80:00:00:be:a6:52, 2015-04-12 19:10:03, 00:1c:0a:21:f8:40
6, 00:11:22:33:44:55, 2015-04-12 19:10:06, 00:1c:0a:f1:f3:10
5, 55:44:33:22:11:00, 2015-04-12 19:08:53, 00:1c:0a:f3:85:30
35, 33:44:55:00:11:22, 2015-04-12 19:09:40, 00:28:0a:22:1c:c9
30, e0:f4:48:09:19:92, 2015-04-12 19:10:20, 00:28:0a:22:1c:8e
5, 30:ab:2a:72:65:7e, 2015-04-12 19:09:59, 00:28:0a:f2:91:10
17, 60:c6:da:3e:4f:9f, 2015-04-12 19:10:18, 00:5a:0a:ed:6d:70
23, 18:a3:d1:ac:8a:cb, 2015-04-12 19:10:41, 00:5a:0a:21:f8:4a
8, e4:ac:d8:1a:21:fb, 2015-04-12 19:09:17, 00:5a:0a:f2:86:e0
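Before any analysis, records like these need to be turned into typed values. The following is a minimal sketch using only the Python standard library, assuming the data arrives as comma-separated text with the header shown above; the `Probe` type and `parse_probes` function are hypothetical names introduced here, not part of any tool mentioned in this article.

```python
import csv
from datetime import datetime
from io import StringIO
from typing import NamedTuple


class Probe(NamedTuple):
    signal: int
    client_mac: str
    seen_at: datetime
    ap_mac: str


# First three rows of the sample above, including the header line.
SAMPLE = """\
SIGNAL, CLIENT_MAC_ADDR, DATE_TIME, ACCESS_POINT_MAC_ADDR
11, 80:00:00:be:a6:52, 2015-04-12 19:10:03, 00:1c:0a:21:f8:40
6, 00:11:22:33:44:55, 2015-04-12 19:10:06, 00:1c:0a:f1:f3:10
5, 55:44:33:22:11:00, 2015-04-12 19:08:53, 00:1c:0a:f3:85:30
"""


def parse_probes(text):
    """Parse comma-separated probe records into Probe tuples."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    probes = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        signal, client_mac, stamp, ap_mac = (field.strip() for field in row)
        probes.append(
            Probe(
                signal=int(signal),
                client_mac=client_mac,
                seen_at=datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
                ap_mac=ap_mac,
            )
        )
    return probes


probes = parse_probes(SAMPLE)
print(len(probes), probes[0].client_mac, probes[0].seen_at)
```

Parsing the timestamp into a `datetime` up front makes later steps, such as grouping by hour or measuring the gap between two probes, simple arithmetic.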
And here we see some of the kinds of charts we could build with a sufficient collection of data:
Collecting and analyzing Wi-Fi access point data can present several data processing challenges. The more devices probing the APs, the more data records you will collect. A large network of access points and locations will generate an enormous quantity of device data records. This form of data can get very large, very quickly, especially if there are stationary devices probing an access point’s network constantly. As discussed below, Apache Spark is the tool we used to greatly reduce processing time for our big data analysis.
Processing time is not the only challenge. These data records do not paint a clean picture because of the extra ‘noise’ from stationary devices and devices that are merely passing by. This means a significant amount of work and data massaging must be done before any trends can be analyzed. Also, dwell times are not immediately clear when every device probes at its own rate, and data can be difficult to interpret when the networks of multiple access points overlap one another.
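One common way to approach the dwell-time problem is to group each device's probes into visits, starting a new visit whenever the gap between consecutive probes exceeds a threshold, and then to discard visits that are too short (likely passersby) or too long (likely stationary devices). The sketch below illustrates that idea in plain Python. It is not necessarily the method we used; the function names, the 5-minute gap, the 2-minute and 8-hour dwell limits, and the sample probes are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def dwell_sessions(probes, max_gap=timedelta(minutes=5)):
    """Group (client_mac, timestamp) probes into visits per device.

    Returns {client_mac: [(start, end), ...]}. A new visit starts when
    two consecutive probes are more than `max_gap` apart (assumed value).
    """
    by_client = defaultdict(list)
    for client_mac, seen_at in probes:
        by_client[client_mac].append(seen_at)

    sessions = {}
    for client_mac, stamps in by_client.items():
        stamps.sort()
        visits = []
        start = prev = stamps[0]
        for stamp in stamps[1:]:
            if stamp - prev > max_gap:
                visits.append((start, prev))  # close the current visit
                start = stamp
            prev = stamp
        visits.append((start, prev))  # close the final visit
        sessions[client_mac] = visits
    return sessions


def filter_visits(sessions,
                  min_dwell=timedelta(minutes=2),
                  max_dwell=timedelta(hours=8)):
    """Keep visits whose dwell time lies within [min_dwell, max_dwell]."""
    kept = {}
    for client_mac, visits in sessions.items():
        good = [(s, e) for s, e in visits if min_dwell <= e - s <= max_dwell]
        if good:
            kept[client_mac] = good
    return kept


# Hypothetical probes: one returning visitor and one passerby.
t0 = datetime(2015, 4, 12, 19, 0, 0)
probes = [
    ("aa:aa:aa:00:00:01", t0),
    ("aa:aa:aa:00:00:01", t0 + timedelta(minutes=3)),
    ("aa:aa:aa:00:00:01", t0 + timedelta(minutes=6)),
    ("aa:aa:aa:00:00:01", t0 + timedelta(minutes=30)),
    ("aa:aa:aa:00:00:01", t0 + timedelta(minutes=33)),
    ("aa:aa:aa:00:00:02", t0 + timedelta(minutes=10)),  # single probe
]

sessions = dwell_sessions(probes)
visitors = filter_visits(sessions)
for client_mac, visits in visitors.items():
    for start, end in visits:
        print(client_mac, start, end - start)
```

Because this sketch groups probes by client device only, a device heard by several overlapping access points still produces a single timeline. Attributing a visit to one particular access point would need an extra rule, for example picking the strongest signal.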
There are always edge cases to deal with when processing big data; our objective is to minimize the number of edge cases without significantly complicating the code.
We used the Apache Spark Starter on IBM’s Bluemix to process the large amount of Wi-Fi data we had, but note that it is also possible to set up and run Apache Spark locally if your dataset is smaller. In our next article we’ll investigate Spark further and walk through how we approached some of the aforementioned use cases.