Cognitive Social Command Centre
The Emerging Technology team used Bluemix-based runtimes, services and APIs to build a Cognitive Social Command Centre (CSCC) that allowed Wimbledon’s editorial team to understand what was being said about Wimbledon and the other sporting events of the Summer of 2016, enabling them to engage with tennis fans and sports fans in general.
The 2016 Challenge
Last year I wrote about the work on social media and stream analytics that the Emerging Technology group developed for Wimbledon. The project was all about supporting Wimbledon’s editorial team. For 2016, we continued this work and addressed some new challenges using Bluemix technology and services.
The Summer of 2016 was unusual in sporting terms, with several major events taking place simultaneously. The two weeks of the Wimbledon Championships coincided with the Euro 2016 Football Championships, the end of the 2016 Copa América, the Tour de France, two F1 Grands Prix and the build-up to the Rio Olympics. The packed schedule presented a challenge to Wimbledon in vying for sports fans’ attention, but also an opportunity to draw in fans from other sports.
In order to make their editorial content relevant, Wimbledon again needed to understand what fans were talking about and where their interests lay. Because of the packed sporting schedule, Wimbledon wanted to analyze social media content about all sports, not just tennis. This presented not only a scalability problem, in terms of the volume and rate of messages we needed to handle, but also a text analytics problem: widening the scope meant much more variation in the topics discussed, and made those topics far less predictable.
When the focus was just on Wimbledon we had a well-defined set of competitors, we knew which celebrities would be attending, and we had several years of data to help us decide what the likely topics of conversation would be. Once we opened the scope to include all sports we lost this advantage. We knew the analytics would need to be much more adaptable, so that we could refine them during the tournament, and that pre-programmed, rule-based text analytics would not cut it.
Coping with Scale
The main worry in terms of scaling the analytics was the unknown. We knew the amount of data we had processed for Wimbledon in 2014 and 2015. We suspected that the Euros would generate much more conversation than Wimbledon, and could also spike far more sharply when an important or controversial moment happened in a game. The requirement was for real-time analytics, but the unpredictability of social media meant that whatever capacity planning we did, there was always the potential for something unexpected to take us beyond what we had anticipated. It was important that we could buffer incoming data and lose no analysis, even when it couldn’t happen in real time.
Experience showed us that social media activity around sports events tends to be very bursty – spiking during the significant moments, but dying down to a low background level the rest of the time. Scaling down to handle small data volumes was as important as being able to scale up.
IBM Bluemix’s MessageHub was the key piece of infrastructure that solved these problems. MessageHub gave us a way both to scale up our analytics processes and load balance between them, and a mechanism to buffer data if at any point we couldn’t process it fast enough. Data flowing in to our application (from Twitter, Facebook, YouTube and Instagram) was pushed onto a MessageHub queue before anything else happened. The analytics and annotation processes read from this queue. If those processes couldn’t keep up, the queue would build up, but no data would be lost.
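The decoupling this gives you can be illustrated with an in-memory queue standing in for the MessageHub topic. This is only a single-process sketch of the principle (our actual processes were Java runtimes talking to MessageHub); the function names are illustrative:

```python
import queue

# Stand-in for a MessageHub topic: ingest appends raw messages, analytics
# processes drain at their own pace, and a backlog builds up rather than
# data being dropped.
buffer = queue.Queue()

def ingest(message):
    """Push a raw social media message onto the buffer before any processing."""
    buffer.put(message)

def drain(batch_size):
    """An analytics process pulls however many messages it can currently handle."""
    batch = []
    while len(batch) < batch_size and not buffer.empty():
        batch.append(buffer.get())
    return batch
```

If the analytics fall behind, messages simply wait in the buffer; nothing is lost, and processing catches up when the burst subsides.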
MessageHub also allowed us to load balance across analytics processes. When we needed to process data at a higher rate we would increase the number of analytics processes (Java runtimes also running in Bluemix). These processes read from the queue, but because MessageHub allows a queue to be partitioned, the incoming data could be evenly balanced across however many analytics processes we had running. For peak times, all we needed to do was increase the number of analytics runtime instances and we got more throughput. Reducing the number of these processes simply redistributed the messages across the remaining ones.
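MessageHub is based on Apache Kafka, so this balancing follows Kafka’s partitioning model: messages are hashed by key onto a fixed set of partitions, and the partitions are divided among however many consumers are running. A minimal sketch of that idea (the partition count and key are hypothetical, and real Kafka clients do the assignment for you):

```python
from zlib import crc32

NUM_PARTITIONS = 12  # hypothetical partition count for the topic

def partition_for(key: str) -> int:
    """Kafka-style keyed partitioning: the same key always lands on the
    same partition, so related messages stay together."""
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS

def assign_partitions(num_consumers: int) -> dict:
    """Spread partitions evenly over however many analytics processes are
    running; adding or removing a process just triggers a reassignment."""
    return {p: p % num_consumers for p in range(NUM_PARTITIONS)}
```

Scaling up means more consumers each own fewer partitions; scaling down hands the freed partitions back to the remaining consumers.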
In 2015 we relied on Watson Content Analytics (WCA) for the text analytics. This is a rule-based platform in which a developer manually writes rules to classify and analyse text. It worked well with a relatively well-defined set of data, but it is less adaptable to new and unexpected data. The rules can be rewritten, but this requires a skilled developer and some understanding of how to successfully divide and classify the new data, and it takes even an experienced rule developer time to do.
For the 2016 solution we relied on a combination of Watson Natural Language Classifier (NLC) and Alchemy Language APIs, both provided as Bluemix services. NLC was used to classify the incoming content into three categories: content about Wimbledon, content about other sports and ignore. This was done by training NLC with data from 2015. We had IBM interns manually classify a few thousand social media messages and this training set was used to build a classifier. During the tournament we added to this training set with data from this year’s event to improve classification accuracy. This was relatively quick. One person can classify a hundred messages in a few minutes, and the person doing the training does not require any specialised skills or training.
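The routing decision driven by the classifier is simple once you have its response. A sketch of that step, where the response shape mimics NLC’s ranked-classes output and the category labels are hypothetical stand-ins for the ones we actually trained:

```python
# Hypothetical category labels; NLC returns a "top_class" plus a ranked
# list of classes with confidences, of which we act on the top one.
CATEGORIES = {"wimbledon", "other_sports", "ignore"}

def route(classification: dict):
    """Decide what to do with a message given a classifier response.

    Returns None for messages to discard, or the category name for
    messages that should flow on to further analysis.
    """
    top = classification["top_class"]
    if top not in CATEGORIES:
        raise ValueError(f"unexpected class: {top}")
    return None if top == "ignore" else top
```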
Messages classified as ‘ignore’ were discarded, but messages classified as being about Wimbledon or other sports were sent for further analysis using the Alchemy APIs. These APIs are used to pick out topics and entities within the text and also assess the sentiment of the message. The messages were annotated with this extra metadata.
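The annotation step amounts to merging the analysis results into each message. A sketch, with illustrative field names rather than the exact Alchemy response schema:

```python
def annotate(message: dict, analysis: dict) -> dict:
    """Attach entities, topics and sentiment to a message.

    `analysis` stands in for the combined language-analysis response;
    the field names here are assumptions for illustration.
    """
    enriched = dict(message)  # leave the original message untouched
    enriched["entities"] = [e["text"] for e in analysis.get("entities", [])]
    enriched["topics"] = [k["text"] for k in analysis.get("keywords", [])]
    enriched["sentiment"] = analysis.get("sentiment", {}).get("type", "neutral")
    return enriched
```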
Now that we had a set of classified social media messages, annotated with entities, topics and sentiment, we aggregated them in real time. A small Spark cluster ran a streaming task that performed a set of windowed counts over the messages, allowing us to find the volume of messages about each topic or entity. These counts were ordered and output to the dashboard to be displayed. We also ran additional algorithms in Spark to find things like the rate of change of a topic and a network analysis of the users discussing a topic.
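The windowed counting can be illustrated in plain Python. The real job used Spark Streaming’s windowing across a cluster; this single-process sketch with a hypothetical 60-second window just shows the bookkeeping involved:

```python
from collections import Counter, deque

class WindowedTopicCounter:
    """Sliding-window counts per topic, mimicking what the Spark
    streaming job computed for the dashboard."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = deque()   # (timestamp, topic), in arrival order
        self.counts = Counter()

    def add(self, timestamp: float, topic: str) -> None:
        """Record a message about `topic`, then drop expired events."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n: int):
        """The n most-discussed topics in the current window, ordered."""
        return self.counts.most_common(n)
```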
The messages finally flowed in to an Elasticsearch index where they could be accessed as historical data. As well as showing real-time stats, the CSCC also allowed users to query for topics and entities that were being discussed at any point in time during the tournament.
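A historical lookup of this kind can be expressed in Elasticsearch’s query DSL as a filter on topic and time range. The field names here (`topics`, `timestamp`) are assumptions about the index mapping, not our actual schema:

```python
def topic_query(topic: str, start_iso: str, end_iso: str) -> dict:
    """Build an Elasticsearch query body for messages about a topic
    within a time window, oldest first."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"topics": topic}},
                    {"range": {"timestamp": {"gte": start_iso, "lt": end_iso}}},
                ]
            }
        },
        "sort": [{"timestamp": "asc"}],
    }
```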
The data was rendered in a web dashboard. This was a relatively simple Node.js app, again running in Bluemix. It provided a web UI and also an API abstraction for accessing both the real-time and historical data. The API was written to allow for new UI visualisations and displays. The dashboard allowed users to export a more detailed Excel report. This report gave stats about each day and showed a comparison to social media activity from the same day of the tournament in 2015. This report was sent to Wimbledon’s editorial team at the end of each day.
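The year-on-year comparison in that report boils down to a percentage change against the same championship day in 2015, along these lines (a sketch of the arithmetic, not the actual report code, which lived in the Node.js app):

```python
def year_on_year(current: int, previous: int):
    """Percentage change in message volume versus the same day in 2015.

    Returns None when there is no 2015 baseline to compare against.
    """
    if previous == 0:
        return None
    return round(100.0 * (current - previous) / previous, 1)
```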
By taking advantage of Bluemix and the Watson Cognitive APIs, the Emerging Technology team were able to build an application that met Wimbledon’s needs for 2016, but also one that could be transferred to other events without additional development effort.