Conversational Commerce and Speaker Recognition
Conversational commerce offers businesses a powerful way to reach today’s consumers by enabling natural language interactions, whether through chat bots in popular messaging applications or through speech-enabled applications on computing devices. In short, it lets users interact with businesses quickly, in natural language. IBM Emerging Technologies has been exploring conversational commerce, and as part of that exploration we have been developing a prototype that, among other things, highlights the applicability of speaker recognition to conversational commerce. In this post, I discuss a collaboration between IBM Emerging Technologies and IBM Research that leverages speaker recognition in speech-enabled conversational commerce.
Speech-Enabled Conversational Commerce
Products like Echo and Siri demonstrate the power of voice technologies in speech-enabled products and show the potential of natural language as an interface to software systems. Developing such systems requires advanced software: Watson Developer Cloud provides Language and Speech services that can be used to build software systems for conversational commerce, enabling applications that handle a wide variety of tasks such as managing email and purchasing on the Internet. As conversational commerce grows, security will become critical, and security techniques will need to adapt to this new model of interaction. For example, a user interacting with a speech-enabled system should not be required to type a password or a captcha code. Instead, biometric technologies can be used to authenticate users in a manner consistent with the interaction (e.g., voice or face). Speaker recognition is one such complementary technology that can be used to help secure speech-enabled systems. Using speaker recognition, we can design interactions in which a spoken request to perform a sensitive operation (e.g., ‘Hi Watson, please order a large pizza pie from Nick’s Pizzeria’) can not only be processed but also authorized by analyzing the audio.
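To make the idea concrete, here is a minimal sketch of how a spoken request might be both processed and authorized. The intent names, the `identify` callback, and the return values are all hypothetical illustrations, not the prototype’s actual API; the sketch only shows the gating pattern: sensitive intents require the audio to match a known speaker before they are executed.

```python
# Hypothetical intents that require speaker authorization before execution.
SENSITIVE_INTENTS = {"place_order", "change_payment_method"}

def handle_request(intent, audio_sample, identify):
    """Process a spoken request.

    `identify(audio) -> user_id or None` is a stand-in for an open-set
    speaker identification service: None means 'unknown speaker'.
    """
    if intent in SENSITIVE_INTENTS:
        user = identify(audio_sample)
        if user is None:
            return "authorization_failed"
        return f"authorized:{user}"
    # Non-sensitive requests are processed without speaker checks.
    return "processed"
```

For example, a pizza order would flow through the sensitive branch, while small talk would not.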
Speaker recognition is the process of recognizing a speaker given a particular audio sample. Unlike speech recognition, which determines what was said, speaker recognition determines who spoke. Two commonly used processes for speaker recognition are speaker verification and speaker identification. Here we focus our attention on open-set speaker identification.
Open-set speaker identification is the process of determining if a voice sample from a speaker is from one of a known set of speakers or from an unknown speaker. For example, consider an in-car application where the user speaks to the car and the vehicle tailors its settings automatically for the user. In this open-set scenario the user could be someone who is already registered in the car or a guest.
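The core decision in open-set identification can be sketched as follows. This is a simplified illustration, not IBM Research’s system: it assumes each speaker is represented by a fixed-length voice embedding and compares embeddings with cosine similarity, declaring the speaker unknown when even the best match falls below a threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_open_set(sample_emb, enrolled, threshold=0.7):
    """Return the best-matching enrolled speaker, or None if the sample
    is judged to come from an unknown speaker (the 'open set' case)."""
    best_id, best_score = None, -1.0
    for speaker_id, emb in enrolled.items():
        score = cosine(sample_emb, emb)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= threshold else None
```

In the in-car example, a registered driver’s sample would score above the threshold against their enrolled model, while a guest’s sample would score below it against every model and be treated as unknown.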
There are many factors to consider when developing a speaker recognition system such as error rates, ease of use, and cost. Further technical information can be found at the end of this document.
A collaborative effort between IBM Emerging Technologies and the Audio Analytics team in IBM Research led to the development of voice authentication capabilities in our prototype. The voice authentication feature leverages a speaker recognition system developed by the Audio Analytics team. To balance the need for accuracy with the desire to minimize the audio required from the user to build classification models, we implemented voice authentication using a text-dependent speaker identification technique. A text-dependent (versus text-independent) approach to speaker identification improves accuracy for short-duration audio samples. Thus, we set a short passphrase that must be used by every user of the prototype.
Voice authentication works as follows:
- A registered user of the prototype enrolls in voice authentication by providing ten samples of the system passphrase. Users record their passphrase through a recording feature on a voice authentication registration page. The audio is streamed to the backend of the application, which routes it to a speaker identification service. Watson Speech to Text is used in the browser to perform preliminary checks on the audio (e.g., checking for expected keywords and word count), which prevents low-quality recordings from being processed.
- The ten samples are used to create a classification model for the registered user. The speaker identification service maintains a count of the passphrase samples associated with a registered user. Once ten samples have been collected, the speaker identification service will use the ten samples to create a model. The model is associated with a registered user and is stored for future use by the speaker identification service.
- Once the model is created, voice authentication is immediately available and the user can begin authenticating via voice.
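The enrollment steps above can be sketched as a small service. This is an illustrative assumption-laden mock, not the prototype’s implementation: the passphrase text, the ten-sample requirement handling, and the `_train` placeholder are hypothetical, and the preliminary transcript check stands in for the keyword and word-count checks done in the browser with Watson Speech to Text.

```python
REQUIRED_SAMPLES = 10  # ten passphrase recordings per user, per the workflow above
PASSPHRASE = "my voice is my password"  # hypothetical system passphrase

def passes_preliminary_check(transcript):
    """Cheap gate before accepting a recording: reject transcripts with
    the wrong word count or missing passphrase keywords."""
    words = transcript.lower().split()
    expected = PASSPHRASE.split()
    return len(words) == len(expected) and all(w in words for w in expected)

class EnrollmentService:
    """Counts accepted passphrase samples per user and builds a
    classification model once enough samples have been collected."""

    def __init__(self):
        self.samples = {}  # user_id -> list of accepted audio samples
        self.models = {}   # user_id -> trained model

    def submit_sample(self, user_id, audio, transcript):
        if not passes_preliminary_check(transcript):
            return "rejected"
        self.samples.setdefault(user_id, []).append(audio)
        if len(self.samples[user_id]) >= REQUIRED_SAMPLES:
            # Placeholder for the real text-dependent model training.
            self.models[user_id] = {"n_samples": len(self.samples[user_id])}
            return "model_created"
        return "accepted"
```

The key design point mirrored here is that cheap checks run before the expensive step: a malformed recording is rejected up front, and model creation is triggered only when the tenth valid sample arrives.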
In summary, IBM Emerging Technologies is developing a prototype as part of an exploration into conversational commerce. The prototype includes a voice authentication feature that authenticates users through a spoken passphrase. The feature was implemented using Watson Developer Cloud services and IBM Research’s speaker identification system. We are in the process of implementing a continuous authentication feature using the same technologies and will blog about it soon.
Related Information on Conversational Commerce
Related IBM Publications
- J. Pelecanos, J. Navratil, and G. Ramaswamy, “Conversational Biometrics: A Probabilistic View,” in Advances in Biometrics, Springer London, pp. 203–224, 2008.
- S. Sadjadi, J. Pelecanos, and S. Ganapathy, “The IBM Speaker Recognition System: Recent Advances and Error Analysis,” in Proc. Interspeech, 2016.
- S. Sadjadi, S. Ganapathy, and J. Pelecanos, “The IBM 2016 Speaker Recognition System,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2016), 2016.
- S. Sadjadi, J. Pelecanos, and W. Zhu, “Nearest Neighbor Discriminant Analysis for Robust Speaker Recognition,” in Proc. Interspeech, pp. 1860–1864, 2014.