Talking to Machines

16 Apr 2021

Steve Young

Talking to machines is becoming commonplace. We routinely tell our smart speakers what to play next, we tell our satnavs where we want to go, we ask our phones general knowledge questions, we dictate messages to friends directly from our smartwatches, and much more. Slowly but surely the conversational agents that recognise and respond to our requests are getting smarter. Speech recognition errors are becoming less frequent, and synthesised voices are increasingly humanlike. Behind the scenes, the range of skills offered is growing. For example, Siri can read you messages, translate from one language to another, order a pizza, find you a movie, and tell you who is at the front door. Alexa has a similar built-in repertoire of functions and, through its “Skills” interface, can also connect you with literally thousands of third-party applications.

When it works well, using a conversational agent to provide a general-purpose interface to computer-based services has many advantages. It allows the user to execute a wide range of complex functions in natural language without having to learn special commands and interface protocols. It is hands-free and generally much quicker than typing or clicking through a deep hierarchy of menu choices. It can learn the user’s preferences and provide short-cuts for frequently executed tasks. Overall, it’s just easier to simply say what you want when you want it.

The promise of this easy-to-use ubiquitous control interface has led many to argue that “Voice User Interfaces” or VUIs will eventually supersede today’s plethora of graphical touch, click and swipe interfaces. However, before that can happen there are a few issues to resolve. Firstly, there is the problem of discoverability. When the functionality of an agent is limited, how can the user learn what it can and cannot do? One solution is for the agent to provide context-sensitive help. When the user tries to access a hitherto unused service, the agent should proactively explain what it can do and how to do it. Doing this effectively whilst avoiding irritating the user with unsolicited and unwanted information is quite a challenge.

The second problem is to efficiently present the results of a search without a display screen. For example, if we use Google to find a nearby restaurant, we can scan the ranked list of search results very quickly to find somewhere that we fancy. Presenting similar information by voice is not so easy since reading out a long list is not practical. To solve this, consider how hotel concierges operate. If you ask them for a nearby restaurant, they don’t simply list all of the possibilities. Instead, they ask you questions about your preferences in order to reduce the list to just two or three options. To be really effective, a conversational agent must learn to do something similar. However, this raises a third issue – trust.

As we get more and more dependent on conversational agents, trust will become a major issue. If an agent guides you to a specific restaurant choice, is it genuinely helping you find the restaurant which best matches your preferences, or is it steering you to the restaurant paying the largest commission? When you provide an agent with information about yourself, is it strictly maintaining your privacy or is it passing on the information to others? In the longer term, as agents get smarter and smarter, their ability to do us harm, whether deliberately or accidentally, will increase. Indeed, at the extreme end of concerns about trust, some experts argue that artificial intelligence must eventually exceed human levels of intelligence, thereby enabling conversational agents to potentially become an existential threat (Bostrom, 2014).

Whatever the future is for conversational agents, it is certain that they will have a significant societal impact. To ensure that this impact is benign will require public engagement and discussion. Conversational agents are extremely complex engineering systems, but they are not magic. Serious debate requires a basic understanding of how a conversational agent actually works in order to properly understand issues related to trust and future development. However, the conversational agents we are familiar with are all proprietary and their designs and implementations are confidential.

In contrast, Cyba is a fictitious conversational agent which, unlike its famous cousins, is happy to explain exactly how it works. Despite its complexity, Cyba tries to keep its explanations simple. Rather than use mathematical equations, it uses diagrams to explain how it recognises speech, extracts meanings from words, manages a conversation and generates spoken responses. As well as purely technical matters, Cyba also discusses issues of privacy and trust, and speculates a little on future developments. Taking a tour of Cyba’s inner workings will provide a great foundation for understanding this important technology of the future.

Hey Cyba
The Inner Workings of a Virtual Personal Assistant
By Steve Young

About The Author

Steve Young

Steve Young has more than 40 years of research experience in speech processing and AI. He founded a number of speech technology startups including Entropic acquired by Microsoft in...
