Meeting the technical challenges of far-field voice control

Living through lockdown, we have fallen in love with voice technology all over again. A report in Voicebot.ai states that over half of voice assistant owners have increased their usage through lockdown and 40% plan to up their usage even after lockdown has lifted.

Last year (2019), Futuresource Consulting reported that “shipments of virtual assistants rose 25% year on year to 1.1 billion units”. The analyst house also forecast the market to exceed 2.5 billion shipments by 2023.

The question for the market is exactly how consumer interactions with these virtual assistants will evolve over time. In just a few decades, the way we interact with technology has moved from the clunky keypad experience to the sophistication of the touchscreen. Now we’re heading into the landscape of voice, biometrics, haptics and other sensing technology.

With the COVID-19 pandemic heightening our awareness of touch and hygiene, a truly natural voice interface is not just an improvement in human-machine interaction; it’s arguably a necessity. To open the door to that natural conversation, you need far-field voice (FFV).

Far-field voice around the home

FFV technology can be integrated with a huge array of devices. In the short term, there is an obvious benefit to using it to enhance the capabilities of smart speakers, but the real prize is in embedding FFV within the other devices around the home. This includes TVs, sound bars, set-top boxes and other smart devices.

Let’s take TV as an example. While near-field, push-to-talk (PTT) voice has been a gateway to a voice-enabled TV experience, it still requires the use of a physical remote. By embedding far-field voice control into the device, the user is free to enjoy a hands-free experience, calling up content on the TV from anywhere in the room, with no more hunting for the remote or frustrating keypad entry and navigation.

Although huge progress has been made in FFV technology in the last 18 months, consumer adoption is still taking off and the ecosystem is still growing. This makes it challenging for product and solution architects to realise the full potential of the technology.

FFV is a complex technical challenge. Voice interfaces need intelligent algorithms, purpose-built for modern living spaces, that can analyse the acoustic landscape to identify and isolate a command from every other sound in the room. Add the need to deliver these interfaces at exceptionally aggressive eBOM costs, and integrating FFV can seem daunting for designers.

How to make far-field voice work?

Capturing a clear voice command from a distance isn’t easy. It requires some complex digital signal processing (DSP). Accuracy of capture and clarity of command are critical. Our ears automatically tune out background noise to focus on and amplify the sound we want to hear. But a microphone captures the whole soundscape, including unwanted noise such as conversation, traffic, appliances, air-conditioning, birdsong and dogs barking.

Fundamentally, for the success of FFV interfaces, purpose-designed algorithms are required to provide clarity in challenging acoustic environments, ‘cleaning up’ the voice signal for transmission to an Automatic Speech Recognition (ASR) engine. With the XVF3510, our latest FFV interface solution, we address the three dominant noise sources in the environment to ensure the highest capture and transmission quality.

The first noise source is the noise generated by the device itself, for example if you’re talking to a smart speaker playing music or a smart TV streaming a film. Our acoustic echo canceller (AEC) removes this audio stream by modelling the echo response and estimating the portion of the device’s own audio that the microphone picks up. This enables you to barge in (cut across) the music or audio that’s playing.
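To make the idea of "modelling the echo response" more concrete, here is a minimal sketch of a textbook normalised least-mean-squares (NLMS) adaptive filter. It is an illustration only and is not the XVF3510 algorithm: the function name `nlms_echo_cancel`, the tap count, the step size and the synthetic 4-tap echo path are all assumptions made for the example.

```python
import random

def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Remove an adaptive estimate of the playback echo from the mic signal."""
    weights = [0.0] * taps   # current model of the echo response
    history = [0.0] * taps   # latest reference samples, newest first
    residual = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate          # what is left after cancellation
        energy = sum(x * x for x in history) + eps  # normalises the step size
        step = mu * error / energy
        weights = [w + step * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights

# Synthetic check: the "room" is a made-up 4-tap echo path and nobody is speaking.
random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.2, 0.1]
mic = [
    sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]
residual, weights = nlms_echo_cancel(reference, mic)
print([round(w, 3) for w in weights])
```

In this toy setup the learned weights settle on the echo path and the residual falls towards zero. In a real product the residual would instead contain the user's voice, which is what makes barge-in possible.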

The second source of noise is point noise, or noise coming from a fixed point in the room, for example appliances or the kettle boiling. Our interference canceller ‘scans’ the soundscape of the room and suppresses static point noise sources in the surrounding space.
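One simple way to see how two microphones can suppress a source at a fixed position is a delay-and-subtract null. The sketch below is a deliberately simplified, non-adaptive stand-in and not the interference canceller itself: it assumes the interferer's inter-microphone delay is already known, that the voice reaches both microphones at the same instant, and the name `steer_null` is hypothetical.

```python
import math

def steer_null(mic_a, mic_b, delay):
    """Cancel a source that reaches mic_b `delay` samples after mic_a."""
    output = []
    for n in range(len(mic_b)):
        delayed_a = mic_a[n - delay] if n >= delay else 0.0
        output.append(mic_b[n] - delayed_a)
    return output

# Toy scene: the voice arrives at both mics together, the interferer hits mic_a first.
delay = 3
interferer = [math.sin(0.9 * n) for n in range(200)]
voice = [0.5 * math.sin(0.05 * n) for n in range(200)]
mic_a = [v + i for v, i in zip(voice, interferer)]
mic_b = [
    voice[n] + (interferer[n - delay] if n >= delay else 0.0)
    for n in range(200)
]
output = steer_null(mic_a, mic_b, delay)
```

The interferer cancels exactly, while the voice survives as the difference `voice[n] - voice[n - delay]`, so it is spectrally coloured. A practical canceller would have to find and track the interferer's direction and compensate for that colouring.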

Finally, the XVF3510 accounts for background ambient noise, like an air conditioning unit or general chatter in a room. Here, our noise suppression algorithm reduces general background noise from the microphone input, creating a clear audio stream to pass to the speech recognition engine.
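A classic way to reduce stationary background noise is spectral subtraction: estimate the noise magnitude in each frequency bin while nobody is speaking, then subtract it from later frames. The sketch below is one textbook approach, shown on a single frame with a naive DFT for self-containment; it is not a description of the XVF3510 noise suppressor, and the frame size, noise level and spectral floor are assumptions.

```python
import cmath
import math
import random

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtract(frame, noise_magnitude, floor=0.05):
    """Subtract the estimated noise magnitude per bin, keeping the noisy phase."""
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitude):
        magnitude = abs(value)
        reduced = max(magnitude - noise, floor * magnitude)  # never go below the floor
        cleaned.append(value * (reduced / magnitude) if magnitude > 0 else 0j)
    return idft(cleaned)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(2)
N = 64

def noise_frame():
    return [random.uniform(-0.3, 0.3) for _ in range(N)]

# Estimate the noise magnitude spectrum from frames that contain no speech.
noise_spectra = [dft(noise_frame()) for _ in range(20)]
noise_magnitude = [sum(abs(s[k]) for s in noise_spectra) / len(noise_spectra)
                   for k in range(N)]

clean = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]  # stand-in for speech
noisy = [c + n for c, n in zip(clean, noise_frame())]
denoised = spectral_subtract(noisy, noise_magnitude)
print(mse(noisy, clean), mse(denoised, clean))
```

A production system would use an FFT with overlapping windowed frames and would track the noise estimate continuously rather than measuring it once.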

These three algorithmic blocks are tuned to work together. The output is then fed into an automatic gain control (AGC) which normalises and optimises the volume for the speech recognition engine.
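The gain-normalising step can be pictured with a very small feed-forward AGC that tracks a smoothed power estimate and scales each sample towards a target level. This is a generic sketch under assumed parameters (`target_rms`, `smoothing` and `max_gain` are illustrative), not the AGC used in the product.

```python
import math

def automatic_gain_control(samples, target_rms=0.1, smoothing=0.01, max_gain=30.0):
    """Scale samples so that their smoothed level tracks target_rms."""
    level_sq = target_rms ** 2   # smoothed power estimate, seeded at the target
    output = []
    for sample in samples:
        level_sq = (1.0 - smoothing) * level_sq + smoothing * sample * sample
        gain = min(max_gain, target_rms / math.sqrt(level_sq + 1e-12))
        output.append(gain * sample)
    return output

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A quiet talker and a loud talker should both come out near the same level.
quiet = [0.01 * math.sin(2 * math.pi * n / 20) for n in range(4000)]
loud = [1.0 * math.sin(2 * math.pi * n / 20) for n in range(4000)]
quiet_out = automatic_gain_control(quiet)
loud_out = automatic_gain_control(loud)
print(round(rms(quiet_out[-1000:]), 3), round(rms(loud_out[-1000:]), 3))
```

The cap on the gain is a common design choice: it stops the stage from amplifying near-silence into audible noise.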

In these complex audio systems, delays between the audio reference and the audio output can degrade performance. Our automatic delay estimator algorithm compensates for any delay in audio coming out of the system and ensures echo cancellation is optimised for reliable barge-in.
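A common way to estimate such a delay is to slide the captured signal against the reference and pick the lag with the highest cross-correlation. The sketch below shows that general technique on synthetic data; the function name `estimate_delay`, the search range and the 37-sample delay are assumptions for illustration, and the actual delay estimator may work quite differently.

```python
import random

def estimate_delay(reference, captured, max_delay):
    """Return the lag (in samples) at which captured best matches reference."""
    best_lag, best_score = 0, float("-inf")
    span = len(reference) - max_delay   # samples compared at every lag
    for lag in range(max_delay + 1):
        score = sum(reference[n] * captured[n + lag] for n in range(span))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic check: the captured signal is an attenuated, delayed copy of the reference.
random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(500)]
true_delay = 37
captured = [0.0] * true_delay + [0.5 * x for x in reference[:-true_delay]]
estimated = estimate_delay(reference, captured, max_delay=100)
print(estimated)
```

Once the lag is known, the reference can be delayed by that amount before it reaches the echo canceller, so the adaptive filter only has to model the room and not the system latency.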

The future — far-field voice and artificial intelligence

As you can see, this is not an insignificant technical challenge. However, at XMOS not only have we delivered all of these capabilities in our XVF3510 platform, we have also designed a system that can deliver class-leading performance with only two microphones, which is critical to delivering FFV in an eBOM-efficient package.

And this is just the beginning. Recognising both the need for and potential of the multi-modal interactions of the future, we are already exploring ways to harness edge AI, voice and other sensors to transform the end-user experience of FFV and virtual assistants with presence and context awareness.

Although this remains a young market, voice performance is already table stakes for interfaces. OEMs need ever more capabilities in ever-smaller packages at ever-lower eBOM costs. The focus for tech vendors has to be on value-add experiences, not purely on voice in isolation.

To find out more, watch our session from VOICE Global here.

Spoken Command	Translation
打開電視	Turn on the TV
下一頻道	Next channel
上一頻道	Previous channel
增加音	Increase sound
降低音量	Lower the volume
關閉電視	Turn off the TV
開燈	Turn on the light
增加亮度	Increase brightness
減少亮度	Reduce brightness
關燈	Turn off the lights
開風扇	Turn on the fan
提高風速	Increase wind speed
降低風速	Reduce wind speed
提高溫度	Increase temperature
降低溫度	Reduce the temperature
關風扇	Turn off the fan

Spoken Command
Switch on the TV
Switch off the TV
Channel up
Channel down
Volume up
Volume down
Switch on the lights
Switch off the lights
Brightness up
Brightness down
Switch on the fan
Switch off the fan
Speed up the fan
Slow down the fan
Set higher temperature
Set lower temperature
