In the latest post on its machine learning blog, Apple's audio software engineering team explains the challenges of speech detection for smart speakers and how it uses machine learning models running on the HomePod's A8 chip to improve far-field accuracy.
The HomePod must maintain recognition accuracy when loud music is playing on the device or when the person talking is far away, and it must isolate the sound of someone speaking a command from other sounds in the room, like a TV or a noisy appliance.
As ever, the blog post is written for other engineers and scientists, which is reflected in its very technical language. You can read the whole thing here, but the gist is that the HomePod uses custom multichannel filtering to remove echo and background noise, and unsupervised learning models to focus only on the person who said 'Hey Siri' when multiple people are speaking in a room.
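To get a rough feel for what "filtering to remove echo" means in practice, here is a minimal, hypothetical sketch of a single-channel normalized least-mean-squares (NLMS) echo canceller in Python. It is a textbook illustration only, not Apple's implementation; the function name and every parameter value below are arbitrary choices for the demo, and the HomePod's actual system is multichannel and far more sophisticated.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_len=256, mu=0.5, eps=1e-6):
    """Suppress the speaker's own playback (ref) from the microphone signal (mic)
    using a normalized least-mean-squares (NLMS) adaptive filter.

    Illustrative single-channel sketch only, not the multichannel system
    described in Apple's post.
    """
    w = np.zeros(filter_len)              # adaptive filter taps
    out = np.zeros(len(mic))              # echo-suppressed output
    for n in range(filter_len, len(mic)):
        x = ref[n - filter_len:n][::-1]   # most recent playback samples, newest first
        echo_est = np.dot(w, x)           # current estimate of the echo at time n
        e = mic[n] - echo_est             # residual: speech plus background noise
        w += (mu / (np.dot(x, x) + eps)) * e * x  # NLMS tap update
        out[n] = e
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(16000)                   # pretend 1 s of music at 16 kHz
    echo = np.convolve(ref, [0.6, 0.3, 0.1])[:16000]   # simulated room echo path
    speech = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    mic = echo + speech                                # what the microphone "hears"
    cleaned = nlms_echo_cancel(mic, ref)
    print("echo power before:", np.mean((mic - speech) ** 2))
    print("echo power after: ", np.mean((cleaned - speech) ** 2))
```

The same basic idea, generalized across the HomePod's microphone array and combined with dereverberation, noise reduction, and the speaker-selection stage, is why the math in the blog post gets dense very quickly.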
The blog post includes a lot of math explaining the mechanics of the system, along with the team's test results. It says that the multichannel sound processing uses less than 15% of a single core of the HomePod's A8 chip, an important point, as the team was optimizing for energy efficiency as well.
If, like me, you don't understand the math, scroll to the bottom of the blog post and click the play buttons below the graphs to hear examples of the raw sound input and the processed result.
The Figure 7 example is particularly illuminating, as it shows just how much of the microphone input is drowned out by the music coming from the HomePod's own tweeters and woofer. You basically cannot hear the person's Siri request in the raw sound bite. The processed versions make it audible, but there is still a lot of audible interference that the other systems in the speech recognition workflow have to contend with.