The machine learning techniques that give the whisper to Alexa’s speech

In October 2018, Amazon introduced Whisper Mode to some of its products. Just over a year later, all users of Alexa devices can now whisper back and forth with them. Details on how this process works have now been revealed.

Amazon Echo Dot
Amazon device users have been enjoying the widespread introduction of Whisper Mode. Photo: Amazon

Maintaining clarity

Amazon shared a paper from the January 2020 issue of the journal IEEE Signal Processing Letters. In it, the company describes the machine learning techniques it used to implement Whisper Mode.

The goal was to convert normal speech into whispered speech while keeping Alexa's voice sounding natural. To that end, the company explored three different conversion techniques.

The first was a handcrafted digital-signal-processing (DSP) system based on an analysis of the acoustics of whispered speech. The other two were machine learning systems: one uses Gaussian mixture models (GMMs), while the other uses deep neural networks (DNNs).

These techniques were all evaluated through listener studies using the multiple stimuli with hidden reference and anchor (MUSHRA) methodology. Amazon concluded that both machine learning systems were highly effective, but that the DNN model generalized better to multiple and unfamiliar speakers.
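For readers unfamiliar with MUSHRA, the sketch below shows roughly how such a listening test is scored. Each listener rates every stimulus on a 0 to 100 scale, and the set of stimuli includes a hidden reference and a deliberately degraded anchor. The ratings, the listener names, and the screening threshold of 90 are all made up for illustration and are not taken from Amazon's study.

```python
from statistics import mean

# Hypothetical ratings on the 0-100 MUSHRA scale (NOT data from the paper).
# "reference" is the hidden reference, "anchor" a deliberately degraded sample.
RATINGS = {
    "listener_1": {"reference": 98, "anchor": 20, "dsp": 55, "gmm": 70, "dnn": 74},
    "listener_2": {"reference": 95, "anchor": 15, "dsp": 50, "gmm": 68, "dnn": 72},
    "listener_3": {"reference": 60, "anchor": 65, "dsp": 62, "gmm": 58, "dnn": 61},
}


def screen_listeners(ratings, min_reference_score=90):
    """Drop listeners who failed to rate the hidden reference highly.

    The threshold is an assumption for this sketch; real studies define
    their own post-screening rules.
    """
    return {
        name: scores
        for name, scores in ratings.items()
        if scores["reference"] >= min_reference_score
    }


def mean_scores(ratings):
    """Average each stimulus over the remaining listeners."""
    stimuli = sorted({s for scores in ratings.values() for s in scores})
    return {s: mean(scores[s] for scores in ratings.values()) for s in stimuli}


if __name__ == "__main__":
    kept = screen_listeners(RATINGS)
    for stimulus, score in mean_scores(kept).items():
        print(f"{stimulus}: {score}")
```

The hidden reference is what makes the screening step possible: a listener who cannot pick out the unprocessed recording is probably not rating reliably, so that listener's scores are set aside before averaging.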

Amazon Echo
Those with Amazon devices on their bedside table have found Whisper Mode to be useful at night. Photo: Amazon

Training data

VentureBeat reports that the GMMs tried to identify a range of values for each output feature corresponding to a related distribution of input values. Meanwhile, the DNNs adjusted their internal settings through a training process in which the networks attempted to predict the outputs associated with particular inputs.
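To make the GMM idea concrete, here is a minimal sketch of one common way such a mapping is done: a mixture model over joint (input, output) pairs, with conversion by conditional expectation. It uses a single scalar feature and hand-picked parameters as if a model had already been fit. The class name, function names, and numbers are hypothetical, and nothing here is confirmed to match Amazon's system.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """One Gaussian over a joint (x, y) pair: x = input feature, y = output feature."""
    weight: float
    mean_x: float
    mean_y: float
    var_x: float
    cov_xy: float


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def convert(x, components):
    """Return E[y | x] under the joint mixture.

    Each component proposes a linear prediction of y from x; the proposals
    are blended by how well each component explains the observed x.
    """
    likelihoods = [c.weight * gaussian_pdf(x, c.mean_x, c.var_x) for c in components]
    total = sum(likelihoods)
    return sum(
        (lik / total) * (c.mean_y + c.cov_xy / c.var_x * (x - c.mean_x))
        for lik, c in zip(likelihoods, components)
    )


# Hypothetical, hand-picked parameters (not learned from any real data).
COMPONENTS = [
    Component(weight=0.5, mean_x=-1.0, mean_y=-0.5, var_x=0.25, cov_xy=0.10),
    Component(weight=0.5, mean_x=1.0, mean_y=0.2, var_x=0.25, cov_xy=0.05),
]

if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0):
        print(x, round(convert(x, COMPONENTS), 4))
```

In a real system the features would be vectors and the parameters would be estimated from paired full-voice and whispered utterances. The sketch skips that fitting step entirely.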

“We used two different data sets to train our voice conversion systems, one that we produced ourselves using professional voice talent and one that is a standard benchmark in the field. Both data sets include pairs of utterances — one in full voice, one whispered — from many speakers,” Amazon said, as per its blog post.

“Like most neural text-to-speech systems, ours passes the acoustic-feature representation to a vocoder, which converts it into a continuous signal.”

Amazon Echo
Voice actors from five different countries were consulted to help with the whispered speech. Photo: Amazon

The experiments continue

To evaluate its voice conversion systems, Amazon compared their outputs to both recordings of natural speech and recordings of natural speech fed through a vocoder called WORLD.

The group used two sets of data to train its conversion systems, producing one of them with five professional voice actors from Australia, Canada, Germany, India, and the US. It then compared the systems' outputs to recordings of natural speech and to recordings of speech fed through a vocoder.

In their preliminary experiments, they trained the voice conversion systems on data from individual speakers and tested them on data from the same speakers.
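The speaker-dependent setup described above, where a network is trained on one speaker's data and tested on held-out data from the same speaker, can be illustrated with a toy example. The sketch below trains a very small network by gradient descent on synthetic input/output pairs. The target function, network size, learning rate, and all names are assumptions chosen for illustration and bear no relation to Amazon's actual architecture or features.

```python
import math
import random


def toy_target(x):
    """Stand-in for one speaker's full-voice to whisper feature mapping (made up)."""
    return 0.8 * x - 0.3


def init_network(hidden_units, rng):
    return {
        "w1": [rng.uniform(-0.5, 0.5) for _ in range(hidden_units)],
        "b1": [0.0] * hidden_units,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(hidden_units)],
        "b2": 0.0,
    }


def predict(net, x):
    """Return the prediction and the hidden activations for one scalar input."""
    hidden = [math.tanh(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return output, hidden


def train_step(net, x, y, lr):
    """One gradient-descent update on the squared error for a single pair."""
    pred, hidden = predict(net, x)
    err = pred - y
    for i, h in enumerate(hidden):
        # Backpropagate through tanh before the output weight is changed.
        grad_hidden = err * net["w2"][i] * (1.0 - h * h)
        net["w2"][i] -= lr * err * h
        net["w1"][i] -= lr * grad_hidden * x
        net["b1"][i] -= lr * grad_hidden
    net["b2"] -= lr * err


def mse(net, pairs):
    return sum((predict(net, x)[0] - y) ** 2 for x, y in pairs) / len(pairs)


def run_experiment(seed=0, epochs=500, lr=0.05):
    """Train on half of one synthetic speaker's pairs, test on the other half."""
    rng = random.Random(seed)
    xs = [i / 10 - 1.0 for i in range(21)]
    pairs = [(x, toy_target(x)) for x in xs]
    train, test = pairs[::2], pairs[1::2]
    net = init_network(4, rng)
    before = mse(net, test)
    for _ in range(epochs):
        for x, y in train:
            train_step(net, x, y, lr)
    return before, mse(net, test)


if __name__ == "__main__":
    before, after = run_experiment()
    print(f"held-out error before training: {before:.4f}, after: {after:.4f}")
```

Testing on unseen utterances from the same speaker answers a narrower question than testing on new speakers, which is why the comparison across multiple and unfamiliar speakers matters for a product used by everyone.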

Amazon Whisper
The MUSHRA scores for the naturalness of recorded speech (Rec), vocoded recorded speech (Oracle), and Amazon’s three experimental systems. Photo: Amazon

The finished product

They found that, while the raw recordings sounded most natural, whispers synthesized by the models sounded more natural than “vocoded” human speech. This then allowed the company to analyze how well the voice conversion process was performing.

The whispers heard on all Alexa devices are passed through Amazon's state-of-the-art neural vocoder, which further enhances speech quality.

Altogether, Amazon is set to have another strong decade ahead following ten years of unprecedented growth. By continuing to look at modern technology to improve its products, it will continue to stay ahead within its markets.

What are your thoughts on these machine learning techniques? Let us know what you think in the comment section.
