In October 2018, Amazon introduced Whisper Mode to some of its products. Just over a year later, all users of Alexa devices can now whisper back and forth with them. Details on how this process works have now been revealed.
Amazon shared a paper published in the January 2020 issue of the journal IEEE Signal Processing Letters, describing the machine learning techniques it used to implement Whisper Mode.
The goal was to convert Alexa's normal speech into whispered speech while preserving a natural tone. To that end, the team explored three different techniques for performing the conversion.
The first was a handcrafted digital-signal-processing (DSP) system, based on an analysis of the acoustics of whispered speech. The other two were machine learning systems: one using Gaussian mixture models (GMMs) and the other using deep neural networks (DNNs).
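Whispered speech differs from voiced speech mainly in its excitation: the vocal folds do not vibrate, so there is no pitch, only noise shaped by the vocal tract. The paper does not publish its DSP algorithm, but a minimal illustrative sketch of the general idea (keep each frame's spectral envelope, replace the periodic excitation with noise by randomizing the phase) can be written in a few lines of NumPy. The function name and all parameters below are illustrative choices, not Amazon's system:

```python
import numpy as np

def whisperize(signal, frame_len=512, hop=128):
    """Crude whisper effect: keep each frame's magnitude spectrum
    (the spectral envelope) but replace its phase with noise,
    destroying the periodic voiced excitation."""
    rng = np.random.default_rng(0)
    window = np.hanning(frame_len)
    out = np.zeros(len(signal) + frame_len)
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
        noisy = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len)
        out[start:start + frame_len] += noisy * window  # overlap-add
    return out[:len(signal)]

# demo: a voiced-like 200 Hz tone becomes aperiodic noise
# with roughly the same spectral envelope
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 200 * t)
whispered = whisperize(tone)
```

A production DSP approach would also model the formant shifts and flatter spectral tilt of real whispering, which this toy transform ignores.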
All three techniques were evaluated in listener studies using the multiple-stimuli-with-hidden-reference-and-anchor (MUSHRA) methodology. Amazon concluded that both machine learning systems were highly effective, but that the DNN model generalized better to multiple and unfamiliar speakers.
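In a MUSHRA test, listeners rate several versions of the same utterance on a 0-100 scale, including an unlabeled copy of the reference and a deliberately degraded anchor. A common post-screening rule (from ITU-R BS.1534) is to discard listeners who fail to rate the hidden reference highly. A minimal sketch of that aggregation, with made-up listener data and an assumed 90-point cutoff:

```python
import statistics

def mushra_scores(ratings, reference_key="hidden_reference", threshold=90):
    """Average MUSHRA ratings (0-100) per system, first excluding
    listeners who scored the hidden reference below `threshold`."""
    kept = [r for r in ratings.values() if r[reference_key] >= threshold]
    systems = kept[0].keys()
    return {s: statistics.mean(r[s] for r in kept) for s in systems}

# hypothetical ratings; listener_c misses the hidden reference and is excluded
ratings = {
    "listener_a": {"hidden_reference": 98, "anchor": 22, "dnn": 80, "gmm": 74},
    "listener_b": {"hidden_reference": 95, "anchor": 30, "dnn": 84, "gmm": 70},
    "listener_c": {"hidden_reference": 60, "anchor": 55, "dnn": 58, "gmm": 52},
}
scores = mushra_scores(ratings)
```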
VentureBeat reports that the GMMs learned, for each output feature, a range of values corresponding to a related distribution of input values, while the DNNs adjusted their internal parameters during training by attempting to predict the outputs associated with particular inputs.
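In GMM-based voice conversion, this input-to-output mapping is typically the conditional mean of the output feature given the input under a jointly trained mixture: each component contributes a local linear regression, weighted by its responsibility for the input. A toy sketch with invented two-component, one-dimensional parameters (nothing here is trained on real data):

```python
import numpy as np

# Toy joint GMM over (input, output) feature pairs; all parameters invented.
weights = np.array([0.5, 0.5])
mu_x = np.array([0.0, 3.0])     # per-component input means
mu_y = np.array([1.0, 5.0])     # per-component output means
var_x = np.array([1.0, 1.0])    # input variances
cov_xy = np.array([0.8, 0.6])   # input-output covariances

def convert(x):
    """Conditional-mean mapping E[y | x]: responsibilities p(k | x)
    weight the per-component linear regressions."""
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) \
        / np.sqrt(2 * np.pi * var_x)
    resp = lik / lik.sum()                      # p(k | x)
    y_k = mu_y + cov_xy / var_x * (x - mu_x)    # per-component regression
    return float(resp @ y_k)
```

An input near a component's mean is converted almost entirely by that component's regression; inputs between components get a blend, which is what gives GMM conversion its smooth behavior.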
“We used two different data sets to train our voice conversion systems, one that we produced ourselves using professional voice talent and one that is a standard benchmark in the field. Both data sets include pairs of utterances — one in full voice, one whispered — from many speakers,” Amazon said, as per its blog post.
“Like most neural text-to-speech systems, ours passes the acoustic-feature representation to a vocoder, which converts it into a continuous signal.”
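Amazon's vocoder is neural, but the feature-to-waveform step it describes can be illustrated with a classic signal-processing stand-in: Griffin-Lim reconstruction, which recovers a waveform from a magnitude spectrogram by alternating between time and frequency domains while refining only the phase. The sketch below is that generic technique in NumPy, not Amazon's vocoder:

```python
import numpy as np

N, HOP = 512, 128

def stft(x):
    w = np.hanning(N)
    return np.array([np.fft.rfft(x[i:i + N] * w)
                     for i in range(0, len(x) - N + 1, HOP)])

def istft(S):
    w = np.hanning(N)
    out = np.zeros(HOP * (len(S) - 1) + N)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        out[i * HOP:i * HOP + N] += np.fft.irfft(spec, n=N) * w
        norm[i * HOP:i * HOP + N] += w ** 2
    return out / np.where(norm > 1e-2, norm, 1.0)

def griffin_lim(mag, iters=32):
    """Iteratively re-impose the target magnitudes while keeping
    the phase implied by the previous time-domain estimate."""
    rng = np.random.default_rng(0)
    S = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(iters):
        S = mag * np.exp(1j * np.angle(stft(istft(S))))
    return istft(S)

# demo: rebuild a 440 Hz tone from its magnitude spectrogram alone
fs = 16000
t = np.arange(fs // 2) / fs
x = np.sin(2 * np.pi * 440 * t)
y = griffin_lim(np.abs(stft(x)))
```

Neural vocoders replace this iterative procedure with a learned model, which is what lets them produce the higher-quality speech the paper reports.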
The experiments continue
To evaluate the voice conversion systems, the team compared their outputs to both recordings of natural speech and recordings of natural speech fed through a vocoder called WORLD. The training data included speech produced by five professional voice actors from Australia, Canada, Germany, India, and the US.
In their preliminary experiments, they trained the voice conversion systems on data from individual speakers and tested them on data from the same speakers.
The finished product
They found that, while the raw recordings sounded most natural, whispers synthesized by the models sounded more natural than "vocoded" human speech. Comparing against the vocoded recordings allowed the company to gauge how well the voice conversion step itself was performing, separate from any quality loss introduced by the vocoder.
The version of the whispers shipped to Alexa devices has also passed through Amazon's state-of-the-art neural vocoder, which enhances the speech quality further.
Altogether, Amazon looks set for another strong decade following ten years of unprecedented growth. By continuing to apply modern technology to improve its products, it is well placed to stay ahead in its markets.
What are your thoughts on these machine learning techniques? Let us know what you think in the comment section.