INTRO: AI researchers at Google and University College London have detailed an AI model that can control speech characteristics such as pitch, emotion, and speaking rate with just 30 minutes of labeled data.
AI Algorithm Will Control Speech Characteristics: Google Researchers
According to the study, using just 30 minutes of labeled data enabled the AI algorithm to have a ‘significant degree’ of control over speech rate, valence, and arousal. Their paper, which has been published by the International Conference on Learning Representations (ICLR), details how the researchers trained the AI system for 300,000 steps across 32 of Google’s custom-designed tensor processing units (TPUs).
Voice-codec to Analyse Voice Data
The researchers further said that the new system produces spectrograms, which are visual representations of the frequencies in a sound. A second model, such as DeepMind’s WaveNet, is then trained to act as a vocoder, a voice codec that analyzes and synthesizes voice data, and turns those spectrograms into audio.
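To make the idea of a spectrogram concrete, here is a minimal sketch in plain Python. It is an illustration only and not the researchers' model or code: the function name `spectrogram`, the frame size, the hop length, and the 1,000 Hz test tone are all assumptions chosen for the example. It slices a signal into overlapping frames and measures how much energy each frequency carries in each frame.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency magnitudes per overlapping frame.

    Illustrative only: a direct DFT is used for readability, which is
    far slower than the FFT routines a real speech system would use.
    """
    # Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Only bins 0..frame_size/2 are needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Hypothetical test signal: a pure 1,000 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(512)]

spec = spectrogram(signal)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # 64 is the default frame_size
print(len(spec), "frames; peak at", peak_hz, "Hz")
```

Each row of `spec` is one time slice and each column is one frequency band, which is the grid a spectrogram image displays. A vocoder has the opposite job: it takes a grid like this and generates the audio waveform that matches it.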
What’s really interesting is that the new AI model seems to address a critical limitation of an earlier study that investigated the use of ‘style tokens’, which represented different categories of emotion, to control speech effects. While that model achieved good results with only 5 percent of labeled data, it wasn’t able to satisfactorily modify speech samples that used different tones, stress, intonations, and rhythms while conveying the same emotion.
The labeled data set comprised around 45 hours of audio, made up of 72,405 recordings of 5 seconds each from 40 English speakers. The speakers selected were all trained voice actors who read pre-written texts with varying levels of valence (emotions like sadness or happiness) and arousal (excitement or energy). The researchers used those recordings to obtain six ‘affective states’, which were then modeled and used as labels for the AI algorithm to train on.
While the researchers admit that the new AI model can make it easier for unscrupulous parties to spread misinformation or commit fraud, they also claim that the benefits in this case far outweigh the possible risks because the study can eventually improve human-computer interfaces significantly.