Like Henry Higgins, the phonetician from George Bernard Shaw’s play ‘Pygmalion’, Marius Cotescu and Georgi Tinchev recently demonstrated how their student tried to overcome pronunciation problems.
The two data scientists, who work for Amazon in Europe, taught Alexa, the company’s digital assistant. Their job: to help Alexa master English with an Irish accent using artificial intelligence and recordings of native speakers.
During the demonstration, Alexa talked about a memorable night out. “The party last night was great craic,” Alexa said in a lilting voice, using the Irish word for fun. “We got ice cream on the way home and we were happy out.”
Mr. Tinchev shook his head. Alexa had dropped the “r” in “party,” making the word sound flat, like pah-tee. Too British, he concluded.
The technologists are part of a team at Amazon working on a challenging area of data science known as disentangling voices. It’s a thorny problem that has taken on new relevance amid a wave of AI developments. Researchers believe solving it could help make AI-powered devices, bots, and speech synthesizers more conversational, meaning able to speak in a wide range of regional accents.
Untangling voices involves much more than understanding vocabulary and syntax. A speaker’s pitch, timbre, and accent often give words a nuanced meaning and emotional weight. Linguists call this feature “prosody,” and it is something machines have a hard time mastering.
Only in recent years, thanks to advances in AI, computer chips and other hardware, have researchers made strides in solving the problem of disentangling voices and transforming computer-generated speech into something more pleasing to the ear.
Such work may eventually coincide with an explosion of “generative AI,” a technology that allows chatbots to generate their own responses, researchers said. Chatbots like ChatGPT and Bard may one day fully respond to users’ voice commands and respond verbally. At the same time, voice assistants like Alexa and Apple’s Siri will become more conversational, potentially rekindling consumer interest in a tech segment that had seemingly stalled, analysts said.
Getting voice assistants like Alexa, Siri, and Google Assistant to speak multiple languages has been an expensive and lengthy process. Tech companies hired voice actors to record hundreds of hours of speech, which helped create synthetic voices for digital assistants. Sophisticated AI systems known as “text-to-speech models,” so called because they convert text into natural-sounding synthetic speech, are only just beginning to streamline this process.
The technology “is now capable of creating a human voice and synthetic audio from text input, in a variety of languages, accents and dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.
Amazon is under pressure to catch up with rivals like Microsoft and Google in the AI race. In April, Amazon CEO Andy Jassy told Wall Street analysts that the company planned to make Alexa “even more proactive and talkative” using advanced generative AI. Rohit Prasad, Amazon’s lead scientist for Alexa, told CNBC in May that he saw the voice assistant as a voice-controlled, “readily available, personal AI.”
Irish Alexa made its commercial debut in November, after nine months of training to understand, and then pronounce, an Irish accent.
“Accent is different from language,” Mr Prasad said in an interview. AI technologies must learn to extract the accent from other parts of speech, such as tone and frequency, before they can replicate the idiosyncrasies of local dialects. For example, the ‘a’ may be flatter and the ‘t’s pronounced more forcefully.
These systems need to figure out these patterns “so you can synthesize a whole new accent,” he said. “That’s difficult.”
It was even harder to get the technology to learn a new accent largely on its own, from a speech model trained on different-sounding accents. That’s what Mr. Cotescu’s team tried to do when building Irish Alexa. They relied heavily on an existing speech model trained mostly on British English accents, with a much smaller share of American, Canadian, and Australian accents, to teach it to speak Irish English.
The team faced several linguistic challenges from Irish English. The Irish tend to drop the “h” in “th,” pronouncing the letters as a hard “t” or a “d,” so “bath” can sound like “bat” or even “bad.” Irish English is also rhotic, meaning the “r” is pronounced more strongly; the “r” in “party” is clearer than what you might hear from a Londoner. Alexa had to learn and master these speech features.
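The rhotic and “th”-hardening patterns described above can be sketched as a toy phoneme comparison. This is purely illustrative: the phoneme dictionaries below are hand-written ARPAbet-style approximations, not Amazon’s data, and real text-to-speech models learn such mappings from recorded speech rather than lookup tables.

```python
# Illustrative only: toy rhotic (Irish) vs non-rhotic (Southern British)
# phoneme sequences, written in ARPAbet-style symbols by hand.

BRITISH = {
    "party": ["P", "AA", "T", "IY"],        # non-rhotic: the "r" is dropped
    "bath":  ["B", "AA", "TH"],
}

IRISH = {
    "party": ["P", "AA", "R", "T", "IY"],   # rhotic: the "r" is pronounced
    "bath":  ["B", "AA", "T"],              # "th" hardens to "t"
}

def accent_differences(word):
    """Return the phonemes present in one accent but not the other."""
    b, i = set(BRITISH[word]), set(IRISH[word])
    return {"irish_only": sorted(i - b), "british_only": sorted(b - i)}

for w in ("party", "bath"):
    print(w, accent_differences(w))
```

Running the sketch surfaces exactly the two features the team wrestled with: “party” gains an “R” in the Irish rendering, and “bath” swaps “TH” for a hard “T.”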
Irish English, said Mr Cotescu, who is Romanian and was the lead researcher on the Irish Alexa team, “is difficult.”
The speech models that power Alexa’s verbal skills have become more sophisticated in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English-speaking model.
Mr. Cotescu and the team saw accents as the next frontier of Alexa’s speech capabilities. They designed Irish Alexa to rely more on AI than actors to build its speech model. As a result, Irish Alexa was trained on a relatively small corpus – about 24 hours of recordings by voice actors reciting 2,000 utterances in Irish-accented English.
In the beginning, when Amazon researchers passed the Irish recordings to the still-learning Irish Alexa, strange things happened.
Letters and syllables occasionally dropped out of the answers. “S’s” sometimes stuck together. One or two words, sometimes crucial ones, were muttered inexplicably and were incomprehensible. In at least one instance, Alexa’s female voice dropped a few octaves and sounded more masculine. Even worse, the male voice sounded distinctly British, the kind of slip that raises eyebrows in some Irish homes.
“They’re big black boxes,” Mr Tinchev, a Bulgarian who is Amazon’s chief scientist on the project, said of the speech models. “You have to experiment a lot to fine-tune them.”
That’s what the technologists did to correct Alexa’s “party” blunder. They disentangled the speech word by word and phoneme by phoneme (a phoneme is the tiniest audible sliver of a word) to pinpoint where Alexa slipped, then fine-tuned the model. They fed the Irish Alexa speech model more recorded speech data to correct the mispronunciation.
The result: the “r” in “party” returned. But then the “p” disappeared.
So the data scientists went through the same process again. They finally landed on the phoneme that contained the missing “p”. They then further refined the model so that the “p” sound returned and the “r” did not disappear. Alexa finally learned to speak like a Dubliner.
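The diagnose-and-fix cycle the article describes, finding the phoneme where the model slipped, patching it, and re-checking, can be caricatured in a few lines. This is a stand-in sketch: real fine-tuning retrains a neural speech model on additional recordings, whereas this toy simply re-inserts dropped phonemes until the word matches its target.

```python
# Toy sketch of the diagnose-fix-recheck loop described above.
# Real systems retrain a model on more recorded speech; here a
# "fix" just re-inserts a dropped phoneme. Handles dropped
# phonemes only, matching the two bugs in the story.

TARGET = ["P", "AA", "R", "T", "IY"]  # rhotic Irish "party"

def first_mismatch(emitted, target):
    """Index of the first phoneme that differs from the target, or None."""
    for i, want in enumerate(target):
        if i >= len(emitted) or emitted[i] != want:
            return i
    return None

def patch_dropped_phonemes(emitted, target=TARGET):
    """Re-insert dropped phonemes one at a time until the word matches."""
    fixes = []
    while emitted != target:
        i = first_mismatch(emitted, target)
        fixes.append(target[i])                      # phoneme we restored
        emitted = emitted[:i] + [target[i]] + emitted[i:]
    return fixes

print(patch_dropped_phonemes(["P", "AA", "T", "IY"]))  # dropped "r": ['R']
print(patch_dropped_phonemes(["AA", "R", "T", "IY"]))  # dropped "p": ['P']
```

Note how fixing one phoneme requires re-running the full check, mirroring the team’s experience of restoring the “r” only to find the “p” had vanished.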
Two Irish linguists – Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a doctoral student working in Trinity College Dublin’s Phonetics and Speech Laboratory – have since given high marks to the Irish Alexa accent. The way the Irish Alexa emphasized “r’s” and softened “t’s” stood out, they said, and overall Amazon got the accent right.
“It sounds authentic to me,” Ms. Tallon said.
Amazon’s researchers said they were pleased with the mostly positive feedback. That their speech models could unravel the Irish accent so quickly gave them hope that they could replicate accents elsewhere.
“We also plan to extend our methodology to language accents other than English,” they wrote in a January research paper on the Irish Alexa project.