New research into text-to-speech technology has significantly improved the ability of computers to mimic human speech patterns.

Researchers at Google have found a way to dramatically improve the cadence and intonation of computer-generated speech. It's a substantial step toward the kind of sophisticated speech synthesis that has, so far, existed only in the realm of sci-fi.

The ability to speak naturally is often treated as a vital component of humanity. Mechanical life forms in Star Trek: The Next Generation and its various spin-offs almost always speak with mannerisms intended to convey their artificiality, even when their intentions are perfectly benign. Despite previous advances, assistants like Alexa, Siri, Cortana, and Google Assistant would rarely be mistaken for a human. A large part of the reason we can still tell a computer voice apart is its (mis)use of prosody: the pattern of intonation, tone, rhythm, and stress within a language.

There's an old joke about the importance of commas. It compares two simple sentences: "It's time to eat Grandma" conveys a rather different meaning than "It's time to eat, Grandma." Here, the comma tells the reader how the sentence should be pronounced and interpreted. Not all prosodic information is encoded in punctuation or grammar, however, and teaching computers to interpret and use that information has been a major stumbling block.

Licensing prevents us from embedding Google's speech samples directly, but it's worth visiting Google's page to hear how the new advances affect pronunciation and diction. None of the clips sounds completely human, but they're a substantial improvement on what came before.


Click to visit the Google page with audio samples.
