These days, most computer voices are passé. You probably don’t get too amped up about cyborgs and robots when you hear the "droid" on your phone helping you with a bill payment or asking you what department you want. But what if you suddenly heard Kurt Cobain prodding you for card information? Or John F. Kennedy telling you about the wonders of early voting? Or Elvis getting your name and address down before breaking into "a hunk, a hunk of burning love"?
All of these would be … kinda weird, but what’s even more fascinating is that the technology is basically already here. Just a decade or so ago, we were amazed that a computer could talk at all. Now, we’re about to be floored by free-ranging, computer-generated voices that sound just like folks we know.
Big Changes in NLP
If you’re paying attention to the field of natural language processing (NLP), you may have heard about some recent advances that go beyond the kinds of canned virtual assistant voices that we now hear in our global positioning systems (GPS) and automated business phone lines.
The early days of NLP required a whole lot of research into the general mechanics of human speech. Researchers and engineers had to identify individual phonemes, fold them into larger algorithms for generating phrases and sentences, and then try to manage all of it at a meta-level to generate something that sounded real. Over time, NLP leaders mastered this and started building advanced algorithms to understand what humans say. Putting these two together, companies came up with the drivers for today’s virtual assistants and fully digital bill-pay clerks, whose mannerisms, while annoying, are still amazing when you stop to think about the work that went into them.
Now, some companies are going beyond the generic virtual voice to put together a more specific, personalized result. This requires going through a particular person’s lexicon and collecting large amounts of unique voice recordings, then applying this archive to the complex algorithms for phonetics, emphasis, cadence and all the other tiny cues that linguists often group under the broad banner of "prosody."
What comes out is a voice that listeners think of as "owned" by a particular person – either someone they know and have spoken with, or someone whose voice they recognize as a result of the person's fame.
From Elvis to Martin Luther King, anyone’s voice can now be "cloned" this way – provided there’s a substantial prerecorded record of their speech. By applying even more detailed analysis and manipulation to individual small sounds, companies are able to make a virtual carbon copy of someone’s voice that sounds a lot like the real thing.
Exciting "Text to Voice" Creations at VivoText
VivoText, for example, is one company that's working to revolutionize the use of artificial human voices for all kinds of campaigns, from audiobooks to interactive voice response (IVR). At VivoText, research and production teams are working on processes that, theoretically, could specifically replicate the voices of deceased celebrities, such as Ol' Blue Eyes himself.
"To clone Frank Sinatra’s voice, we would actually go through his recorded legacy," says VivoText CEO Gershon Silbert, talking about how this kind of technology could work.
Right now, VivoText is working on archiving the voices of those who are still with us, such as NPR correspondent Neal Conan, who has signed up as a model for this kind of pioneering IT project. A promotional video shows VivoText workers painstakingly creating phonetic code modules from voice input provided by Conan. They then create the models for text-to-speech (TTS) tools that produce a strikingly human and personalized result.
According to Ben Feibleman, vice president of strategy and business development at VivoText, the computer works at the phoneme level (using the smallest distinct units of speech sound) to conform to a prosodic model for an individual human voice.
"It knows how the voice talks," says Feibleman, adding that with "unit selection," the computer chooses a number of pieces to put together even a single short word. The word "Friday," for example, is given five components that together produce a particular emphasis and tone.
Artificial Voice in Marketing
So, how does this work in marketing? VivoText’s technology could be extremely useful in creating products, like audiobooks, that reach target audiences. For example, how much more effective would an Elvis voice be at selling entertainment-related products than one of today's generic, deadpan, automated voices?
Or, how about in politics? Feibleman has been working on various ideas for using projects like these to enhance marketing for companies or other parties that need more effective messaging.
"If you know any politicians running for president, this could have 10 million swing-state voters get a personal call from a candidate, thanking them for their support, telling them where they need to go to vote, the weather and all the trimmings the night before the election," Feibleman said.
Your Voice Lives On
There is another obvious application to all of this technology. Natural language companies like VivoText could create a personal service that would upload all of a customer’s voice data into a product that would allow that person to "speak forever."
Practical implementation would likely raise a number of questions about how we hear and internalize spoken voices. For example, what does it take to make a sound stream sound exactly like somebody? How well do we have to know a person to recognize a particular voice? And, interestingly, what happens if a natural language service produces a crude caricature, rather than a compelling mimicry?
Evaluating results, says Feibleman, often depends on context. For example, he says that children usually don’t ask questions about who’s speaking when they listen to a story. They just want more. Many adults, too, may not think about who’s talking to them in a particular scenario, such as a passive broadcast or phone message. It’s also easier to be fooled by a computer over the phone because the muffled sound can mask glitches or other discrepancies between the computer's output and a human voice.
"It doesn’t occur to you to challenge the authenticity of the voice," says Feibleman.
In the Year 2525
As companies move forward in developing products and services and answering these questions, "living speech" technologies could advance us toward that convergence of technology and the human mind, which has classically been called artificial intelligence (AI).
If computers can speak like us, they may be able to trick people into thinking that they think like us, feeding into the larger idea of the singularity, a term ushered into our lexicon by John von Neumann, a 1950s-era tech pioneer, and later evangelized by writers and thinkers like Ray Kurzweil. Kurzweil's 2005 book, "The Singularity Is Near," excites some and scares others. Kurzweil predicted that by 2045, "intelligence" as a phenomenon will become largely unglued from the human brain and migrate into technology, blurring the lines between machines and their human masters.
Immortalized in the lyrics of Zager & Evans' "In the Year 2525" (nobody does creepy sci-fi ballads like these guys)…
In the year 4545
You ain't gonna need your teeth, won't need your eyes
You won't find a thing to chew
Nobody's gonna look at you
In the year 5555
Your arms hangin' limp at your sides
Your legs got nothin' to do
Some machine's doin' that for you
Are computer voices a step in this direction? As a new way to outsource some of the functions of the human body (or more commonly, to simulate them), this kind of tech progress is one of the biggest – and probably underreported – advances on the horizon as we look into a singular future. (Read more about "the singularity" in Will Computers Be Able to Imitate the Human Mind?)