Text to speech (TTS) is a natural language modeling process that converts units of text into units of speech for audio presentation. It is the opposite of speech to text, in which a system takes in spoken words and tries to record them accurately as text. Text to speech is now common in technologies that render digital text as audio, both to assist people who are unable to read and for other uses.
Developing text-to-speech capability poses some unique challenges. English in particular has many homographs, words that share a spelling but differ in pronunciation (for example, "read" in the present and past tense), so computer programs rely on probability modeling to guess the intended pronunciation of a word in digital text. The program also has to convert units of text into phonemes, the smallest units of sound that distinguish one word from another. As a result, many text-to-speech technologies remain fallible, although developers have made considerable progress on them over several years.
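The pronunciation-guessing step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the ARPAbet-style phoneme symbols, the probability values, and the function name `to_phonemes` are hypothetical, and conditioning on only the previous word is a deliberate simplification of what a real system would do.

```python
# Toy text-to-phoneme conversion with probability-based homograph handling.
# All data below is hypothetical and exists only to illustrate the idea.

# Words with a single pronunciation: word -> list of phoneme symbols.
LEXICON = {
    "i": ["AY"],
    "will": ["W", "IH", "L"],
    "have": ["HH", "AE", "V"],
    "the": ["DH", "AH"],
    "book": ["B", "UH", "K"],
}

# Homographs: one spelling, several candidate pronunciations.
HOMOGRAPHS = {
    "read": {
        "present": ["R", "IY", "D"],  # rhymes with "reed"
        "past": ["R", "EH", "D"],     # rhymes with "red"
    },
}

# Made-up probabilities of each reading, given the previous word.
CONTEXT_PROBS = {
    "read": {
        "will": {"present": 0.95, "past": 0.05},
        "have": {"present": 0.05, "past": 0.95},
    },
}

# Fallback probabilities when the previous word gives no evidence.
DEFAULT_PROBS = {
    "read": {"present": 0.6, "past": 0.4},
}


def to_phonemes(text):
    """Return one phoneme list per word, picking the likeliest homograph reading."""
    result = []
    previous = None
    for word in text.lower().split():
        if word in HOMOGRAPHS:
            probs = CONTEXT_PROBS[word].get(previous, DEFAULT_PROBS[word])
            best_reading = max(probs, key=probs.get)
            result.append(HOMOGRAPHS[word][best_reading])
        elif word in LEXICON:
            result.append(LEXICON[word])
        else:
            raise KeyError(f"no pronunciation known for {word!r}")
        previous = word
    return result


if __name__ == "__main__":
    print(to_phonemes("I will read the book"))
    print(to_phonemes("I have read the book"))
```

The two example sentences differ only in the word before "read", which is enough for this toy model to choose different pronunciations. A production system would draw on much richer context, which is one reason the guess can still be wrong.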
Over time, experts have identified some best practices for TTS development. These include phoneme-based and concatenative approaches, which join stored units of speech, combined with predictive analytics. The best programs also run with minimal memory requirements and are easy to set up. Developers continue to build TTS resources for individual languages, working through ambiguity and other obstacles to more accurate rendering.
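As a rough illustration of the concatenative idea, the sketch below joins stored sound units end to end with a short crossfade at each boundary. It rests on several assumptions: the unit inventory holds synthetic sine tones as stand-ins for recorded speech units, and the names `UNIT_INVENTORY`, `tone`, and `concatenate`, the sample rate, and the overlap length are all hypothetical choices rather than part of any particular TTS system.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value)


def tone(freq_hz, n_samples):
    """Generate a sine tone; a placeholder for a recorded speech unit."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


# Hypothetical inventory: phoneme symbol -> stored waveform samples.
UNIT_INVENTORY = {
    "HH": tone(300, 400),
    "AY": tone(440, 800),
}


def concatenate(units, overlap=40):
    """Join stored units in order, crossfading `overlap` samples at each join."""
    out = []
    for name in units:
        unit = UNIT_INVENTORY[name]
        if not out:
            out.extend(unit)
            continue
        k = min(overlap, len(out), len(unit))
        start = len(out) - k
        # Linear crossfade: fade the tail of the output into the head of the unit.
        for i in range(k):
            w = (i + 1) / (k + 1)
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[k:])
    return out


if __name__ == "__main__":
    samples = concatenate(["HH", "AY"])
    print(len(samples), "samples,", len(samples) / SAMPLE_RATE, "seconds")
```

The crossfade is one simple way to soften the audible seam between units. Storing the units ahead of time is also where the memory cost of this approach comes from, which ties back to the memory requirements mentioned above.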