What is the difference between speech to text and chatbots?

Answer

The differences between speech-to-text technologies and chatbots are numerous and significant, and they are part of what is being examined as chatbot and voicebot projects rapidly evolve.

A speech-to-text technology is simply one that converts verbal speech to text on a digital page. That’s its full function, but it's not one that's simple to design. In order to convert verbal speech to text, the technology has to break words and sentences down into individual phonemes and work with them according to complex algorithms to create text that is accurate and represents what the speaker said.
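As a rough illustration of the last step described above, turning recognized phonemes into words, here is a minimal Python sketch. It assumes the audio has already been converted into a sequence of phoneme symbols, and it uses a tiny hand-made lexicon (LEXICON and decode are hypothetical names, not part of any real product). Real systems rely on statistical acoustic and language models rather than a simple lookup.

```python
# Toy lexicon mapping phoneme sequences to words (hypothetical, for illustration only).
LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("AY",): "I",
}


def decode(phonemes):
    """Greedy longest-match decoding of a phoneme list into text."""
    words = []
    i = 0
    max_len = max(len(key) for key in LEXICON)
    while i < len(phonemes):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i:i + n])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])
                i += n
                break
        else:
            # No lexicon entry starts here: mark it unknown and move on.
            words.append("<unk>")
            i += 1
    return " ".join(words)


print(decode(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
```

Even this toy version hints at why the real task is hard: the lookup only works when the phonemes are recognized perfectly and every word is already in the lexicon.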

Chatbots, on the other hand, are technologies that accomplish the goal of communicating with a human. There are two types of chatbots: text chatbots and voicebots. Text chatbots have been around much longer, because they don't need the speech-to-text element that voicebots utilize.

The main difference between speech-to-text technologies and chatbots is scope. As mentioned, all a speech-to-text technology needs to do is transcribe verbal speech. A chatbot, on the other hand, needs to take in language in whichever form it is built for (text or voice), understand it, and provide responses that seek to pass the Turing test: the test of whether a technology can fool a human into thinking that he or she is communicating with another person.

With that in mind, text chatbots are far easier to create than voicebots. A text chatbot takes in the human's text and provides a text response. Even relatively simple chatbots have been able to provide interesting and enjoyable results for humans since the late 1980s and early 1990s.
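To show how little machinery a simple text-in, text-out bot needs, here is a minimal rule-based sketch in Python. The rules, the fallback line and the function name reply are all made up for illustration. Note that this bot matches patterns and does not understand anything it reads.

```python
import re

# Each rule pairs a pattern with a response template (hypothetical examples).
RULES = [
    (re.compile(r"\b(hello|hi|hey)\b", re.I), "Hello! How can I help you today?"),
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}."),
    (re.compile(r"\b(bye|goodbye)\b", re.I), "Goodbye!"),
]
FALLBACK = "Can you tell me more about that?"


def reply(message: str) -> str:
    """Return the response for the first rule that matches the message."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Captured groups (such as a name) fill the template placeholders.
            return template.format(*match.groups())
    return FALLBACK


print(reply("Hi there"))
print(reply("My name is Ada"))
print(reply("What is the weather?"))
```

Anything outside the three rules falls through to the canned fallback line, which is exactly the kind of limitation that natural language understanding is meant to overcome.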

The voicebot, on the other hand, has to take in verbal speech, convert it to text, check it for accuracy, produce a response, and convert that response from text into audible speech. This chain of fairly significant tasks means that the voicebot takes a lot of computing power and a lot of design work to build.
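The chain of steps above can be sketched as a small Python pipeline. Every stage here is a stand-in: speech_to_text, generate_response and text_to_speech are hypothetical placeholder functions that treat audio as UTF-8 bytes, and the confidence value is invented. A real voicebot would call a speech recognition engine and a speech synthesizer at those points.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def speech_to_text(audio: bytes) -> Transcript:
    # Stand-in recognizer: pretends the audio bytes are UTF-8 text.
    text = audio.decode("utf-8").strip()
    return Transcript(text=text, confidence=0.9 if text else 0.0)


def generate_response(text: str) -> str:
    # Stand-in for the "understanding" step, which is the hard part in practice.
    if "weather" in text.lower():
        return "I cannot check the weather in this sketch."
    return f"You said: {text}"


def text_to_speech(text: str) -> bytes:
    # Stand-in synthesizer: returns the text as bytes instead of audio.
    return text.encode("utf-8")


def voicebot_turn(audio: bytes, min_confidence: float = 0.5) -> bytes:
    """Run one turn: transcribe, check accuracy, respond, synthesize."""
    transcript = speech_to_text(audio)
    if transcript.confidence < min_confidence:
        answer = "Sorry, I did not catch that."
    else:
        answer = generate_response(transcript.text)
    return text_to_speech(answer)


print(voicebot_turn(b"hello"))
```

A text chatbot corresponds to the generate_response step alone. The voicebot wraps that step in recognition and synthesis, and each extra stage can introduce its own errors.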

Projects like Siri, Cortana and Alexa are in the vanguard of voicebot technologies. They also illustrate that this technology is still in its infancy. Although Alexa and other technologies can respond verbally to human speech, they are far less capable than what we associate with human conversation in general. In other words, the responses that these technologies can provide are quite limited. Today's generation of personal assistants even has a limited ability to produce long-form speech to text, for example, for the purposes of transcribing an email or helping someone write an essay without using their hands. Some of the dedicated speech-to-text programs on the market do this better than Siri or Cortana, probably because their resources are devoted to transcription alone. However, there are signs that voicebot progress is soon going to take off, such as Amazon's Lex platform, which provides a studio environment for building these types of technologies.

In a clever and instructive essay on the subject, Tobias Goebel talks about the difference between these technologies, contrasting the process of “transcribing,” which speech to text does, with the job of understanding, which chatbots are supposed to do.

“While eliminating the need for speech recognition does make things easier for a chatbot, the main challenge to build functioning bots lies in natural language understanding,” Goebel writes.

Goebel also identifies many of the current players in the industry:

“The market leader for speech recognition is Nuance, who is behind well-known systems such as Dragon NaturallySpeaking for dictation on a PC, which has been around since the nineties, but also Siri: the speech recognition/transcription task conducted in the Apple cloud uses Nuance technology behind the scenes. Others are LumenVox, Verbio, or Interactions, but speech recognition is now also offered as a cloud service via APIs by the likes of Amazon, Google, Microsoft, and IBM.”

As chatbots develop, it’s assumed that their understanding of language will continue to improve. It’s also largely assumed that more bot technology will move from text interfaces to voice interfaces, which will require additional computing power.