Voice AI: Latency deep dive

Jul 27, 2024

Premise


Voice AI takes centre stage in 2024 as it's the natural path of evolution in Gen AI. Humans want to speak to AI the way they speak with other humans, without compromising on the experience. So what makes a conversation feel acceptably human-like?

Today let's dive into the depths of latency in real-time conversations: what counts as an acceptable state, and how latency can be lowered.

⚡Low latency is the backbone of seamless Voice AI interactions – where milliseconds matter more than ever.


The works


When humans communicate with AI, several key components are involved, but the noteworthy ones in the current state of the art (SOTA) are:


💬 ASR (Automatic speech recognition): When humans speak, the machine has to recognize that speech. This is done via ASR.


🧠 Intelligence: Once human speech is recognized, an intelligence layer has to process what the speech means and decide how to respond.


🗣 TTS (Text to speech): The processed answer from the intelligence layer needs to be converted back into speech.


All three of these key components work in tandem to provide end-to-end Voice AI conversations with humans in real time. (Here's where I could promote Digitar, but I won't 😇)
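As a rough sketch, one conversational turn flows through the three stages in sequence. The functions below are hypothetical stubs standing in for real ASR, LLM, and TTS services, not any actual API:

```python
# Minimal sketch of the ASR -> Intelligence -> TTS pipeline.
# All three functions are hypothetical stand-ins for real services.

def recognize_speech(audio: bytes) -> str:
    """ASR: convert raw audio into a transcript (stubbed)."""
    return "what's the weather like today"

def generate_response(transcript: str) -> str:
    """Intelligence: produce a reply to the transcript (stubbed)."""
    return f"You asked: '{transcript}'. It looks sunny."

def synthesize_speech(text: str) -> bytes:
    """TTS: convert the reply text into audio (stubbed)."""
    return text.encode("utf-8")  # placeholder for real audio bytes

def voice_turn(audio: bytes) -> bytes:
    """One conversational turn: audio in, audio out."""
    transcript = recognize_speech(audio)
    reply = generate_response(transcript)
    return synthesize_speech(reply)
```

Every millisecond spent inside any of these three calls adds directly to the wait the user experiences.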

Now, latency adds up across all of these components, since each one needs time for processing. Humans accept an end-to-end latency of 500 to 1000 milliseconds; in this range, waiting for an AI response doesn't feel awkward. So 500-1000 milliseconds end to end (ASR + Intelligence + TTS) is an acceptable state without lowering the experience.
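To make the budget concrete, here is how the per-component numbers might add up for one turn. The individual figures are illustrative assumptions, not measurements:

```python
# Illustrative end-to-end latency budget for one voice turn (milliseconds).
# The per-component numbers are example values, not benchmarks.
budget_ms = {
    "asr": 300,           # speech recognition
    "intelligence": 350,  # LLM response generation
    "tts": 250,           # speech synthesis
}

total_ms = sum(budget_ms.values())
print(f"end-to-end: {total_ms} ms")  # end-to-end: 900 ms

# The conversation feels natural if the total lands in the 500-1000 ms window.
assert 500 <= total_ms <= 1000
```

Note how little slack there is: if any single component slips by a few hundred milliseconds, the whole turn falls out of the acceptable window.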

To achieve this extremely challenging target, startups and incumbents alike are striving hard to optimize latency in several areas.

💬 ASR: Companies like Deepgram are delivering recognition at ~300 milliseconds. Several speech models are reaching this state, and they are now also optimizing for streaming inputs (the way to go for real-time conversations).
🧠 Intelligence: Generative pretrained transformer models are currently the latency bottleneck, as the text-to-text response alone hovers around a couple of seconds. Various optimizations across the industry aim to bring this down to a few hundred milliseconds, and it's a work in progress.

Digitar AI specializes in providing Intelligence with the lowest latency and highest accuracy for Voice AI scenarios.


🗣 TTS: Here too, companies like ElevenLabs are trying to deliver synthesis at ~250 milliseconds. This is also a work in progress.

There are also techniques for tackling the intelligence-layer latency problem, such as faster inferencing via Language Processing Units (LPUs, e.g. Groq) and edge-device-ready 1-bit LLMs (Microsoft). These approaches are drastically different, and they have already brought down the latency of the intelligence layer (it's also a work in progress).
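Another widely used trick, independent of faster hardware, is to overlap the stages: stream the LLM's tokens and hand each complete sentence to TTS as soon as it arrives, so the first audio starts playing before the full reply is even generated. A minimal sketch, where the token generator is a stand-in for a real streaming LLM:

```python
import re
from typing import Iterator

def stream_tokens() -> Iterator[str]:
    """Stand-in for a streaming LLM: yields the reply token by token."""
    for token in "It is sunny today. Highs near 30 degrees. Enjoy!".split(" "):
        yield token + " "

def sentences_from_stream(tokens: Iterator[str]) -> Iterator[str]:
    """Buffer tokens and emit each sentence as soon as it is complete,
    so TTS can start speaking before the LLM has finished the reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Cut at sentence-ending punctuation followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

for sentence in sentences_from_stream(stream_tokens()):
    print(sentence)  # each line would be sent to TTS immediately
```

With this overlap, the perceived latency is roughly ASR plus time-to-first-sentence, rather than the full generation time of the whole reply.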


It's shaping up to be an interesting year for low-latency conversation in Voice AI. Let's see how it goes.

Until next time.
Best Regards
Mukunda
