Speech-based LLMs lag behind their text counterparts. Early attempts to model raw audio waveforms directly, like DeepMind's WaveNet, had to churn through 160,000 samples for just 10 seconds of audio and still struggled to produce coherent speech. The key? Neural audio codecs, which effectively 'tokenize' audio for LLMs. Kyutai's Mimi codec, originally built for Moshi, tackles this, and Sesame's CSM now builds on it. This is how you get audio into LLMs.
The Problem: Raw Audio is Dense
Text models benefit from mature tokenization methods like byte-pair encoding (BPE), which OpenAI has relied on from GPT-2 all the way through GPT-4o. Andrej Karpathy's 2015 char-rnn work already demonstrated impressive results generating code and LaTeX with character-level LSTMs. Audio, however, presents a different challenge. Where Karpathy's RNNs processed a few thousand characters, even a few seconds of audio runs to tens of thousands of samples (16,000 per second at 16 kHz). At that density it is hard for a model to maintain coherence over time, and direct waveform prediction requires immense computational resources.
Consider this WaveNet sample:
Even though the audio sounds natural acoustically, it rarely produces correct English words: modeling audio sample by sample is computationally expensive and struggles with the long-range dependencies that span words and sentences.
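To put the density gap in numbers, here is a rough back-of-the-envelope comparison. The 16 kHz and 44.1 kHz figures are standard sample rates; the words-per-minute and tokens-per-word ratios are ballpark rules of thumb, not anything taken from WaveNet itself.

```python
# Sequence-length gap: text tokens vs. raw audio samples for the same speech.
seconds = 10
words_spoken = 25                        # ~150 words per minute of speech
text_tokens = int(words_spoken * 1.3)    # ~1.3 BPE tokens per English word (rule of thumb)

for sample_rate in (16_000, 44_100):
    samples = sample_rate * seconds
    print(f"{seconds} s of audio at {sample_rate} Hz: {samples:,} samples "
          f"vs. ~{text_tokens} text tokens for the same words")

# At 16 kHz that is the 160,000-sample sequence WaveNet had to model directly.
```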
Neural Audio Codecs: Bridging the Gap
Neural audio codecs solve this by encoding audio into a short sequence of discrete tokens, much as BPE turns text into tokens, which lets an LLM process audio far more efficiently. A codec has two halves: an encoder that compresses raw audio into token IDs drawn from a fixed codebook, and a decoder that reconstructs a waveform from those IDs. Kyutai's Mimi codec is built for exactly this role: it turns 24 kHz speech into a handful of parallel token streams at 12.5 frames per second, a sequence an LLM can handle much like text.
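To make the round trip concrete, here is a minimal sketch against the Hugging Face transformers port of Mimi (the `MimiModel` class and `kyutai/mimi` checkpoint). Treat the exact class names and output attributes as assumptions about that integration rather than a definitive recipe.

```python
# Minimal round trip through a neural audio codec, sketched against the
# Hugging Face transformers port of Mimi ("kyutai/mimi"); exact class and
# attribute names are assumptions about that integration.
import torch
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

# One second of silence as a stand-in for real speech; Mimi expects 24 kHz audio.
sampling_rate = feature_extractor.sampling_rate  # 24_000
waveform = torch.zeros(sampling_rate).numpy()

inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    # Encoder: raw samples -> discrete codes, shape (batch, codebooks, frames).
    codes = model.encode(inputs["input_values"]).audio_codes
    # Decoder: discrete codes -> reconstructed waveform.
    reconstruction = model.decode(codes).audio_values

print(codes.shape)           # ~a dozen frames per codebook for one second of audio
print(reconstruction.shape)  # back to ~24,000 samples
```

At 12.5 frames per second, ten seconds of speech becomes roughly 125 tokens per codebook, orders of magnitude fewer positions than the 160,000 raw samples.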
The core idea: sandwich a language model between the audio encoder and decoder. The encoder turns incoming speech into tokens, the language model predicts how that token sequence continues, and the decoder renders the predicted tokens back into sound. The architecture mirrors text-based LLMs, which predict the next token in a sequence, and it sidesteps the computationally intensive task of modeling raw waveforms sample by sample.
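As a toy illustration of that sandwich, the sketch below stands in random integers for real codec tokens and a two-layer transformer for the LLM. It is a minimal, untrained sketch of next-audio-token prediction, not how Moshi or CSM are actually built (both juggle several parallel codebooks plus text tokens on top of this).

```python
# Toy sketch of the "codec tokens in, codec tokens out" loop.
# The codes here are random stand-ins; a real system would take them from the
# codec encoder and feed predicted codes back into the codec decoder.
import torch
import torch.nn as nn

VOCAB = 2048      # codebook size (illustrative, Mimi-scale)
CONTEXT = 64      # frames of audio history the model sees

class TinyAudioLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(256, VOCAB)

    def forward(self, codes):  # codes: (batch, time)
        x = self.embed(codes)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        x = self.backbone(x, mask=causal_mask)
        return self.head(x)    # logits over the next audio token at each position

model = TinyAudioLM()
history = torch.randint(0, VOCAB, (1, CONTEXT))   # pretend codec tokens
logits = model(history)
next_code = logits[:, -1].argmax(dim=-1)          # greedy pick of the next frame's code
print(next_code.shape)                            # (1,): one new audio token
```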
Practical Implications and Considerations
Using neural audio codecs introduces trade-offs. Encoding is lossy, so the reconstructed signal never matches the original exactly, but the loss is usually acceptable when weighed against the efficiency gains of working with tokens. Latency matters too: a streaming codec cannot emit a token until a full frame of audio has arrived, and the encode and decode passes add further delay that must stay small for interactive use. The right codec depends on the application; some prioritize audio quality, others low latency or computational efficiency.
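A quick frame-rate arithmetic sketch makes that latency floor concrete. The 12.5 Hz figure is Mimi's published frame rate; the 50 Hz codec is a made-up comparison point.

```python
# Rough latency budget: a streaming codec cannot emit a token until a full
# frame of audio has arrived, so frame duration sets a floor on added delay.
frame_rates_hz = {
    "Mimi (Moshi / CSM)": 12.5,        # published Mimi frame rate
    "hypothetical 50 Hz codec": 50.0,  # made-up comparison point
}

for name, rate in frame_rates_hz.items():
    frame_ms = 1000.0 / rate
    tokens_per_10s = rate * 10         # per codebook; multiply by codebook count
    print(f"{name}: {frame_ms:.0f} ms per frame, "
          f"~{tokens_per_10s:.0f} tokens per codebook for 10 s of audio")

# Compare with the raw waveform at 16 kHz: 16_000 * 10 = 160,000 samples for 10 s.
```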
Early speech systems work around the problem by transcribing speech to text, generating a text response, and then running text-to-speech. That cascade discards crucial nuance like tone and emotion. Native speech LLMs, such as Gemini and ChatGPT's Advanced Voice Mode, aim to address this, though they still tend to fall short of their text-based counterparts. Neural audio codecs are a promising route to closing that gap, giving models a representation of speech they can genuinely understand and generate.
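The cascade is easy to caricature in code. All three functions below are hypothetical placeholders, not any real library's API; the point is simply that the language model only ever sees a plain string.

```python
# Why cascaded pipelines lose nuance: the only thing the LLM sees is a
# transcript, so tone, emotion, and timing never reach it.
# All three functions are hypothetical placeholders.

def transcribe(audio: bytes) -> str:      # ASR step
    return "I'm fine."                    # the words survive...

def respond(text: str) -> str:            # text-LLM step
    return "Glad to hear it!"             # ...but the trembling voice didn't

def synthesize(text: str) -> bytes:       # TTS step
    return text.encode()                  # stand-in for generated audio

def cascaded_voice_agent(audio_in: bytes) -> bytes:
    transcript = transcribe(audio_in)     # audio collapsed to text here
    reply = respond(transcript)
    return synthesize(reply)

# A native speech LLM instead passes codec tokens end to end, so prosody can
# influence the response and be reproduced in the output.
print(cascaded_voice_agent(b"\x00" * 16000))
```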
Looking Ahead
As neural audio codecs continue to evolve, expect further gains in audio quality, latency, and computational efficiency. Research is ongoing into codecs that capture more nuanced aspects of human speech, such as emotion and intonation. Tighter integration of codecs with LLMs could unlock new applications in voice assistants, speech recognition, and audio generation; watch for advances in codec architectures and training techniques to drive that progress.