
🍓 Ichigo: Rethinking Speech and Language Processing

Modern speech systems typically rely on a pipeline approach: convert audio to text using Automatic Speech Recognition (ASR), process text with a Large Language Model (LLM), and optionally convert the response back to speech via Text-to-Speech (TTS). This pipeline has enabled real-time assistants and transcription tools but remains fundamentally limited in fluency, latency, and robustness.
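The cascaded design described above can be sketched in a few lines. The `asr`, `llm`, and `tts` functions here are hypothetical placeholders standing in for real components, not actual APIs; the point is the structure, where each stage waits on the previous one and any transcription error flows downstream uncorrected.

```python
# Sketch of the conventional cascaded pipeline (placeholder stages,
# not real ASR/LLM/TTS implementations).

def asr(audio: bytes) -> str:
    """Automatic Speech Recognition: audio -> transcript (placeholder)."""
    return "what is the weather"

def llm(prompt: str) -> str:
    """Language model: transcript -> text response (placeholder)."""
    return f"Response to: {prompt}"

def tts(text: str) -> bytes:
    """Text-to-Speech: response text -> audio (placeholder)."""
    return text.encode()

def cascaded_assistant(audio: bytes) -> bytes:
    # Three serial stages: latency adds up, and an ASR mistake
    # distorts everything the LLM and TTS produce afterwards.
    transcript = asr(audio)
    response = llm(transcript)
    return tts(response)
```

This serial hand-off is exactly what a unified token space removes.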

Ichigo proposes a different approach: a unified model where speech and language are processed in a shared token space. Rather than treating speech as an external input that must be converted to text, Ichigo learns to represent speech and language within the same framework, allowing for more seamless and native interactions between spoken and written modalities.
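One common way to realize a shared token space, shown here as an illustrative sketch (the vocabulary and codebook sizes are assumptions, not Ichigo's actual configuration), is to map discrete speech codes into IDs just past the text vocabulary, so a single embedding table and a single sequence can hold both modalities.

```python
# Sketch: speech and text sharing one token ID space.
# Sizes below are illustrative assumptions, not Ichigo's real values.

TEXT_VOCAB_SIZE = 32000      # assumed base LLM text vocabulary size
SPEECH_CODEBOOK_SIZE = 512   # assumed number of discrete speech codes

def speech_code_to_token_id(code: int) -> int:
    """Offset a discrete speech code past the text vocabulary,
    so one embedding table covers both modalities."""
    assert 0 <= code < SPEECH_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def is_speech_token(token_id: int) -> bool:
    return token_id >= TEXT_VOCAB_SIZE

# A single sequence can now interleave text and speech tokens:
mixed = [42, 1337, speech_code_to_token_id(7), speech_code_to_token_id(8)]
```

Because both modalities live in one ID space, the model never has to round-trip through an intermediate text transcript.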

Key Concepts

  1. Speech as a Tokenized Modality

    • Ichigo-ASR does not just transcribe speech—it tokenizes it into a form compatible with LLMs. This enables direct speech-to-response generation without an intermediate text representation.
  2. Early Fusion for Speech-Language Understanding

    • Inspired by multimodal models like Meta’s Chameleon, Ichigo explores early fusion techniques where speech tokens interact directly with text-based LLMs, improving contextual understanding and reducing errors from cascading pipelines.
  3. Lightweight & Efficient Design

    • Unlike monolithic speech models, Ichigo is designed to be lightweight (22M ASR parameters) and modular, making it suitable for real-time applications on edge devices.
  4. Towards End-to-End Conversational AI

    • Ichigo lays the groundwork for a future where LLMs do not simply ‘read’ and ‘write’ but also ‘listen’ and ‘speak’ natively.
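The early-fusion idea in point 2 can be made concrete with a toy example: speech tokens and text tokens are looked up in the same embedding table and fed to the model as one interleaved sequence. Everything here (table sizes, token IDs, dimensions) is an illustrative assumption, not Ichigo's real architecture.

```python
import random

# Toy early-fusion sketch: one embedding table serves both modalities,
# so there is no separate speech encoder bolted onto the LLM.
random.seed(0)
TEXT_VOCAB, SPEECH_CODES, DIM = 100, 20, 4

# IDs [0, 100) are text tokens; IDs [100, 120) are speech tokens.
embedding_table = [[random.random() for _ in range(DIM)]
                   for _ in range(TEXT_VOCAB + SPEECH_CODES)]

def embed(token_ids):
    """Shared lookup: the transformer sees a single fused stream."""
    return [embedding_table[t] for t in token_ids]

# Interleaved text (<100) and speech (>=100) tokens in one sequence:
sequence = [5, 101, 102, 7]
hidden = embed(sequence)  # one stream the model attends over jointly
```

Fusing at the input embedding, rather than stitching model outputs together, is what lets speech context shape the response directly.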

Why This Matters

  • Reduces Latency: Traditional ASR + LLM + TTS stacks introduce unnecessary delays. Ichigo enables more fluid, real-time conversation.
  • Enhances Robustness: Speech recognition errors often distort downstream LLM understanding. Direct speech-to-response generation minimizes error accumulation.
  • Bridges Modalities: Ichigo moves towards a future where spoken and written language processing are no longer separate tasks but part of the same intelligent system.

Ichigo is not just a tool—it is a shift in how we think about speech interaction in AI systems. 🚀
