Authors: Mohammadmahdi Nouriborji, Morteza Rohanian
Developing and providing state-of-the-art efficient speech-to-speech AI models.
Current speech language models deliver strong results but remain too large for many deployment settings. We present TinyWave, a family of 2B-parameter speech-to-speech transformers. A layer-aligned distillation strategy, matching hidden states, attention maps, and softened logits, shrinks model size by 3× while retaining most of the teacher's behavior. Trained on 50k hours of publicly available speech, TinyWave supports (i) speech-only generation with either phonetic or expressive token streams and (ii) mixed speech–text continuations. Compared with its teacher, TinyWave cuts inference latency and memory footprint by 3× while preserving expressive qualities such as prosody, intonation, and speaker-specific traits. On Libri-Light language modeling, TinyWave stays within 1.4 normalized-perplexity points of its teacher; on spoken StoryCloze and SALMon it preserves 93–97% of teacher accuracy and surpasses size-matched baselines. Fine-tuning the interleaved models further yields competitive ASR and TTS performance, demonstrating effective multimodal feature transfer. The models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to facilitate reproducible research on compact, expressive speech generation.
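The layer-aligned distillation objective can be sketched as a weighted sum of three terms: a match on aligned hidden states, a match on aligned attention maps, and a temperature-softened KL term on output logits. The sketch below is a minimal NumPy illustration, not the released training code; the use of MSE for the hidden-state and attention terms, the loss weights, and all function names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student, teacher, T=2.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative layer-aligned distillation loss.

    `student` and `teacher` are dicts with:
      'hidden' : list of aligned hidden-state arrays (one per matched layer)
      'attn'   : list of aligned attention-map arrays
      'logits' : final output logits
    Weights alpha/beta/gamma and temperature T are hypothetical defaults.
    """
    # MSE between aligned hidden states across matched layers
    l_hidden = np.mean([np.mean((s - t) ** 2)
                        for s, t in zip(student['hidden'], teacher['hidden'])])
    # MSE between aligned attention maps
    l_attn = np.mean([np.mean((s - t) ** 2)
                      for s, t in zip(student['attn'], teacher['attn'])])
    # KL divergence between temperature-softened output distributions,
    # rescaled by T^2 as is conventional in logit distillation
    p_t = softmax(teacher['logits'] / T)
    p_s = softmax(student['logits'] / T)
    l_kl = np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))) * T ** 2
    return alpha * l_hidden + beta * l_attn + gamma * l_kl
```

In a real training loop each term would be computed on framework tensors with gradients; the structure of the objective is the same.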
Accuracy (↑) across sStoryCloze, tStoryCloze, sWuggy, and sBlimp.
Examples from our interleaved expressive model.
Model Output
Examples from our speech-to-speech base model.
Model Input
Model Output