TinyWave

Authors: Mohammadmahdi Nouriborji, Morteza Rohanian

Training Efficient Speech-to-Speech Models via Knowledge Distillation

Developing and providing efficient, state-of-the-art speech-to-speech AI models.

Abstract

Current speech language models deliver strong results but remain too large for many deployment settings. We present TinyWave, a family of 2B-parameter speech-to-speech transformers. A layer-aligned distillation strategy, which matches hidden states, attention maps, and softened logits, shrinks model size by 3× while retaining most of the teacher's behaviour. Trained on 50,000 hours of publicly available speech, TinyWave supports (i) speech-only generation with either phonetic or expressive token streams and (ii) mixed speech–text continuations. Distillation also reduces inference latency and memory footprint by 3× relative to the teacher while preserving expressive qualities such as prosody, intonation, and speaker-specific traits. On Libri-Light language modeling, TinyWave stays within 1.4 normalised-perplexity points of its teacher; on spoken StoryCloze and SALMon it preserves 93–97% of teacher accuracy and surpasses size-matched baselines. Fine-tuning the interleaved models further yields competitive ASR and TTS performance, demonstrating effective multimodal feature transfer. The models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to facilitate reproducible research on compact, expressive speech generation.
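
The layer-aligned objective described above (hidden-state, attention-map, and softened-logit matching) can be written as a weighted sum of three terms. The PyTorch sketch below is a minimal illustration under stated assumptions: a fixed student-to-teacher layer map, equal attention-head counts, and illustrative weights and temperature. It is not the released training configuration; see the released code for the exact setup.

```python
# Sketch of a layer-aligned distillation loss (assumptions: PyTorch tensors,
# a fixed student->teacher layer map, equal head counts, illustrative weights).
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits,   # (batch, seq, vocab)
    teacher_logits,   # (batch, seq, vocab)
    student_hidden,   # list of (batch, seq, d_student), one per student layer
    teacher_hidden,   # list of (batch, seq, d_teacher), one per teacher layer
    student_attn,     # list of (batch, heads, seq, seq) attention maps
    teacher_attn,     # list of (batch, heads, seq, seq) attention maps
    layer_map,        # dict: student layer index -> teacher layer index
    proj=None,        # nn.Linear(d_student, d_teacher) if widths differ
    temperature=2.0,  # softening temperature (illustrative value)
    w_logits=1.0, w_hidden=1.0, w_attn=1.0,
):
    # 1) Softened-logit KL: the student matches the teacher's smoothed distribution.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # 2) Hidden-state matching on aligned layers (MSE after optional projection).
    hidden_loss = 0.0
    for s_idx, t_idx in layer_map.items():
        h_s = student_hidden[s_idx]
        h_s = proj(h_s) if proj is not None else h_s
        hidden_loss = hidden_loss + F.mse_loss(h_s, teacher_hidden[t_idx])

    # 3) Attention-map matching on the same aligned layers.
    attn_loss = 0.0
    for s_idx, t_idx in layer_map.items():
        attn_loss = attn_loss + F.mse_loss(student_attn[s_idx], teacher_attn[t_idx])

    n = max(len(layer_map), 1)
    return w_logits * kl + w_hidden * hidden_loss / n + w_attn * attn_loss / n
```

Any monotone layer map fits this sketch; for example, when the student is roughly a third of the teacher's depth, each student layer could be aligned with every third teacher layer.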

Model Architecture

Spoken‑LM Benchmarks

Accuracy (↑) across sStoryCloze, tStoryCloze, sWuggy, and sBlimp.
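
Spoken-LM benchmarks of this kind are typically scored pairwise: the model is counted correct when it assigns a higher likelihood to the real word, sentence, or continuation than to a matched distractor. The sketch below illustrates that scoring scheme; `log_likelihood` is a hypothetical helper returning the summed token log-probability of a discrete speech-token sequence, not part of our released evaluation scripts.

```python
# Pairwise-likelihood accuracy, as commonly used for sWuggy, sBlimp, and
# sStoryCloze-style benchmarks: correct if the positive sequence scores higher.
def pairwise_accuracy(model, pairs, log_likelihood):
    correct = 0
    for positive_tokens, negative_tokens in pairs:
        if log_likelihood(model, positive_tokens) > log_likelihood(model, negative_tokens):
            correct += 1
    return correct / len(pairs)
```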

TinyWave Expressive

Examples from our interleaved expressive model.

Model Output

TinyWave Base

Examples from our speech-to-speech base model.

Model Input

Model Output