Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Abstract
"The paper presents a significant advancement in the field of audio language modeling, addressing the limitations of current models that primarily focus on text-first approaches. These traditional models either extend pre-trained text-based large language models (LLMs) or utilize semantic-only audio tokens, which restricts their ability to perform comprehensive audio modeling. The authors propose a novel framework for native audio foundation models that utilize next-token prediction directly on audio data at scale. This approach aims to jointly model semantic content, acoustic details, and textual information, thereby enhancing both general audio generation and cross-modal functionalities. The research is structured around a systematic empirical study that provides extensive insights into the design and training of these audio models. The authors begin by investigating critical design choices, including the selection of data sources, the ratios of text mixtures, and the composition of tokens. This investigation culminates in a validated training recipe that serves as a foundation for building effective audio models. A pivotal aspect of the study is the introduction of the first scaling law analysis for discrete audio models, conducted through IsoFLOP analysis across 64 models with computational requirements ranging from $3{ imes}10^{18}$ to $3{ imes}10^{20}$ FLOPs. The findings reveal that optimal data size increases at a rate of 1.6 times faster than the optimal model size, providing crucial insights for future model scaling and training efficiency. Building on these empirical findings, the authors introduce SODA (Scaling Open Discrete Audio), a suite of models ranging from 135 million to 4 billion parameters, trained on a dataset comprising 500 billion tokens. The performance of SODA is evaluated against the scaling predictions established earlier in the study, as well as against existing models in the field. The versatility of SODA is highlighted through its application in various audio and text tasks, including a fine-tuning example for voice-preserving speech-to-speech translation, which showcases the model's capability to maintain a unified architecture across different modalities. The significance of this research lies in its potential to redefine audio modeling paradigms by providing a robust framework that integrates audio generation with text and semantic understanding. The insights gained from this study not only advance the state of the art in audio language models but also pave the way for future explorations in cross-modal AI applications, enhancing the interaction between audio and textual data in innovative ways."