Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Abstract
"The paper presents a significant advancement in the field of audio language modeling, addressing the limitations of current models that primarily focus on text-first approaches. These traditional models either extend pre-trained text-based large language models (LLMs) or utilize semantic-only audio tokens, which restricts their ability to perform comprehensive audio modeling. The authors propose a novel framework for native audio foundation models that utilize next-token prediction directly on audio data at scale. This approach aims to jointly model semantic content, acoustic details, and textual information, thereby enhancing both general audio generation and cross-modal functionalities. The research is structured around a systematic empirical study that provides extensive insights into the design and training of these audio models. The authors begin by investigating critical design choices, including the selection of data sources, the ratios of text mixtures, and the composition of tokens. This investigation culminates in a validated training recipe that serves as a foundation for building effective audio models. A pivotal aspect of the study is the introduction of the first scaling law analysis for discrete audio models, conducted through IsoFLOP analysis across 64 models with computational requirements ranging from $3{ imes}10^{18}$ to $3{ imes}10^{20}$ FLOPs. The findings reveal that optimal data size increases at a rate of 1.6 times faster than the optimal model size, providing crucial insights for future model scaling and training efficiency. Building on these empirical findings, the authors introduce SODA (Scaling Open Discrete Audio), a suite of models ranging from 135 million to 4 billion parameters, trained on a dataset comprising 500 billion tokens. The performance of SODA is evaluated against the scaling predictions established earlier in the study, as well as against existing models in the field. The versatility of SODA is highlighted through its application in various audio and text tasks, including a fine-tuning example for voice-preserving speech-to-speech translation, which showcases the model's capability to maintain a unified architecture across different modalities. The significance of this research lies in its potential to redefine audio modeling paradigms by providing a robust framework that integrates audio generation with text and semantic understanding. The insights gained from this study not only advance the state of the art in audio language models but also pave the way for future explorations in cross-modal AI applications, enhancing the interaction between audio and textual data in innovative ways."