Reinforced Fast Weights with Next-Sequence Prediction
Abstract
The paper introduces REFINE (Reinforced Fast weIghts with Next sEquence prediction), a framework for improving long-context modeling in fast weight architectures. Fast weight architectures maintain constant memory overhead regardless of context length, a significant advantage over attention-based transformers. However, the authors identify a critical limitation of the prevailing next-token prediction (NTP) training paradigm: it optimizes single-token predictions and ignores the semantic coherence spanning the multiple tokens that follow a prefix, yielding representations that fail to capture long-range dependencies. REFINE addresses this with a reinforcement learning strategy that shifts the training objective from NTP to next-sequence prediction (NSP). The model selects informative token positions by prediction entropy, generates multi-token rollouts from those positions that better reflect the contextual relationships in the data, assigns self-supervised sequence-level rewards to guide learning, and is optimized with group relative policy optimization (GRPO). The framework applies throughout the training lifecycle of pre-trained language models, including mid-training, post-training, and test-time training. Experiments on LaCT-760M and DeltaNet-1.3B show that REFINE consistently outperforms NTP-based supervised fine-tuning on needle-in-a-haystack retrieval, long-context question answering, and the diverse benchmarks of LongBench. These findings establish REFINE as an effective framework for enhancing long-context modeling in fast weight architectures.
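The entropy-based position selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`token_entropies`, `select_rollout_positions`) and the top-k selection rule are assumptions for exposition; the abstract specifies only that positions are chosen by prediction entropy.

```python
import numpy as np

def token_entropies(logits):
    """Per-position prediction entropy from a (seq_len, vocab_size) logit matrix."""
    # Softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Shannon entropy of each position's next-token distribution.
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def select_rollout_positions(logits, k):
    """Pick the k positions where the model is most uncertain (hypothetical
    top-k rule); these prefixes would seed the multi-token rollouts."""
    ent = token_entropies(logits)
    return np.argsort(ent)[-k:][::-1]  # highest-entropy positions first
```

Each selected position would then serve as a prefix for sampling a group of multi-token rollouts, which receive self-supervised sequence-level rewards before the GRPO update.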