Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Professional Abstract
Single-cell RNA sequencing (scRNA-seq) data have a rich statistical structure that has motivated sophisticated computational models, notably TranscriptFormer, which uses transformer-based architectures to produce representations of gene expression. These representations, or embeddings, perform well on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. However, the reliance on deep learning raises the question of whether such complex models are necessary when simpler approaches may perform comparably. This study evaluates straightforward, interpretable pipelines built on careful normalization and linear techniques against established benchmarks for single-cell foundation models. The authors report that these simple pipelines achieve state-of-the-art (SOTA) or near-SOTA performance across multiple evaluation metrics, and even outperform existing foundation models on out-of-distribution tasks involving cell types and organisms absent from the training data. These findings underscore the importance of rigorous benchmarking and suggest that much of the biology of cell identity can be captured by linear representations of single-cell gene expression. The work challenges the prevailing notion that deep learning is indispensable for strong performance in scRNA-seq analysis and highlights the potential of more interpretable, computationally efficient methods to contribute meaningfully to our understanding of cellular biology.
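To make the "careful normalization and linear techniques" concrete, here is a minimal sketch of the kind of parameter-free baseline the abstract describes: depth-normalize counts per cell, apply a log transform, standardize genes, and project onto top principal components. This is an illustrative reconstruction under stated assumptions, not the authors' exact pipeline; the function name `simple_embedding` and all parameter choices (target sum, number of components) are hypothetical.

```python
import numpy as np

def simple_embedding(counts, n_components=50, target_sum=1e4):
    """Hypothetical simple baseline: normalization + linear projection.

    counts: (cells x genes) array of raw integer counts.
    Returns a (cells x n_components) linear embedding.
    """
    X = counts.astype(float)
    # Depth normalization: scale each cell to a common total count.
    X = X / X.sum(axis=1, keepdims=True) * target_sum
    # Variance-stabilizing log transform.
    X = np.log1p(X)
    # Per-gene standardization (zero mean, unit variance).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    # Linear projection onto the top principal axes via truncated SVD.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return X @ Vt[:k].T

# Toy usage: 100 cells x 200 genes of simulated counts.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 200))
emb = simple_embedding(counts, n_components=10)
print(emb.shape)  # (100, 10)
```

Because every step is a fixed transform or a linear projection, the resulting embedding is interpretable (each component is a weighted combination of genes) and requires no learned parameters beyond the data itself.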