Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Professional Abstract
Single-cell RNA sequencing (scRNA-seq) data have a rich statistical structure that has motivated sophisticated computational models, notably TranscriptFormer, which uses transformer-based architectures to produce representations of gene expression. These representations, or embeddings, perform well on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. However, the reliance on deep learning raises the question of whether such complex models are necessary when simpler approaches may perform comparably. This study evaluates straightforward, interpretable pipelines built on careful normalization and linear techniques against established benchmarks for single-cell foundation models. The authors report that these simple pipelines achieve state-of-the-art (SOTA) or near-SOTA performance across multiple evaluation metrics, and even outperform existing foundation models on out-of-distribution tasks involving cell types and organisms absent from the training data. These findings underscore the importance of rigorous benchmarking and suggest that much of the biology of cell identity can be captured by linear representations of single-cell gene expression. The work challenges the prevailing notion that deep learning is indispensable for strong performance in scRNA-seq analysis and highlights the potential of more interpretable, computationally efficient methods to contribute meaningfully to our understanding of cellular biology.
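To make the "careful normalization and linear techniques" concrete, here is a minimal sketch of the kind of parameter-free baseline the abstract describes: depth-normalize counts per cell, apply a log transform, standardize genes, and project onto top principal components. This is an illustrative reconstruction under stated assumptions, not the authors' exact pipeline; the function name `simple_embedding` and all parameter choices (target sum, number of components) are hypothetical.

```python
import numpy as np

def simple_embedding(counts, n_components=50, target_sum=1e4):
    """Hypothetical simple baseline: normalization + linear projection.

    counts: (cells x genes) array of raw integer counts.
    Returns a (cells x n_components) linear embedding.
    """
    X = counts.astype(float)
    # Depth normalization: scale each cell to a common total count.
    X = X / X.sum(axis=1, keepdims=True) * target_sum
    # Variance-stabilizing log transform.
    X = np.log1p(X)
    # Per-gene standardization (zero mean, unit variance).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    # Linear projection onto the top principal axes via truncated SVD.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return X @ Vt[:k].T

# Toy usage: 100 cells x 200 genes of simulated counts.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 200))
emb = simple_embedding(counts, n_components=10)
print(emb.shape)  # (100, 10)
```

Because every step is a fixed transform or a linear projection, the resulting embedding is interpretable (each component is a weighted combination of genes) and requires no learned parameters beyond the data itself.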