Synthetic-Powered Multiple Testing with FDR Control
Professional Abstract
"The paper presents SynthBH, a multiple hypothesis testing procedure that integrates synthetic data to strengthen statistical inference while controlling the false discovery rate (FDR). Multiple hypothesis testing is central to fields such as genomics, drug screening, and outlier detection, where researchers must draw reliable inferences from a large number of hypotheses. Traditional FDR-controlling methods can be limited when only real experimental data is available, especially when the number of tests is large and the signal-to-noise ratio is low. SynthBH addresses this limitation by incorporating auxiliary or synthetic data, which can come from previous experiments or be produced by generative models. This lets researchers use additional information that is not directly observable in their current dataset. The authors establish that SynthBH guarantees finite-sample, distribution-free FDR control under a mild positive dependence condition known as PRDS (Positive Regression Dependence on Subsets). Notably, this guarantee does not require the pooled-data p-values to be valid under the null hypothesis. A key feature of SynthBH is that it adapts to the quality of the synthetic data: it improves sample efficiency and can increase statistical power when the synthetic data is of high quality, and it still controls the FDR at the user-specified level when the synthetic data is of poor quality. The empirical performance of SynthBH is demonstrated on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations.
These applications illustrate the practical use of SynthBH on real datasets. The authors also run controlled experiments on simulated data to study the properties of SynthBH under different testing conditions. Overall, SynthBH gives researchers a multiple testing tool for high-dimensional settings that can draw on synthetic data while keeping the FDR controlled at the chosen level."
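
For background, the FDR-controlling step that procedures of this kind are compared against can be sketched in a few lines. The block below is a minimal sketch of the classical Benjamini-Hochberg (BH) step-up procedure applied to p-values from real data only; it is not SynthBH, whose construction is not specified in this abstract. The function name `benjamini_hochberg`, the default level `alpha=0.1`, and the example p-values are illustrative assumptions.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.1):
    """Classical BH step-up procedure (baseline only, NOT SynthBH).

    Returns a boolean mask marking the rejected hypotheses.
    The name and default alpha are illustrative assumptions.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject

    order = np.argsort(p)                       # indices that sort the p-values
    sorted_p = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m  # BH line: alpha * k / m

    below = sorted_p <= thresholds
    if below.any():
        # Step-up rule: find the largest k with p_(k) <= alpha * k / m
        # and reject the k smallest p-values, even those above their own threshold.
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject


# Made-up p-values for illustration; only the third-smallest meets its threshold,
# yet the step-up rule rejects all three smallest.
example_p = [0.02, 0.9, 0.035, 0.03]
example_mask = benjamini_hochberg(example_p, alpha=0.05)
```

Under the PRDS condition mentioned above, classical BH controls the FDR at the chosen level when every null p-value is valid. According to the abstract, SynthBH retains FDR control without requiring validity of the pooled-data p-values.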