Measuring the Effect of Mid-2025 LLM Assistance on Novice Performance in Biology
Professional Abstract
"In recent years, large language models (LLMs) have demonstrated remarkable performance on various biological benchmarks, prompting concerns regarding their potential to enable novice users to acquire dual-use laboratory skills. This study aimed to investigate whether LLMs could enhance the performance of novices in a laboratory setting, specifically in tasks that simulate a viral reverse genetics workflow. Conducted as a pre-registered, investigator-blinded, randomized controlled trial from June to August 2025, the research involved 153 participants who were randomly assigned to either an LLM-assisted group or a control group utilizing traditional internet resources. The primary endpoint of the study was the completion rate of the defined workflow, which yielded no significant difference between the two groups (5.2% for the LLM group versus 6.6% for the internet group, P = 0.759). Despite the lack of statistical significance in overall workflow completion, the LLM group exhibited numerically higher success rates in four out of five individual tasks, particularly in the cell culture task, where success rates were 68.8% for the LLM group compared to 55.3% for the internet group (P = 0.059). Further analysis using post-hoc Bayesian modeling of pooled data suggested a potential 1.4-fold increase in success for typical reverse genetics tasks when assisted by LLMs, although the credible interval (95% CrI 0.74-2.62) indicated considerable uncertainty. Ordinal regression modeling revealed that participants in the LLM arm were more likely to progress through intermediate steps across all tasks, with a posterior probability of a positive effect ranging from 81% to 96%. These findings indicate that while mid-2025 LLMs did not significantly enhance novice completion of complex laboratory procedures, they were associated with a modest performance benefit. This study highlights a critical gap between in silico benchmarks and real-world applications, emphasizing the necessity for physical-world validation of AI biosecurity assessments as both model capabilities and user proficiency continue to evolve."