Measuring the Effect of Mid-2025 LLM Assistance on Novice Performance in Biology
Professional Abstract
"In recent years, large language models (LLMs) have demonstrated remarkable performance on various biological benchmarks, prompting concerns regarding their potential to enable novice users to acquire dual-use laboratory skills. This study aimed to investigate whether LLMs could enhance the performance of novices in a laboratory setting, specifically in tasks that simulate a viral reverse genetics workflow. Conducted as a pre-registered, investigator-blinded, randomized controlled trial from June to August 2025, the research involved 153 participants who were randomly assigned to either an LLM-assisted group or a control group utilizing traditional internet resources. The primary endpoint of the study was the completion rate of the defined workflow, which yielded no significant difference between the two groups (5.2% for the LLM group versus 6.6% for the internet group, P = 0.759). Despite the lack of statistical significance in overall workflow completion, the LLM group exhibited numerically higher success rates in four out of five individual tasks, particularly in the cell culture task, where success rates were 68.8% for the LLM group compared to 55.3% for the internet group (P = 0.059). Further analysis using post-hoc Bayesian modeling of pooled data suggested a potential 1.4-fold increase in success for typical reverse genetics tasks when assisted by LLMs, although the credible interval (95% CrI 0.74-2.62) indicated considerable uncertainty. Ordinal regression modeling revealed that participants in the LLM arm were more likely to progress through intermediate steps across all tasks, with a posterior probability of a positive effect ranging from 81% to 96%. These findings indicate that while mid-2025 LLMs did not significantly enhance novice completion of complex laboratory procedures, they were associated with a modest performance benefit. This study highlights a critical gap between in silico benchmarks and real-world applications, emphasizing the necessity for physical-world validation of AI biosecurity assessments as both model capabilities and user proficiency continue to evolve."