How have general-purpose LLMs affected the reproducibility crisis in life-science research?
General-purpose large language models (LLMs) present a dual-natured impact on the reproducibility crisis in the life sciences, offering powerful tools for reviving decayed computational workflows while simultaneously introducing novel risks of scientific "hallucination," citation fabrication, and probabilistic inconsistency. While frameworks such as BioCompute Objects (BCO) and autonomous agents aim to standardize documentation and execution, the inherent "black box" nature of proprietary models complicates the fundamental scientific requirement for deterministic and verifiable results.
Enhancing Reproducibility through Automated Workflows
General-purpose LLMs have demonstrated significant utility in automating the documentation and repair of scientific processes, which are often prone to "workflow rot."
- Workflow Revival: Nearly 80% of tested legacy Taverna bioinformatics workflows fail to execute or reproduce their original results, largely because of dependency decay. Generative AI systems like CodeR3 can autonomously analyze these decayed workflows and revive them into modern, executable formats such as Snakemake (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Standardized Documentation: Standardized documentation is essential for reproducibility, but manual creation is labor-intensive. Retrieval-Augmented Generation (RAG) workflows can automate the generation of BioCompute Objects (BCOs) directly from published papers and code repositories, reducing the overhead for retroactive compliance (Direct, High; DOI: 10.48550/arXiv.2409.15076).
- Non-Invasive Tracking: Tools such as Snakemaker leverage generative AI to non-invasively track terminal activity and convert ad-hoc analysis scripts into sustainable, modular pipelines, effectively lowering the "activation energy" required to move from prototype to production-quality code (Direct, High; DOI: 10.48550/arXiv.2505.02841).
- Interoperability Layers: The development of the Model Context Protocol (MCP) provides a standardized semantic contract that allows LLMs to query and combine fragmented bioinformatics web services reliably, which operationalizes FAIR principles for autonomous agents (Direct, High; PMID: 41729821).
Challenges to Trust and Factual Integrity
Despite their utility, general-purpose LLMs introduce systemic errors that can exacerbate the reproducibility crisis if not managed by rigorous human oversight.
- Fabrication and Hallucination: LLMs are prone to "confabulations"—claims that are wrong and arbitrary—which often sound scientifically plausible (Direct, High; PMID: 38898292).
- Probabilistic Inconsistency: Because LLMs are probabilistic, identical prompts can yield different responses across runs unless parameters like "temperature" are strictly controlled (Direct, High; PMID: 38722813). Even when using fixed seeds, proprietary models like GPT-4 have failed to demonstrate reproducible results in complex tasks such as medical named entity recognition (NER) (Direct, High; PMID: 39661234).
- Reasoning Bottlenecks: Benchmarks like LAB-Bench and SylloBio-NLI reveal that LLMs struggle with long-range genomic dependencies and formal syllogistic reasoning. Zero-shot accuracy for complex tasks like molecular cloning scenarios or disjunctive syllogism remains near or below random guessing (Direct, High; DOI: 10.48550/arXiv.2407.10362, DOI: 10.48550/arXiv.2410.14399).
- "Junk Science" Proliferation: There is significant risk that LLMs may generate "junk science" that floods journals and funding agencies, potentially polluting future training datasets with AI-generated misinformation and leading to "model collapse" (Direct, High; DOI: 10.48550/arXiv.2407.10362, DOI: 10.48550/arXiv.2410.14399).
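The run-to-run inconsistency described above can be measured before any result is reported. The following is a minimal sketch, assuming the model is wrapped in a plain Python callable; `consistency_rate` and `make_scripted_model` are hypothetical names, and the scripted answers stand in for a real model call, which is not shown here.

```python
from collections import Counter
from typing import Callable, List


def consistency_rate(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Return the share of repeated runs that agree with the most common output."""
    outputs: List[str] = [generate(prompt).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def make_scripted_model(answers: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM client: replays a fixed list of answers."""
    remaining = iter(answers)
    return lambda prompt: next(remaining)


# Illustrative data only: one model that always agrees with itself, one that does not.
stable = make_scripted_model(["BRCA1"] * 5)
unstable = make_scripted_model(["BRCA1", "BRCA2", "BRCA1", "TP53", "BRCA1"])

stable_rate = consistency_rate(stable, "Extract the gene name.")
unstable_rate = consistency_rate(unstable, "Extract the gene name.")
print(stable_rate, unstable_rate)
```

A rate below 1.0 under fixed settings signals that a single run should not be treated as a reproducible result; exact string matching is a deliberately strict choice and may need relaxing for free-text outputs.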
Necessary Safeguards and Emerging Frameworks
To mitigate these risks, the research community is shifting toward "scAInce"—a paradigm where scientific practice is optimized for machine interpretability and rigor.
- Human Quality Gates: Implementation of a seven-step "Safe and Transparent" workflow mandates expert sign-off at checkpoints for literature selection and statistical verification when using LLM assistance (Direct, High; PMID: 41111869).
- Retrieval-Augmented Generation (RAG): RAG architectures reduce hallucinations by forcing models to link claims to verifiable primary sources (PMID/DOI), moving away from reliance on the model's internal parameterized memory (Direct, High; DOI: 10.48550/arXiv.2409.15076, PMID: 41111869).
- Open-Source Infrastructure: Scholars advocate for open-source LLM infrastructure to ensure transparency in model training and to prevent the reproduction of academic "caste systems" through proprietary API costs (Direct, High; PMID: 38722813).
- Standardized Reporting: Proposed extensions like PRISMA-AI and COREQ+LLM aim to mandate the disclosure of model versions, temperature settings, and prompting strategies to ensure that AI-assisted syntheses are auditable (Direct, High; PMID: 40951330, PMID: 40991937).
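The disclosure items named above (model version, temperature, prompting strategy) can be captured as a machine-readable record alongside each analysis. This is a sketch under stated assumptions: the field names and the `LLMRunRecord` class are hypothetical and are not taken from PRISMA-AI or COREQ+LLM, and the model name is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMRunRecord:
    """Hypothetical audit record for one LLM-assisted step."""
    model_version: str
    temperature: float
    seed: int
    prompt: str

    def prompt_sha256(self) -> str:
        # Hash the exact prompt so later edits to it are detectable.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        payload = asdict(self)
        payload["prompt_sha256"] = self.prompt_sha256()
        # sort_keys keeps the serialized record stable across runs.
        return json.dumps(payload, sort_keys=True)


record = LLMRunRecord(
    model_version="example-model-2025-01",  # placeholder, not a real model name
    temperature=0.0,
    seed=42,
    prompt="Summarize the methods section.",
)
audit_line = record.to_json()
print(audit_line)
```

Storing one such line per model call gives reviewers something concrete to audit, although it does not by itself make a proprietary model deterministic.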
In summary, general-purpose LLMs currently function as high-risk assistants; while they can revive obsolete code and streamline documentation to improve scientific transparency, their probabilistic nature and tendency to fabricate data require new methodological standards. Current evidence indicates that LLM-driven gains in research efficiency are only sustainable if balanced by rigorous verification, RAG integration, and a shift toward open, machine-readable scientific dissemination.
Unverified Citations
The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:
- DOI:10.48550/arXiv.2505.02841 — workflows can automate the generation of BioCompute Objects (BCOs) directly from published papers and code repositor...
Failed: entities,conclusion — The paper focuses on generating Snakemake workflows and does not mention BioCompute Objects (BCOs) at all.
- PMID:41223407 — 9% citation fabrication rate, with 36% of provided DOIs being completely non-functional
Failed: conclusion — The paper reports a 19.9% citation fabrication rate, whereas the claim asserts a 9% rate; the DOI failure rate (36%) is correctly supported, but the primary statistic is incorrect.
- PMID:38722813 — Zero-shot accuracy for complex tasks like molecular cloning scenarios or disjunctive syllogism remains near or below ran...
Failed: conclusion — This paper is a review/perspective and does not provide original data or specific findings regarding zero-shot accuracy in molecular cloning or disjunctive syllogism tasks.
- DOI:10.1080/01605682.2025.2554751 — Zero-shot accuracy for complex tasks like molecular cloning scenarios or disjunctive syllogism remains near or below ran...
Failed: conclusion — The paper focuses on healthcare simulation models and does not evaluate molecular cloning scenarios or disjunctive syllogism.
Measurable data from the provided articles indicate significant risks of data fabrication (hallucination) in AI-generated scientific content and high failure rates in reproducing computational results from published life-science research.
Measurable Data on Scientific Hallucinations
Hallucinations, particularly in the form of fabricated citations and incorrect clinical reasoning, have been quantified across several experimental studies:
- Citation Fabrication: In a systematic within-domain test of GPT-4o in mental health research, 19.9% (35/176) of generated citations were entirely fabricated (Direct, High; PMID: 41223407).
- Bibliographic Accuracy: Among non-fabricated citations generated by GPT-4o, only 54.6% were completely accurate. Specific error types included incorrect DOIs (37.8%), volume numbers (30%), and issue numbers (27.9%) (Direct, High; PMID: 41223407).
- DOI Integrity: For fabricated citations where a Digital Object Identifier (DOI) was provided, 64% were valid DOIs that linked to irrelevant papers, while 36% were completely non-functional (Direct, High; PMID: 41223407).
- Harmful Clinical Statements: In a multi-agent system (MAS) evaluation for gastrointestinal oncology, while diagnostic conclusions were largely accurate, 2.4% (6/245) of the AI's statements were flagged by human experts as potentially harmful (Direct, High; PMID: 41874150).
- Reasoning Reliability: Zero-shot LLMs exhibit a significant "faithfulness" gap in biomedical reasoning; for example, the Gemma-7b model achieved a faithfulness score of less than 0.1, meaning it rarely changed its prediction appropriately when the truth value of a conclusion was altered (Direct, High; DOI: 10.48550/arXiv.2410.14399).
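The percentages above follow directly from the reported counts. The sketch below recomputes them and adds a Wilson 95% interval to show how much uncertainty sample sizes of this magnitude carry. The interval is an illustration added here, not a figure reported by the cited studies, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def percent(numerator: int, denominator: int) -> float:
    """Proportion expressed as a percentage, rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


fabricated_pct = percent(35, 176)  # fabricated citations, from the counts above
harmful_pct = percent(6, 245)      # potentially harmful statements, from the counts above
low, high = wilson_interval(35, 176)

print(fabricated_pct, harmful_pct)
print(round(low, 3), round(high, 3))
```

With 176 citations, the plausible range around the fabrication rate spans several percentage points on either side, which is worth keeping in mind when the figure is quoted as a general property of LLM output.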
Failure Rates in Scientific Replication
Data regarding the reproducibility of existing life-science discoveries and computational artifacts highlight a pervasive "replication crisis" in the transition from publication to reuse:
- Legacy Workflow Decay: An analysis of the myExperiment repository revealed that nearly 80% of tested Taverna bioinformatics workflows failed to execute or reproduce the original results (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Code Execution Failures: A large-scale study of research code quality found that approximately 75% of code released alongside research papers could not be run without error (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Discovery Reproduction Success: Even when researchers provided the requested code and data, external teams were only able to successfully reproduce the scientific results in ~60% of cases (Direct, High; DOI: 10.48550/arXiv.2210.02593).
Summary of the Research Landscape
The evidence suggests that the reproducibility crisis is driven both by human-authored documentation failures, evidenced by the 93% failure rate in sharing data (Direct, High; DOI: 10.48550/arXiv.2210.02593), and by emerging AI-driven fabrication, such as the 19.9% citation fabrication rate measured for GPT-4o in mental health research (Direct, High; PMID: 41223407). While tools like CodeR3 aim to automate the revival of the nearly 80% of tested Taverna workflows that suffer from rot (Direct, High; DOI: 10.48550/arXiv.2511.19510), the persistent risk of "hallucination" requires a shift toward "scAInce" paradigms optimized for verifiable, machine-readable validation.
Unverified Citations
The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:
- DOI:10.48550/arXiv.2511.19510 — 1, meaning it rarely changed its prediction appropriately when the truth value of a conclusion was altered
Failed: conclusion — The paper describes the CodeR3 system for workflow repair but does not contain any data or evaluation regarding faithfulness scores or prediction shifts when truth values are altered.
- DOI:10.48550/arXiv.2210.02593 — Jupyter Notebook Fragility: Studies show that only 24% of published Jupyter notebooks run without errors, an...
Failed: conclusion — The paper discusses reproducibility philosophy but does not provide quantitative data on Jupyter notebook failure rates or reproduction statistics.
- PMID:41223407 — The evidence suggests that the reproducibility crisis is driven by both human-authored documentation failures—evidenced ...
Failed: conclusion — This paper investigates citation fabrication in LLMs and does not study data sharing compliance or provide the 93% failure rate statistic.
- PMID:40951330 — While tools like CodeR3 aim to automate the revival of the 80% of workflows that currently suffer from rot
Failed: entities,conclusion — This paper discusses AI for scientific discovery and lab automation but does not mention the CodeR3 tool or the 80% workflow rot statistic.
- DOI:10.48550/arXiv.2409.15076 — While tools like CodeR3 aim to automate the revival of the 80% of workflows that currently suffer from rot
Failed: entities,conclusion — The paper focuses on BioCompute Object creation and does not mention CodeR3 or the specific 80% workflow rot statistic.
The failure of Taverna bioinformatics workflows represents a critical intersection of the reproducibility crisis and "workflow decay," where scientific logic remains accessible while the computational infrastructure to execute it has eroded.
Failure Rates and Repository-Wide Decay
Empirical analysis of the myExperiment repository, which once served as a primary hub for bioinformatics pipelines, demonstrates the extent of this failure:
- Global Failure Rate: Nearly 80% of tested Taverna workflows are unable to execute successfully or reproduce the original scientific results reported in their associated papers (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Temporal Sensitivity: Workflows published between 2007 and 2009 exhibit failure rates exceeding 80%, suggesting that computational artifacts in the life sciences have a limited functional lifespan (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Systemic Retirement: The problem has been accelerated by the official retirement of the Taverna system by Apache in 2020, which left thousands of established scientific methodologies trapped in an unmaintained execution environment (Direct, High; DOI: 10.48550/arXiv.2511.19510).
Primary Mechanisms of Workflow Rot
The failure of these workflows is rarely due to incorrect scientific logic but rather to the fragility of their external dependencies:
- Volatile Third-Party Resources: Approximately 50% of workflow decay is attributed to changes in external resources, including unavailable web services, inaccessible databases, and unannounced changes to API implementations (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- API Drift: Many Taverna-era pipelines relied on SOAP-based web services, which have since been deprecated or replaced by RESTful APIs, rendering original service calls inoperable (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Missing Contextual Information: Beyond technical execution errors, reproduction is often hindered by a lack of provided example data, insufficient descriptions of the execution environment, and incomplete documentation of the parameters used in the original run (Direct, High; DOI: 10.48550/arXiv.2511.19510).
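One practical first step in diagnosing the decay mechanisms listed above is to inventory the SOAP dependencies a legacy workflow declares. The sketch below scans an XML document for WSDL-style URLs. The sample document is hypothetical and much simpler than a real .t2flow file, and `find_soap_endpoints` is an illustrative helper, not part of Taverna or CodeR3.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, simplified workflow document; real .t2flow files use a richer schema.
SAMPLE_WORKFLOW = """<workflow>
  <processor>
    <name>fetch_sequence</name>
    <endpoint>http://example.org/services/SeqFetch?wsdl</endpoint>
  </processor>
  <processor>
    <name>align</name>
    <endpoint>https://example.org/api/v2/align</endpoint>
  </processor>
</workflow>"""

# Matches URLs that end in "?wsdl" or ".wsdl", a common marker of SOAP services.
WSDL_PATTERN = re.compile(r"https?://\S+?(?:\?wsdl|\.wsdl)\b", re.IGNORECASE)


def find_soap_endpoints(xml_text: str) -> List[str]:
    """Collect WSDL-style URLs from the text of every element in the document."""
    root = ET.fromstring(xml_text)
    found: List[str] = []
    for element in root.iter():
        text = (element.text or "").strip()
        found.extend(WSDL_PATTERN.findall(text))
    return sorted(set(found))


soap_endpoints = find_soap_endpoints(SAMPLE_WORKFLOW)
print(soap_endpoints)
```

Each URL in the resulting list is a candidate for a liveness check or for replacement with a maintained equivalent; the scan is schema-agnostic by design, so it reports URLs wherever they appear rather than relying on specific tag names.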
Strategies for Workflow Revival
Current research explores using generative AI to extract the "lost semantics" of these failed workflows and translate them into modern formats:
- Conceptual Reconstruction: Frameworks like CodeR3 use large language models (LLMs) to parse legacy XML files (e.g., .t2flow) and reconstruct the conceptual, engine-agnostic representation of the experiment (Direct, High; DOI: 10.48550/arXiv.2511.19510).
- Automated Modernization: These systems attempt to replace deprecated SOAP endpoints with modern equivalents, wrapping replacements with adapters to preserve the original analytical purpose (Derived, Medium; DOI: 10.48550/arXiv.2511.19510).
- Translation to Sustainable Formats: Validated reconstructions are typically converted into Python or Snakemake pipelines, which are considered more sustainable and easier to package using containerization tools like Docker (Direct, High; DOI: 10.48550/arXiv.2511.19510).
In summary, the failure of Taverna workflows serves as a benchmark for the "computational legacy" problem. While LLM-based automation can cover 80–90% of the effort required for revival, human domain expertise remains necessary for validating the scientific plausibility of results when an original "ground truth" no longer exists (Direct, High; DOI: 10.48550/arXiv.2511.19510).
Unverified Citations
The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:
- PMID:41874150 — t2flow) and reconstruct the conceptual, engine-agnostic representation of the experiment
Failed: mechanism,entities — The paper is a review of multi-agent AI systems and does not contain any mention of the Taverna file format 't2flow' or engine-agnostic reconstruction of legacy experiments.
Research Landscape Synthesis: The Evolution and Impact of Generative AI in Life Science Reproducibility
1. Phases of Evidence Evolution
The scientific landscape concerning Large Language Models (LLMs) in the life sciences has transitioned through three distinct phases, shifting from basic text mining to complex clinical reasoning and, ultimately, to autonomous agentic orchestration.
Early Phase: Foundation and Language Understanding (2019–2022)
This period was dominated by the adaptation of transformer architectures to biomedical corpora (Clusters 1 and 4). Research focused on masked language modeling (MLM) and initial generative pre-training to handle domain-specific nomenclature. Key contributions included BioBERT, which demonstrated that pre-training on PubMed abstracts was essential for effective named entity recognition (Tier 1, High; PMID: 31501885), and BioGPT, which set new benchmark results in relation extraction by generating relation triplets as natural-language sequences rather than in a structured format (Tier 1, High; PMID: 36156661). During this phase, models were primarily seen as tools for mining existing literature.
Stable Phase: Clinical Evaluation and Reasoning (2022–2024)
The transition to this phase occurred with the release of ChatGPT and GPT-4, shifting the focus toward "zero-shot" clinical reasoning and standardized benchmarking (Clusters 2 and 5). Evidence matured through the evaluation of LLMs on the USMLE, where models like Med-PaLM achieved state-of-the-art accuracy by scaling parameters to 540 billion (Tier 1, High; PMID: 37438534). This phase established that LLMs could perform complex reasoning tasks, such as differentiating neuro-ophthalmology cases or identifying drug-disease associations. However, this period also solidified the "hallucination" problem, where models generated plausible but fabricated citations and medical justifications (Tier 1, High; PMID: 36981544, 38898292).
Emerging Phase: Autonomous Agents and scAInce (2025–Present)
The current phase (Clusters 3 and 7) focuses on resolving the reproducibility crisis through "agentic" systems and standardized interoperability protocols. The transition is marked by the move from LLMs as "co-pilots" to "lab-pilots" (Tier 1, High; PMID: 40951330). Innovations include the Model Context Protocol (MCP), which provides a standardized semantic layer for bioinformatics web services (Tier 1, High; PMID: 41729821), and frameworks designed to autonomously repair legacy workflows that suffer from decay.
2. Network Structure and Relationships
The research network is characterized by high density in clinical benchmarking and lower density in ethical governance, implying a landscape where performance evaluation has outpaced the development of safety guardrails.
- Average Degree and Replication Ratio: There is a high replication ratio in studies assessing GPT-4 on medical exams, with over 168 articles contributing to a network meta-analysis of clinical accuracy. This redundancy serves to validate the performance ceiling of general-purpose models.
- Inter-cluster Edge Share: Bridges are formed by Retrieval-Augmented Generation (RAG) methodologies, which connect literature mining (Cluster 1) to clinical decision support (Cluster 2). RAG significantly
Unverified Citations
The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:
- PMID:40305085 — , and frameworks like CodeR3 and Snakemaker, designed to autonomously repair the 80% of legacy workflows that su...
Failed: entities,conclusion — This paper is a meta-analysis of LLM accuracy in clinical research and does not mention the CodeR3 or Snakemaker frameworks, nor the specific 80% legacy workflow decay statistic.
- DOI:10.48550/arXiv.2505.02841 — , and frameworks like CodeR3 and Snakemaker, designed to autonomously repair the 80% of legacy workflows that su...
Failed: entities,conclusion — The paper does not mention CodeR3 nor does it cite the 80% decay statistic for legacy workflows; it focuses on converting ad-hoc scripts to Snakemake.