Can artificial intelligence generate scientific discussion that passes peer review for publication in a high-impact orthopaedic journal?
Academic Article
Overview
abstract
BACKGROUND: There is huge interest in the use of artificial intelligence (AI) in the production and assessment of academic material; however, the role of AI remains unclear. AIM: The purpose of this study was to perform a reviewer-blinded assessment of the quality of scientific discussion generated by an advanced AI language model (ChatGPT-4, Open AI) and determine whether this could be recommended for high-impact journal publication. METHODS: The introduction, methods and results sections of a recently published article from a high-impact journal were input into a current AI model. The AI application then produced a discussion and conclusion based on the provided text using a standardized prompt. Six experienced blinded reviewers scored all five sections of the hybrid article. A one-way analysis of variance (ANOVA) was used to assess significant differences between scores of each section. Reviewers recommended a decision regarding the suitability of the article for publication. RESULTS: AI composed a scientific discussion and conclusion. The median score was 80 (IQR 70-90) for introduction, 77.5 (IQR 70-90) for methods, 82.5 (IQR 50-90) for results, 60 (IQR 40-75) for discussion and 60 (IQR 40-80) for the conclusion. The median scores for the AI-generated sections were non-significantly lower than other sections (p = 0.37). The majority of reviewers (5/6, 83%) recommended "acceptance for publication after major revision". One reviewer recommended "resubmission with no guarantee of acceptance". There were no recommendations for rejection. CONCLUSION: Current AI large language models are now capable of generating content that passes experienced peer review and is acceptable for publication in a high-impact orthopaedic journal, after revision. There are still many concerns regarding the integration of AI into the process of scientific writing, mainly the tendency of AI to rely on advanced pattern recognition and fabricated or inadequate references. LEVEL OF EVIDENCE: Level IV.