Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study

Cited by: 8
Authors
Mugaanyi, Joseph [1 ]
Cai, Liuying [2 ]
Cheng, Sumei [2 ]
Lu, Caide [1 ]
Huang, Jing [1 ]
Affiliations
[1] Ningbo Univ, Lihuili Hosp, Hlth Sci Ctr, Ningbo Med Ctr, Dept Hepatopancreato Biliary Surg, 1111 Jiangnan Rd, Ningbo 315000, Peoples R China
[2] Shanghai Acad Social Sci, Inst Philosophy, Shanghai, Peoples R China
Keywords
large language models; accuracy; academic writing; AI; cross-disciplinary evaluation; scholarly writing; ChatGPT; GPT-3.5; writing tool; scholarly; academic discourse; LLMs; machine learning algorithms; NLP; natural language processing; citations; references; natural science; humanities; chatbot; artificial intelligence
DOI
10.2196/52935
Chinese Library Classification (CLC)
R19 [Health Care Organization and Services (Health Service Management)]
Subject Classification Code
Abstract
Background: Large language models (LLMs) have gained prominence since the release of ChatGPT in late 2022.
Objective: The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: the natural sciences and humanities.
Methods: Two researchers independently prompted ChatGPT to write an introduction section for a manuscript and include citations; they then evaluated the accuracy of the citations and Digital Object Identifiers (DOIs). Results were compared between the two disciplines.
Results: Ten topics were included: 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, with 55 in the natural sciences and 47 in the humanities. Of these, 40 citations (72.7%) in the natural sciences and 36 citations (76.6%) in the humanities were confirmed to exist (P=.42). Significant disparities were found between the natural sciences and the humanities in DOI presence (39/55, 70.9% vs 18/47, 38.3%) and in DOI accuracy (18/55, 32.7% vs 4/47, 8.5%). DOI hallucination was more prevalent in the humanities (42/47, 89.4%). The Levenshtein distance was significantly higher in the humanities than in the natural sciences, reflecting the lower DOI accuracy.
Conclusions: ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider the strengths and limitations of artificial intelligence writing tools with respect to citation accuracy. The use of domain-specific models may enhance accuracy.
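The Results above score DOI accuracy with the Levenshtein (edit) distance between a model-generated DOI and the verified DOI, where a larger distance means a less accurate DOI. The following minimal Python sketch illustrates how such a per-citation comparison could be computed; the levenshtein helper and the two DOI strings are illustrative placeholders, not the study's data or code.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer sized to the shorter string
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical DOIs for illustration only: one returned by the model,
# one verified against the publisher's record.
generated_doi = "10.1000/journal.2021.0123"
verified_doi = "10.1000/journal.2021.0132"

print(levenshtein(generated_doi, verified_doi))  # 0 would indicate an exact DOI match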
Pages: 7