Open-source large language models in action: A bioinformatics chatbot for PRIDE database

Cited by: 5
Authors
Bai, Jingwen [1 ]
Kamatchinathan, Selvakumar [1 ]
Kundu, Deepti J. [1 ]
Bandla, Chakradhar [1 ]
Vizcaino, Juan Antonio [1 ]
Perez-Riverol, Yasset [1 ,2 ]
Affiliations
[1] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, England
[2] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, England
Funding
UK Biotechnology and Biological Sciences Research Council (BBSRC); Wellcome Trust (UK);
Keywords
bioinformatics; dataset discoverability; documentation; large language models; proteomics; public data; software architectures; training; SPECTROMETRY-BASED PROTEOMICS;
DOI
10.1002/pmic.202400005
Chinese Library Classification
Q5 [Biochemistry];
Discipline codes
071010; 081704;
Abstract
Here we present a chatbot assistant infrastructure () that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework additionally includes a benchmark component based on an Elo ranking system, which allows the performance of each LLM to be evaluated and the PRIDE documentation to be improved. The chatbot not only allows users to interact with the PRIDE documentation but can also be used to search for and find PRIDE datasets via an LLM-based recommendation system, improving dataset discoverability. Importantly, while our infrastructure is exemplified through its application to the PRIDE database, the modular and adaptable nature of our approach makes it a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, vector-based retrieval, the benchmarking framework, and optimized documentation collectively forms a robust and transferable chatbot assistant infrastructure. The framework is open-source ().
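The abstract does not detail how the vector-database components retrieve relevant documentation or datasets. A common pattern, sketched below under the assumption that documents and queries have already been embedded into vectors (e.g., by a sentence-embedding model; the toy 2-D vectors here are placeholders), is to rank stored document vectors by cosine similarity to the query vector:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k stored documents most similar to the query.

    doc_vecs maps a document id to its embedding vector.
    """
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy example: "d1" points the same way as the query, "d3" is close, "d2" is orthogonal.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
print(top_k([1.0, 0.0], docs, k=2))  # → ['d1', 'd3']
```

A production vector database would add an approximate-nearest-neighbour index so that search stays fast over large documentation corpora; the linear scan above is only meant to show the ranking principle.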
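The Elo-ranking benchmark mentioned above rates models from pairwise comparisons rather than absolute scores: two LLMs answer the same question, a judge picks the better answer, and the winner gains rating at the loser's expense. The paper's exact implementation is not given here; the following is a minimal sketch of the standard Elo update rule (the K-factor of 32 is a conventional choice, not taken from the paper):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one judged comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_r_a = r_a + k * (score_a - e_a)
    new_r_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_r_a, new_r_b


# Two equally rated models; A's answer is preferred, so 16 points move from B to A.
print(elo_update(1000.0, 1000.0, 1.0))  # → (1016.0, 984.0)
```

Iterating this update over many judged question/answer pairs yields a leaderboard of the candidate LLMs, and questions where every model scores poorly flag documentation passages worth rewriting.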
Pages: 7