Open-source large language models in action: A bioinformatics chatbot for PRIDE database

Cited by: 5
Authors
Bai, Jingwen [1]
Kamatchinathan, Selvakumar [1]
Kundu, Deepti J. [1]
Bandla, Chakradhar [1]
Vizcaino, Juan Antonio [1]
Perez-Riverol, Yasset [1,2]
Affiliations
[1] European Mol Biol Lab European Bioinformat Inst EM, Wellcome Trust Genome Campus, Cambridge, England
[2] European Mol Biol Lab European Bioinformat Inst EM, Wellcome Trust Genome Campus, Cambridge CB10 1SD, England
Funding
Biotechnology and Biological Sciences Research Council (UK); Wellcome Trust (UK)
Keywords
bioinformatics; dataset discoverability; documentation; large language models; proteomics; public data; software architectures; training; spectrometry-based proteomics
DOI
10.1002/pmic.202400005
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
We present a chatbot assistant infrastructure that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework further includes a benchmark component based on an Elo ranking system, which is used to evaluate the performance of each LLM and to improve the PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively forms a robust and transferable chatbot assistant infrastructure. The framework is open-source.
Pages: 7