Open-source large language models in action: A bioinformatics chatbot for PRIDE database

Cited by: 5
Authors
Bai, Jingwen [1]
Kamatchinathan, Selvakumar [1]
Kundu, Deepti J. [1]
Bandla, Chakradhar [1]
Vizcaino, Juan Antonio [1]
Perez-Riverol, Yasset [1, 2]
Affiliations
[1] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, England
[2] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, England
Funding
UK Biotechnology and Biological Sciences Research Council; Wellcome Trust (UK);
Keywords
bioinformatics; dataset discoverability; documentation; large language models; proteomics; public data; software architectures; training; SPECTROMETRY-BASED PROTEOMICS;
DOI
10.1002/pmic.202400005
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
We here present a chatbot assistant infrastructure that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework also includes a benchmark component based on an Elo ranking system, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively forms a robust and transferable chatbot assistant infrastructure. The framework is open-source.
Pages: 7