Benchmarking DNA large language models on quadruplexes

Authors
Cherednichenko, Oleksandr [1]
Herbert, Alan [1, 2]
Poptsova, Maria [1]
Affiliations
[1] HSE Univ, Int Lab Bioinformat, Moscow, Russia
[2] InsideOutBio, Charlestown, MA USA
Keywords
Foundation model; Large language model; DNABERT; HyenaDNA; MAMBA-DNA; Caduceus; Flipons; Non-B DNA; G-quadruplexes
DOI
10.1016/j.csbj.2025.03.007
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked three different types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although a GQ forms by folding guanosine residues into tetrads, the computational task is challenging because the bases involved may be on different strands, separated by a large number of nucleotides, or made from RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping in the generated de novo quadruplexes, while transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model based on the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs.
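The abstract ranks models by F1 and MCC (Matthews correlation coefficient). As a minimal sketch of how these two metrics are computed for a binary task such as "window contains a G-quadruplex or not", the Python below uses the standard textbook definitions. The function names and the toy labels are illustrative assumptions and are not taken from the article or its repository.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = quadruplex, 0 = background)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, made up for illustration only (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"F1 = {f1_score(tp, fp, fn):.2f}, MCC = {mcc(tp, fp, fn, tn):.2f}")
```

Unlike F1, MCC also takes true negatives into account, which is one reason the two are often reported together when positives are rare relative to background, as is typical for whole-genome annotation.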
Pages: 992-1000
Number of pages: 9