LLM4THP: a computing tool to identify tumor homing peptides by molecular and sequence representation of large language model based on two-layer ensemble model strategy

Cited by: 0

Authors:

Yang, Sen [1, 2]

Xu, Piao [3]

Affiliations:

[1] Changzhou Univ, Sch Comp Sci & Artificial Intelligence, Aliyun Sch Big Data Sch Software, Changzhou 213164, Peoples R China

[2] Nanjing Med Univ, Affiliated Changzhou Peoples Hosp 2, Changzhou 213164, Peoples R China

[3] Nanjing Forestry Univ, Coll Econ & Management, Nanjing 210037, Peoples R China

Source:

AMINO ACIDS | 2024 / Volume 56 / Issue 01

Keywords:

Tumor homing peptides; Computational method; Large language models; Peptide sequence encoding; Ensemble strategy

DOI:

10.1007/s00726-024-03422-5

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology]

Discipline classification code:

071010; 081704

Abstract:

Tumor homing peptides (THPs) bind specifically to tumor cells, which makes them a promising tool for targeted cancer treatment and detection. Despite this potential, identifying THPs by conventional methods is both time-consuming and expensive. To address this, we present LLM4THP, a computational approach that uses large language models (LLMs) to detect THPs quickly and effectively. LLM4THP uses two protein LLMs, ESM2 and Prot_T5_XL_UniRef50, to encode peptide sequences, capturing complex patterns and relationships within the peptide data. In addition, we use intrinsic sequence features such as Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), Amphiphilic Pseudo Amino Acid Composition (APAAC), and Composition, Transition, and Distribution (CTD) to improve the representation of peptides. The RDKitDescriptors feature representation converts peptide sequences into molecular objects and computes their chemical characteristics, which further improves THP identification. The LLM4THP ensemble strategy combines these features in a two-layer learning architecture. The first layer consists of LightGBM, XGBoost, Random Forest, and Extremely Randomized Trees, which generate a set of meta results. The second layer uses Logistic Regression to refine the final identification of each sequence as either THP or non-THP. LLM4THP outperforms state-of-the-art methods, with improvements in accuracy, Matthews correlation coefficient, F1 score, area under the curve, and average precision. The source code and dataset are available at https://github.com/abcair/LLM4THP.
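
Of the sequence features named in the abstract, Amino Acid Composition (AAC) is the simplest: it is the fraction of each of the 20 standard residues in a peptide. The following is a minimal sketch of that descriptor only, not the authors' implementation. The function name `amino_acid_composition`, the choice to skip non-standard residues, and the example peptide string are assumptions made for illustration.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in alphabetical one-letter order.
# This ordering is an assumption; any fixed order gives an equivalent feature.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return a 20-dimensional vector of residue fractions for `sequence`.

    Non-standard characters are skipped (an assumption of this sketch),
    so the fractions are taken over the standard residues only.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Illustrative peptide string chosen for this sketch.
    features = amino_acid_composition("CGNKRTRGC")
    for aa, value in zip(AMINO_ACIDS, features):
        if value > 0:
            print(f"{aa}: {value:.3f}")
```

A vector like this would presumably be concatenated with the other descriptors and the LLM embeddings before being passed to the first-layer classifiers; how the features are actually combined is specified in the paper and repository, not in this sketch.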

Pages: 17