An Efficient Greedy Incremental Sequence Clustering Algorithm

Cited by: 0

Authors:

Ju, Zhen [1,2]

Zhang, Huiling [1,2]

Meng, Jingtao [2]

Zhang, Jingjing [1,2]

Li, Xuelei [2]

Fan, Jianping [2]

Pan, Yi [2]

Liu, Weiguo [3]

Wei, Yanjie [2]

Affiliations:

[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

[2] Shenzhen Inst Adv Technol, Chinese Acad Sci, Shenzhen 518005, Peoples R China

[3] Shandong Univ, Jinan 250100, Peoples R China

Source:

BIOINFORMATICS RESEARCH AND APPLICATIONS, ISBRA 2021 | 2021 / Vol. 13064

Funding:

National Science Foundation (USA);

Keywords:

Greedy incremental alignment; OneAPI; Gene clustering; Filtering; CD-HIT; PROTEIN;

DOI:

Not available

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Gene sequence clustering is a fundamental task in computational biology and bioinformatics, underpinning the study of phylogenetic relationships, gene function prediction, and related analyses. With the rapid growth of biological data (gene and protein sequences), clustering faces increasing challenges in both efficiency and precision. For example, gene databases contain many redundant sequences that provide no useful information yet consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool clusters datasets with a precision of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the number of redundant sequences left by the proposed method is four orders of magnitude lower. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
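For readers unfamiliar with the general technique, the following is a minimal sketch of greedy incremental clustering with a short word (k-mer) filter. It is not the authors' tool and omits the pre-filter, data packing, and GPU acceleration named in the abstract. The function names, the integer percentage threshold, the choice of k, and the use of plain edit distance as the similarity measure are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def greedy_cluster(seqs: List[str], identity_pct: int = 90, k: int = 4) -> Dict[str, List[str]]:
    """Hypothetical helper: maps each representative to its cluster members.

    Assumption: a sequence joins a cluster when its edit distance to the
    representative is at most (100 - identity_pct)% of the representative length.
    """
    clusters: Dict[str, List[str]] = {}
    rep_words: Dict[str, Set[str]] = {}
    # Longest sequences first, so representatives are never shorter than members.
    for seq in sorted(seqs, key=len, reverse=True):
        words = kmers(seq, k)
        for rep in clusters:
            max_edits = len(rep) * (100 - identity_pct) // 100
            # Short word filter: one edit can destroy at most k words, so a
            # sequence within max_edits must still share this many words.
            required = len(words) - k * max_edits
            shared = sum(1 for w in words if w in rep_words[rep])
            if shared < required:
                continue  # cannot reach the threshold; skip the costly alignment
            if edit_distance(seq, rep) <= max_edits:
                clusters[rep].append(seq)
                break
        else:
            # No representative was close enough: seq starts a new cluster.
            clusters[seq] = [seq]
            rep_words[seq] = set(words)
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA",
            "TTTTGGGGCCCCAAAATTTT", "ACGTACGTACGTACGTAC"]
    for rep, members in greedy_cluster(demo).items():
        print(rep, len(members))
```

The filter only rejects pairs that cannot meet the threshold, so in this sketch it changes the running time but not the resulting clusters.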


Pages: 596-607

Page count: 12

