An Efficient Greedy Incremental Sequence Clustering Algorithm

Authors
Ju, Zhen [1, 2]
Zhang, Huiling [1, 2]
Meng, Jingtao [2]
Zhang, Jingjing [1, 2]
Li, Xuelei [2]
Fan, Jianping [2]
Pan, Yi [2]
Liu, Weiguo [3]
Wei, Yanjie [2]
Affiliations
[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[2] Shenzhen Inst Adv Technol, Chinese Acad Sci, Shenzhen 518005, Peoples R China
[3] Shandong Univ, Jinan 250100, Peoples R China
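Funding
National Science Foundation (United States);
Keywords
Greedy incremental alignment; OneAPI; Gene clustering; Filtering; CD-HIT; PROTEIN;
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract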
Gene sequence clustering is a basic and important task in computational biology and bioinformatics, used for example in the study of phylogenetic relationships and in gene function prediction. With the rapid growth of biological data (gene and protein sequences), clustering faces growing challenges in both efficiency and precision. For example, gene databases contain many redundant sequences that provide no valid information but consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a balanced gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool can cluster datasets with a precision of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the proposed method leaves four orders of magnitude fewer redundant sequences. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
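The abstract names the ingredients of the method but not their details, so the following is only a minimal CPU-only sketch of generic greedy incremental clustering with two cheap filters placed before the expensive comparison. It is not the authors' implementation. The length-ratio pre-filter, the k-mer counting bound, the LCS-based identity, and all function names and parameters (`greedy_cluster`, `threshold`, `k`) are assumptions made for illustration. The data packing strategy and GPU acceleration are not modeled.

```python
from collections import Counter
from math import ceil


def kmer_counts(seq, k):
    """Multiset of all length-k words (short words) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def identity(a, b):
    """Assumed identity measure: LCS length over the shorter length.
    Assumes both sequences are non-empty."""
    return lcs_length(a, b) / min(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Return {representative index: [member indices]}.

    Sequences are visited longest first. Each one joins the first
    representative that passes both filters and the identity check,
    otherwise it becomes a new representative.
    """
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    reps = []      # (index, k-mer counts) of each representative
    clusters = {}
    for i in order:
        s = seqs[i]
        s_counts = kmer_counts(s, k)
        # Heuristic short-word bound: each mismatch removes at most k words.
        required = ceil(threshold * len(s) - 1e-9)
        min_shared = len(s) - k + 1 - k * (len(s) - required)
        for r, r_counts in reps:
            # Hypothetical pre-filter: skip pairs with very different lengths.
            if len(s) < threshold * len(seqs[r]):
                continue
            # Short-word filter: too few shared words, skip the alignment.
            if sum((s_counts & r_counts).values()) < min_shared:
                continue
            if identity(s, seqs[r]) >= threshold:
                clusters[r].append(i)
                break
        else:
            # No representative matched: start a new cluster.
            reps.append((i, s_counts))
            clusters[i] = [i]
    return clusters
```

Calling `greedy_cluster(seqs, threshold=0.8, k=3)` on a list of non-empty strings returns the clusters keyed by representative index. Because a sequence joins the first matching representative rather than the best one, the result depends on visiting order, which is the usual speed-for-precision trade-off of greedy incremental methods mentioned above.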
Pages: 596-607
Number of pages: 12