An Efficient Greedy Incremental Sequence Clustering Algorithm

Authors
Ju, Zhen [1, 2]
Zhang, Huiling [1, 2]
Meng, Jingtao [2]
Zhang, Jingjing [1, 2]
Li, Xuelei [2]
Fan, Jianping [2]
Pan, Yi [2]
Liu, Weiguo [3]
Wei, Yanjie [2]
Affiliations
[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[2] Shenzhen Inst Adv Technol, Chinese Acad Sci, Shenzhen 518005, Peoples R China
[3] Shandong Univ, Jinan 250100, Peoples R China
Funding
National Science Foundation (USA)
Keywords
Greedy incremental alignment; OneAPI; Gene clustering; Filtering; CD-HIT; PROTEIN
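DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract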
Gene sequence clustering is a fundamental task in computational biology and bioinformatics, underpinning the study of phylogenetic relationships, gene function prediction, and related analyses. With the rapid growth of biological data (gene and protein sequences), clustering faces growing challenges in both efficiency and precision. For example, gene databases contain many redundant sequences that provide no useful information yet consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a balanced gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool clusters the datasets with a precision of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the proposed method leaves four orders of magnitude fewer redundant sequences. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
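As a rough illustration of the general technique named in the abstract, the sketch below shows greedy incremental clustering with a short word (k-mer) filter in the style popularized by CD-HIT. It is not the authors' implementation: all function names are hypothetical, the identity check is a simple ungapped position-by-position comparison standing in for a real alignment, and the pre-filter, data packing strategy, and GPU acceleration described above are not modeled.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def required_matches(length, threshold):
    """Smallest number of matching positions that reaches the threshold."""
    # The small epsilon guards against float noise such as 0.9 * 10 > 9.
    return math.ceil(threshold * length - 1e-9)


def min_shared_words(length, k, threshold):
    """Lower bound on shared k-mers for a pair that meets the threshold.

    Each mismatch can destroy at most k of the (length - k + 1) words.
    """
    max_mismatches = length - required_matches(length, threshold)
    return max(0, length - k + 1 - k * max_mismatches)


def ungapped_matches(query, rep):
    """Stand-in for alignment: count equal characters position by position."""
    return sum(q == r for q, r in zip(query, rep))


def greedy_incremental_cluster(seqs, threshold=0.9, k=3):
    """Return {representative index: [member indices]}.

    Sequences are processed longest first. Each one joins the first
    representative it matches, or becomes a new representative.
    """
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    reps = []       # (sequence index, k-mer counts) for each representative
    clusters = {}
    for i in order:
        query = seqs[i]
        q_words = kmer_counts(query, k)
        needed_words = min_shared_words(len(query), k, threshold)
        needed_matches = required_matches(len(query), threshold)
        for r, r_words in reps:
            # Short word filter: skip the costly comparison when too few
            # k-mers are shared for the threshold to be reachable.
            shared = sum((q_words & r_words).values())
            if shared < needed_words:
                continue
            if ungapped_matches(query, seqs[r]) >= needed_matches:
                clusters[r].append(i)
                break
        else:
            # No representative matched: this sequence starts a new cluster.
            reps.append((i, q_words))
            clusters[i] = [i]
    return clusters
```

For example, `greedy_incremental_cluster(["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"])` groups the first, second, and fourth sequences under the first one and leaves the third as its own representative. Because a sequence only ever joins the first matching representative, the result depends on processing order, which is part of why such tools trade some precision for speed.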
Pages: 596-607
Number of pages: 12