An Efficient Greedy Incremental Sequence Clustering Algorithm

Authors
Ju, Zhen [1, 2]
Zhang, Huiling [1, 2]
Meng, Jingtao [2]
Zhang, Jingjing [1, 2]
Li, Xuelei [2]
Fan, Jianping [2]
Pan, Yi [2]
Liu, Weiguo [3]
Wei, Yanjie [2]
Affiliations
[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[2] Shenzhen Inst Adv Technol, Chinese Acad Sci, Shenzhen 518005, Peoples R China
[3] Shandong Univ, Jinan 250100, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Greedy incremental alignment; OneAPI; Gene clustering; Filtering; CD-HIT; PROTEIN;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Gene sequence clustering is a fundamental task in computational biology and bioinformatics, supporting the study of phylogenetic relationships, gene function prediction, and related problems. With the rapid growth of biological data (gene and protein sequences), clustering faces growing challenges in both efficiency and precision. For example, gene databases contain many redundant sequences that provide no useful information but still consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a balanced gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool clusters the datasets with a precision of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the proposed method leaves four orders of magnitude fewer redundant sequences. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
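For readers unfamiliar with the general technique, the following is a minimal sketch of greedy incremental clustering with a short word (k-mer) filter. It is not the authors' implementation: the function names, the `threshold_pct` and `k` parameters, and the ungapped position-by-position identity check (used here in place of a real sequence alignment) are all assumptions made for illustration. The pre-filter, data packing strategy, and GPU acceleration mentioned in the abstract are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_matches(rep, seq):
    """Placeholder for alignment: ungapped, position-by-position matches."""
    return sum(1 for x, y in zip(rep, seq) if x == y)


def greedy_incremental_cluster(seqs, threshold_pct=90, k=3):
    """Return {representative index: [member indices]} (hypothetical helper).

    Sequences are visited longest first. Each one joins the first
    representative it is similar enough to, or becomes a new representative.
    """
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    reps = []      # (index, k-mer table) for each representative so far
    clusters = {}
    for i in order:
        seq = seqs[i]
        table = kmer_counts(seq, k)
        length = len(seq)
        # Integer arithmetic avoids floating-point rounding at the threshold.
        min_matches = -(-length * threshold_pct // 100)   # ceiling division
        max_mismatch = length - min_matches
        # Each mismatch can destroy at most k words, so a sequence that meets
        # the threshold must share at least this many words with the rep.
        min_shared = (length - k + 1) - k * max_mismatch
        for r, rep_table in reps:
            shared = sum((table & rep_table).values())
            if shared < min_shared:
                continue  # short word filter: skip the expensive comparison
            if count_matches(seqs[r], seq) >= min_matches:
                clusters[r].append(i)
                break
        else:
            reps.append((i, table))
            clusters[i] = [i]
    return clusters


example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"]
print(greedy_incremental_cluster(example))
```

In this sketch the word-count bound only rejects candidates that cannot reach the threshold under the ungapped comparison, so the filter saves work without changing the result. A real tool would replace `count_matches` with a proper alignment.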
Pages: 596-607
Page count: 12