A novel lossless encoding algorithm for data compression: genomics data as an exemplar

Times cited: 0
Authors
Al-okaily, Anas [1 ]
Tbakhi, Abdelghani [2 ]
Affiliations
[1] King Hussein Canc Ctr, Dept Cell Therapy Appl Genom, Amman, Jordan
[2] McMaster Univ, Dept Pathol & Mol Med, Hamilton, ON, Canada
Source
Keywords
compression; Huffman encoding; LZ; genomics; BWT; SEQUENCES; FORMAT;
DOI
10.3389/fbinf.2024.1489704
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and its associated characteristics. The proposed algorithm follows a divide-and-conquer approach: it scans the whole genome, classifies subsequences based on similarities in their content, and bins similar subsequences together. The data in each bin are then compressed independently. This approach differs from the currently known approaches, namely entropy, dictionary, predictive, and transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset of seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, saving several megabytes compared to state-of-the-art tools. Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are being generated daily in unprecedented volumes.
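The abstract describes the approach only at a high level (scan, classify subsequences by content, bin similar ones, compress each bin independently), so the following is a minimal illustrative sketch of that general idea and not the authors' algorithm. Everything specific in it is an assumption made for the example: the fixed block length `BLOCK`, the toy similarity label `bin_key` (most frequent symbol in a block), and the use of Python's standard `zlib` as the per-bin compressor. The function names `compress_binned` and `decompress_binned` are hypothetical.

```python
import zlib
from collections import Counter

BLOCK = 64  # hypothetical block length; the abstract does not specify one


def bin_key(block: str) -> str:
    """Toy similarity label (an assumption): the block's most frequent symbol.

    Ties are broken alphabetically so the label is deterministic.
    """
    counts = Counter(block)
    return min(counts, key=lambda s: (-counts[s], s))


def compress_binned(seq: str):
    """Split seq into blocks, bin blocks by label, compress each bin on its own."""
    blocks = [seq[i:i + BLOCK] for i in range(0, len(seq), BLOCK)]
    order = [bin_key(b) for b in blocks]  # which bin each block went to
    bins = {}
    for key, block in zip(order, blocks):
        bins.setdefault(key, []).append(block)
    # Each bin is compressed independently of the others.
    packed = {k: zlib.compress("".join(v).encode("ascii")) for k, v in bins.items()}
    return order, packed, len(seq)


def decompress_binned(order, packed, total_len):
    """Invert compress_binned: decompress each bin, then re-interleave blocks."""
    streams = {k: zlib.decompress(v).decode("ascii") for k, v in packed.items()}
    offsets = dict.fromkeys(streams, 0)
    out = []
    remaining = total_len
    for key in order:
        # Only the final block can be shorter than BLOCK.
        n = min(BLOCK, remaining)
        start = offsets[key]
        out.append(streams[key][start:start + n])
        offsets[key] = start + n
        remaining -= n
    return "".join(out)


if __name__ == "__main__":
    seq = "ACGT" * 100 + "A" * 200 + "GC" * 77
    order, packed, n = compress_binned(seq)
    assert decompress_binned(order, packed, n) == seq
```

The sketch keeps the list of bin labels (`order`) alongside the compressed bins because losslessness requires knowing where each block came from; a real encoder would also have to store that side information compactly, which this example does not attempt.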
Pages: 8