Searching for smallest grammars on large sequences and application to DNA

Cited by: 7
Authors
Carrascosa, Rafael [1]
Coste, Francois [2]
Galle, Matthias [2]
Infante-Lopez, Gabriel [1, 3]
Affiliations
[1] Univ Nacl Cordoba, Grp Procesamiento Lenguaje Nat, Cordoba, Argentina
[2] IRISA INRIA Rennes Bretagne Atlantique, Symbiose Project, Rennes, France
[3] Consejo Nacl Invest Cient & Tecn, Buenos Aires, DF, Argentina
Keywords
Linguistics of DNA; Smallest grammar problem; Structural inference; Maximal repeats
DOI
10.1016/j.jda.2011.04.006
Chinese Library Classification number
O29 [Applied mathematics]
Discipline classification code
070104
Abstract
Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return grammars up to 10% smaller in approximately the same amount of time as their competitors, both on a classical set of genomic sequences and on whole genomes of model organisms. (C) 2011 Elsevier B.V. All rights reserved.
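The two subproblems named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a naive, quadratic toy, and the function names `maximal_repeats`, `min_parse`, and `grammar_size` are hypothetical. It assumes that a maximal repeat is a substring occurring at least twice that cannot be extended to the left or right without losing an occurrence, and that grammar size is measured as the total number of symbols on the right-hand sides.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Naively enumerate all substrings and keep the maximal repeats.

    A repeat is kept when its occurrences are preceded by at least two
    distinct contexts and followed by at least two distinct contexts
    (None stands for a sequence boundary).
    """
    count = defaultdict(int)
    left = defaultdict(set)
    right = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            count[w] += 1
            left[w].add(s[i - 1] if i > 0 else None)
            right[w].add(s[j] if j < n else None)
    return {
        w for w in count
        if count[w] >= 2 and len(left[w]) >= 2 and len(right[w]) >= 2
    }


def min_parse(target, words):
    """Parse target into the fewest symbols by dynamic programming.

    Each symbol is a single character or one of the given words
    (the target itself is excluded so a word never parses as itself).
    """
    n = len(target)
    best = [0] * (n + 1)   # best[i]: fewest symbols covering target[:i]
    back = [0] * (n + 1)   # back[i]: start index of the last symbol
    for i in range(1, n + 1):
        best[i] = best[i - 1] + 1
        back[i] = i - 1
        for w in words:
            k = len(w)
            if w != target and k <= i and target[i - k:i] == w:
                if best[i - k] + 1 < best[i]:
                    best[i] = best[i - k] + 1
                    back[i] = i - k
    parse = []
    i = n
    while i > 0:
        parse.append(target[back[i]:i])
        i = back[i]
    return parse[::-1]


def grammar_size(s, words):
    """Total right-hand-side length when s and every word are parsed minimally."""
    return sum(len(min_parse(t, words)) for t in [s, *words])


if __name__ == "__main__":
    seq = "xabcyabcz"
    constituents = maximal_repeats(seq)
    print(constituents)
    print(min_parse(seq, constituents))
    print(grammar_size(seq, constituents))
```

On the toy sequence `xabcyabcz`, the only maximal repeat is `abc`. Using it as the single constituent gives a grammar of size 8 (five symbols for the sequence plus three for the rule deriving `abc`), against 9 for the raw sequence. Restricting candidates to maximal repeats keeps the constituent set far smaller than the set of all repeated substrings, which is the kind of saving the abstract refers to.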
Pages: 62-72
Page count: 11