Searching for smallest grammars on large sequences and application to DNA

Cited by: 7
Authors
Carrascosa, Rafael [1]
Coste, Francois [2]
Galle, Matthias [2]
Infante-Lopez, Gabriel [1,3]
Affiliations
[1] Univ Nacl Cordoba, Grp Procesamiento Lenguaje Nat, Cordoba, Argentina
[2] IRISA INRIA Rennes Bretagne Atlantique, Symbiose Project, Rennes, France
[3] Consejo Nacl Invest Cient & Tecn, Buenos Aires, DF, Argentina
Keywords
Linguistics of DNA; Smallest grammar problem; Structural inference; Maximal repeats
DOI
10.1016/j.jda.2011.04.006
Chinese Library Classification
O29 [Applied mathematics]
Discipline classification code
070104
Abstract
Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10% smaller) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms. (C) 2011 Elsevier B.V. All rights reserved.
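The abstract relies on maximal repeats to restrict which substrings are considered as grammar constituents. As a rough illustration of that notion only, and not the paper's algorithm, the sketch below enumerates maximal repeats by brute force. It assumes the usual definition (a substring occurring at least twice whose occurrences are preceded by at least two different contexts and followed by at least two different contexts, with the sequence boundaries counting as a context). The function name `maximal_repeats` is hypothetical, and the quadratic enumeration is only practical for short strings; work on large sequences typically uses suffix-based index structures instead.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return the set of maximal repeats of s (naive illustration).

    Assumption: a maximal repeat is a substring with at least two
    occurrences that cannot be extended to the left or to the right
    without losing an occurrence. None stands for a sequence boundary.
    """
    n = len(s)
    # Record the start position of every occurrence of every substring.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = set()
    for word, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Characters immediately before and after each occurrence.
        left = {s[i - 1] if i > 0 else None for i in starts}
        right = {s[i + len(word)] if i + len(word) < n else None for i in starts}
        # Maximal if neither side has one context shared by all occurrences.
        if len(left) > 1 and len(right) > 1:
            result.add(word)
    return result


if __name__ == "__main__":
    print(maximal_repeats("xabcyabcz"))
```

In `"xabcyabcz"`, the substring `abc` is a maximal repeat, whereas `ab` is not, because every occurrence of `ab` is followed by `c` and can therefore be extended.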
Pages: 62-72
Page count: 11