Searching for smallest grammars on large sequences and application to DNA

Cited by: 7
Authors
Carrascosa, Rafael [1]
Coste, Francois [2]
Galle, Matthias [2]
Infante-Lopez, Gabriel [1, 3]
Affiliations
[1] Univ Nacl Cordoba, Grp Procesamiento Lenguaje Nat, Cordoba, Argentina
[2] IRISA INRIA Rennes Bretagne Atlantique, Symbiose Project, Rennes, France
[3] Consejo Nacl Invest Cient & Tecn, Buenos Aires, DF, Argentina
Keywords
Linguistics of DNA; Smallest grammar problem; Structural inference; Maximal repeats
DOI
10.1016/j.jda.2011.04.006
Chinese Library Classification number
O29 [Applied Mathematics]
Subject classification code
070104
Abstract
Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10% smaller) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms. © 2011 Elsevier B.V. All rights reserved.
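The two sub-problems named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a naive, cubic-time illustration under stated assumptions. The function names `maximal_repeats` and `minimal_parse` and the `min_len` parameter are hypothetical, and the parsing step covers a single sequence only rather than a whole grammar.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=2):
    """Naively enumerate the maximal repeats of s (illustrative only).

    A repeat is a substring with at least two occurrences. It is maximal
    when extending it by one character on either side loses an occurrence,
    i.e. its occurrences show at least two distinct left contexts and at
    least two distinct right contexts.
    """
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[s[i:j]].append(i)
    result = set()
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # A sequence boundary counts as a context distinct from any character.
        left = {s[i - 1] if i > 0 else "^start" for i in starts}
        right = {s[i + len(w)] if i + len(w) < n else "$end" for i in starts}
        if len(left) > 1 and len(right) > 1:
            result.add(w)
    return result


def minimal_parse(s, constituents):
    """Parse s into the fewest symbols, each a single character or a constituent.

    Dynamic programming over prefixes: best[i] is the smallest number of
    symbols covering s[:i], and back[i] is where the last symbol starts.
    """
    n = len(s)
    best = [0] * (n + 1)
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i], back[i] = best[i - 1] + 1, i - 1  # fall back to one character
        for w in constituents:
            j = i - len(w)
            if j >= 0 and s[j:i] == w and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, j
    parse, i = [], n
    while i > 0:
        parse.append(s[back[i]:i])
        i = back[i]
    return parse[::-1]
```

For instance, on the toy sequence `"xabcyabcz"` the only maximal repeat of length at least 2 is `"abc"`, and using it as the sole constituent shortens the parsing from 9 symbols to 5. Restricting candidate constituents to maximal repeats shrinks the search space, since there are far fewer maximal repeats than repeated substrings.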
Pages: 62-72
Number of pages: 11
Related papers
50 in total
  • [21] Algorithm for searching for highly divergent tandem repeats in DNA sequences, statistical tests, and biological application in Drosophila melanogaster genome
    Boeva, V. A.
    Regnier, M.
    Makeev, V. J.
    Proceedings of the Fourth International Conference on Bioinformatics of Genome Regulation and Structure, Vol 1, 2004, : 34 - 37
  • [22] The Smallest Nontrivial Solution to and Related Sequences
    Andrica, Dorin
    Crisan, Vlad
    AMERICAN MATHEMATICAL MONTHLY, 2019, 126 (02): : 173 - 178
  • [23] Extracting grammars from RNA sequences
    Andrejkova, Gabriela
    Lengenova, Helena
    Mati, Michal
    ADAPTIVE AND NATURAL COMPUTING ALGORITHMS, PT 1, 2007, 4431 : 404 - +
  • [24] Categorial Dependency Grammars with Iterated Sequences
    Bechet, Denis
    Foret, Annie
    LOGICAL ASPECTS OF COMPUTATIONAL LINGUISTICS: CELEBRATING 20 YEARS OF LACL (1996-2016), 2016, 10054 : 34 - 51
  • [25] SEARCHING CIRCULAR SEQUENCES
    WEBER, RJ
    CROSS, M
    CARLTON, M
    JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 1968, 78 (4P1): : 588 - &
  • [26] A likelihood ratio approach to familial searching of large DNA databases
    Cowen, Simon
    Thomson, Jim
    FORENSIC SCIENCE INTERNATIONAL GENETICS SUPPLEMENT SERIES, 2008, 1 (01) : 643 - 645
  • [27] A robust method for searching the smallest set of smallest rings with a path-included distance matrix
    Lee, Chang Joon
    Kang, Young-Mook
    Cho, Kwang-Hwi
    No, Kyoung Tai
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2009, 106 (41) : 17355 - 17358
  • [28] Searching gapped palindromes in DNA sequences using Burrows Wheeler type transformation
    Gupta, Shivika
    Prasad, Rajesh
    JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES, 2016, 37 (01): : 51 - 74
  • [29] Fishing in silico: searching for tilapia genes using sequences of microsatellite DNA markers
    Cnaani, A
    Ron, M
    Hulata, G
    Seroussi, E
    ANIMAL GENETICS, 2002, 33 (06) : 474 - 476
  • [30] Searching for target sequences by p53 protein is influenced by DNA length
    Brázda, V
    Jagelská, EB
    Fojta, M
    Palecek, E
    BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS, 2006, 341 (02) : 470 - 477