Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

Cited by: 8
Authors
Gilchrist, Michael J. [1]
Christensen, Mikkel B. [1]
Harland, Richard [2]
Pollet, Nicolas [3, 4]
Smith, James C. [1]
Ueno, Naoto [5]
Papalopulu, Nancy [6]
Affiliations
[1] Univ Cambridge, Wellcome Trust Canc Res UK Gurdon Inst, Cambridge CB2 1QN, England
[2] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA 94720 USA
[3] Univ Evry, CNRS, Epigenom Project, F-91034 Evry, France
[4] CNRS, UMR 8080, F-91405 Orsay, France
[5] Natl Inst Nat Sci, Natl Inst Basic Biol, Dept Dev Biol, Okazaki, Aichi 4448585, Japan
[6] Univ Manchester, Fac Life Sci, Manchester M13 9PT, Lancs, England
Funding
Wellcome Trust (UK); Medical Research Council (UK)
DOI
10.1186/1471-2105-9-442
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene oriented databases, but mostly give indirect access to non-sequence data via navigational links.

Results: Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species.

Conclusion: This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes. Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic scale data resources.
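
The retrieval flow described in the abstract (query sequence in, associated non-sequence records out, ranked by similarity) might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the search tool, so a simple k-mer Jaccard score stands in for a real sequence similarity search, and all names (`search`, `sequences`, `target_data`, the sequence IDs and image file names) are hypothetical.

```python
from typing import Dict, List, Tuple


def kmers(seq: str, k: int = 4) -> set:
    """Return the set of overlapping k-mers in an upper-cased sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of k-mer sets.

    Assumption: a crude stand-in for an alignment-based similarity score,
    not the scoring used in the paper.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def search(
    query: str,
    sequences: Dict[str, str],
    target_data: Dict[str, List[str]],
    min_score: float = 0.2,
) -> List[Tuple[float, str, List[str]]]:
    """Rank database sequences by similarity to the query and return
    the non-sequence records attached to each matching sequence."""
    hits = []
    for seq_id, seq in sequences.items():
        score = similarity(query, seq)
        if score >= min_score:
            hits.append((score, seq_id, target_data.get(seq_id, [])))
    # Most similar sequences first, so their records are listed first.
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits


# Hypothetical toy database: sequences and the images associated with them.
sequences = {
    "seqA": "ATGGCGTACGTTAGCCTAGGA",
    "seqB": "ATGGCGTACGTTAGCATAGGA",  # one substitution relative to seqA
    "seqC": "TTTTTTTTCCCCCCCCGGGG",   # unrelated
}
target_data = {
    "seqA": ["img_001.png"],
    "seqB": ["img_002.png", "img_003.png"],
    "seqC": ["img_004.png"],
}

hits = search("ATGGCGTACGTTAGCCTAGGA", sequences, target_data)
for score, seq_id, records in hits:
    print(f"{seq_id}\t{score:.2f}\t{records}")
```

No gene names or annotation are consulted at any point: the only link between the query and the returned images is the sequence each image is associated with, which is the property the abstract emphasises.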
Pages: 15
Related papers
50 in total
  • [41] Influenza sequence validation and annotation using VADR
    Calhoun, Vincent C.
    Hatcher, Eneida L.
    Yankie, Linda
    Nawrocki, Eric P.
    DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2024, 2024
  • [42] Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations
    Ikram, Najmul
    Qadir, Muhammad Abdul
    Afzal, Muhammad Tanvir
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2018, 15 (03) : 905 - 912
  • [43] Sequence elements, E2 (UUCAC)s, need separable non-sequence elements to direct RNA localization in Xenopus oocytes
    Kwon, S
    Schnapp, BJ
    MOLECULAR BIOLOGY OF THE CELL, 2002, 13 : 522A - 522A
  • [44] ADAPTIVE SEARCH FOR A DATA PROCESSING SEQUENCE
    NISNEVIC.LB
    EPSHTEIN, VL
    AUTOMATION AND REMOTE CONTROL, 1969, (05) : 768 - &
  • [45] Predicting gene dosage using genomic sequence data
    Barker, Jocelyn Elaine
    Sherlock, Gavin
    Hartman, James
    Morgan, William
    FASEB JOURNAL, 2008, 22
  • [46] Recommendation of Child Care Blogs Using Multi-dimensional Sequence Similarity Search
    Yamamoto, Megumi
    Huang, Hung-Hsuan
    Kawagoe, Kyoji
    2011 6TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCES AND CONVERGENCE INFORMATION TECHNOLOGY (ICCIT), 2012, : 266 - 271
  • [47] Approximate similarity search in genomic sequence databases using landmark-guided embedding
    Sacan, Ahmet
    Toroslu, I. Hakki
    SISAP 2008: FIRST INTERNATIONAL WORKSHOP ON SIMILARITY SEARCH AND APPLICATIONS, PROCEEDINGS, 2008, : 43 - +
  • [48] Using homology relations within a database markedly boosts protein sequence similarity search
    Tong, Jing
    Sadreyev, Ruslan I.
    Pei, Jimin
    Kinch, Lisa N.
    Grishin, Nick V.
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2015, 112 (22) : 7003 - 7008
  • [49] Approximate similarity search in genomic sequence databases using landmark-guided embedding
    Sacan, Ahmet
    Toroslu, I. Hakki
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1 AND 2, 2008, : 498 - +
  • [50] Data generation using sequence-to-sequence
    Joshi, Akshat
    Mehta, Kinal
    Gupta, Neha
    Valloli, Varun Kannadi
    2018 IEEE RECENT ADVANCES IN INTELLIGENT COMPUTATIONAL SYSTEMS (RAICS), 2018, : 108 - 112