Extracting corpus-specific strings by using suffix arrays enhanced with longest common prefix

Cited by: 1
Authors
[1] Yoshida, Minoru
[2] Matsumoto, Kazuyuki
[3] Xiao, Qingmei
[4] Keranmu, Xielifuguli
[5] Kita, Kenji
[6] Nakagawa, Hiroshi
Source
Yoshida, Minoru | 2014 / Springer Verlag / Vol. 8870
Keywords
Extraction; Indexing (of information)
DOI
10.1007/978-3-319-12844-3_31
Abstract
We propose a new term extraction algorithm that considers all substrings as term candidates. The algorithm uses a suffix array as a data structure that emulates the suffix tree of the corpus. We use two scoring functions: one detects good substring boundaries as linguistic chunks, and the other finds domain-specific phrases. The two are combined with a re-ranking approach. Experiments show that the proposed all-substring term extraction algorithm performs well on high-frequency terms compared with a baseline algorithm that uses a morphological analyzer in the preprocessing step. © Springer International Publishing Switzerland 2014.
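As a rough illustration of the data structure the abstract refers to, the sketch below builds a suffix array and its longest-common-prefix (LCP) array for a short string, then uses them to count occurrences of a substring and to list repeated substrings. It is not the authors' implementation: the two scoring functions and the re-ranking step are not reproduced, the function names are hypothetical, and the naive sort-based construction is assumed only for brevity (LCP values are computed with Kasai's algorithm).

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    """Kasai's algorithm: lcp[k] is the common-prefix length of the
    suffixes at sa[k-1] and sa[k]; lcp[0] is 0 by convention."""
    n = len(text)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern with two binary searches over sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def repeated_substrings(text, sa, lcp):
    """Collect every substring that occurs at least twice, using the fact
    that each prefix of an adjacent-suffix common prefix is repeated."""
    found = set()
    for k in range(1, len(sa)):
        for length in range(1, lcp[k] + 1):
            found.add(text[sa[k]:sa[k] + length])
    return found


text = "banana"
sa = build_suffix_array(text)
lcp = build_lcp_array(text, sa)
print(sa)                                   # [5, 3, 1, 0, 4, 2]
print(lcp)                                  # [0, 1, 3, 0, 0, 2]
print(count_occurrences(text, sa, "ana"))   # 2
print(sorted(repeated_substrings(text, sa, lcp)))
```

Because every substring is a prefix of some suffix, the occurrences of a candidate form one contiguous range in the suffix array, which is what lets an all-substring approach obtain frequencies without tokenizing the corpus first.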
Related papers
14 in total
  • [1] Extracting Corpus-Specific Strings by Using Suffix Arrays Enhanced with Longest Common Prefix
    Yoshida, Minoru
    Matsumoto, Kazuyuki
    Xiao, Qingmei
    Keranmu, Xielifuguli
    Kita, Kenji
    Nakagawa, Hiroshi
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 360 - 370
  • [2] Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays
    Flick, Patrick
    Aluru, Srinivas
    PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2015
  • [3] Exact Tandem Repeats using Suffix Array and Longest Common Prefix
    Bhukya, Raju
    Naveen, I
    Gupta, Rohan
    Anurag, K.
    Achyuth, A.
    Taruni
    HELIX, 2018, 8 (05) : 3686 - 3691
  • [4] Computing longest common substrings via suffix arrays
    Babenko, Maxim A.
    Starikovskaya, Tatiana A.
    COMPUTER SCIENCE - THEORY AND APPLICATIONS, 2008, 5010 : 64 - 75
  • [5] Longest Common Prefix Arrays for Succinct k-Spectra
    Alanko, Jarno N.
    Biagi, Elena
    Puglisi, Simon J.
    STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2023, 2023, 14240 : 1 - 13
  • [6] A modification of the Landau-Vishkin algorithm computing longest common extensions via suffix arrays
    Miranda, RD
    Ayala-Rincón, M
    ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, PROCEEDINGS, 2005, 3594 : 210 - 213
  • [7] Mapping biological entities using the longest approximately common prefix method
    Alex Rudniy
    Min Song
    James Geller
    BMC Bioinformatics, 15
  • [8] Mapping biological entities using the longest approximately common prefix method
    Rudniy, Alex
    Song, Min
    Geller, James
    BMC BIOINFORMATICS, 2014, 15
  • [9] ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS
    Vyverman, Michael
    De Schrijver, Joachim
    Van Criekinge, Wim
    Dawyndt, Peter
    Fack, Veerle
    BIOINFORMATICS 2011, 2011 : 102 - 107
  • [10] Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus
    Yamamoto, M
    Church, KW
    COMPUTATIONAL LINGUISTICS, 2001, 27 (01) : 1 - 30