Extracting corpus-specific strings by using suffix arrays enhanced with longest common prefix

Cited by: 1

Authors:

[1] Yoshida, Minoru

[2] Matsumoto, Kazuyuki

[3] Xiao, Qingmei

[4] Keranmu, Xielifuguli

[5] Kita, Kenji

[6] Nakagawa, Hiroshi

Source:

Yoshida, Minoru | 2014 / Springer Verlag / Vol. 8870

Keywords:

Extraction; Indexing (of information)

DOI:

10.1007/978-3-319-12844-3_31

Abstract:

We propose a new term extraction algorithm that considers all substrings of the corpus as term candidates. The algorithm uses a suffix array as a data structure that emulates the suffix tree of the corpus. We use two scoring functions: one detects substring boundaries that correspond to linguistic chunks, and the other finds domain-specific phrases. The two scores are combined with a re-ranking approach. Experiments show that the proposed all-substring term extraction algorithm performs well on high-frequency terms compared with a baseline algorithm that uses a morphological analyzer in the preprocessing step. © Springer International Publishing Switzerland 2014.
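As an illustration of the data structure the abstract mentions, the following is a minimal sketch of how a suffix array plus a longest-common-prefix (LCP) array can enumerate every repeated substring of a text together with its frequency. It is not the authors' implementation: the function names are hypothetical, the suffix array is built by naive sorting, the LCP array uses Kasai's algorithm, and the two scoring functions and the re-ranking step from the abstract are not reproduced.

```python
from collections import defaultdict


def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Adequate for a small demo; real corpora need a linear-time builder.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # Kasai's algorithm. lcp[k] is the length of the common prefix of the
    # suffixes at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def repeated_substrings(text):
    # Suffixes that start with a substring s form one contiguous block in
    # the suffix array. A block of f suffixes has f - 1 adjacent pairs whose
    # LCP is at least len(s), so frequency = (number of such pairs) + 1.
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    pairs = defaultdict(int)
    for k in range(1, len(sa)):
        for length in range(1, lcp[k] + 1):
            pairs[text[sa[k]:sa[k] + length]] += 1
    return {s: c + 1 for s, c in pairs.items()}


if __name__ == "__main__":
    print(repeated_substrings("banana"))
```

For the text "banana" this reports "a" three times and "an", "ana", "n", and "na" twice each. The explicit enumeration is quadratic in the worst case; it is written this way only to make the link between LCP values and substring frequencies visible.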

Citations

14 items in total

[1] Extracting Corpus-Specific Strings by Using Suffix Arrays Enhanced with Longest Common Prefix
Yoshida, Minoru
Matsumoto, Kazuyuki
Xiao, Qingmei
Keranmu, Xielifuguli
Kita, Kenji
Nakagawa, Hiroshi
INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 360 - 370
[2] Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays
Flick, Patrick
Aluru, Srinivas
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2015,
[3] Exact Tandem Repeats using Suffix Array and Longest Common Prefix
Bhukya, Raju
Naveen, I
Gupta, Rohan
Anurag, K.
Achyuth, A.
Taruni
HELIX, 2018, 8 (05) : 3686 - 3691
[4] Computing longest common substrings via suffix arrays
Babenko, Maxim A.
Starikovskaya, Tatiana A.
COMPUTER SCIENCE - THEORY AND APPLICATIONS, 2008, 5010 : 64 - 75
[5] Longest Common Prefix Arrays for Succinct k-Spectra
Alanko, Jarno N.
Biagi, Elena
Puglisi, Simon J.
STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2023, 2023, 14240 : 1 - 13
[6] A modification of the Landau-Vishkin algorithm computing longest common extensions via suffix arrays
Miranda, RD
Ayala-Rincón, M
ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, PROCEEDINGS, 2005, 3594 : 210 - 213
[7] Mapping biological entities using the longest approximately common prefix method
Alex Rudniy
Min Song
James Geller
BMC Bioinformatics, 15
[8] Mapping biological entities using the longest approximately common prefix method
Rudniy, Alex
Song, Min
Geller, James
BMC BIOINFORMATICS, 2014, 15
[9] ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS
Vyverman, Michael
De Schrijver, Joachim
Van Criekinge, Wim
Dawyndt, Peter
Fack, Veerle
BIOINFORMATICS 2011, 2011 : 102 - 107
[10] Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus
Yamamoto, M
Church, KW
COMPUTATIONAL LINGUISTICS, 2001, 27 (01) : 1 - 30
