Using an Advanced Text Index Structure for Corpus Exploration in Digital Humanities

Authors
Englmeier, Tobias [1]
Buechler, Marco [2]
Gerdjikov, Stefan [3]
Schulz, Klaus U. [1]
Affiliations
[1] Ludwig Maximilians Univ Munchen, CIS, Munich, Germany
[2] Univ Gottingen, Inst Comp Sci, Gottingen, Germany
[3] Univ Sofia St Kliment Ohridski, FMI, Sofia, Bulgaria
Source
DIGITAL HUMANITIES QUARTERLY | 2021 / Vol. 15 / No. 01
Keywords
ONLINE CONSTRUCTION;
DOI
Not available
Chinese Library Classification (CLC) number
C [Social Sciences, General];
Subject classification code
03 ; 0303 ;
Abstract
With suitable index structures, many corpus exploration tasks can be solved efficiently without rescanning the text repository online. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs) - a refinement of suffix trees - offer an ideal basis for corpus exploration, helping to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rest on two properties. First, requiring only linear computation time, the index offers a joint view of the similarities (in terms of common substrings) and differences between all texts. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationships in a language-independent way, without using prior linguistic knowledge. As a demonstration of the power of these principles we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, temporal distribution of phrases in diachronic corpora, and related problems.
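The first property above (a linear-time index that exposes common substrings between texts) can be illustrated with a much simpler relative of the SCDAWG. The sketch below is not the authors' implementation: it builds a plain suffix automaton (an uncompacted, one-directional DAWG) over one text and uses it to find the longest substring shared with a second text. The class name `SuffixAutomaton` and the method `longest_common_substring` are hypothetical names chosen for this illustration.

```python
class SuffixAutomaton:
    """Plain suffix automaton (uncompacted DAWG) over a single text.

    Illustrative stand-in only: an SCDAWG is additionally compacted
    and symmetric (extensible to the left and to the right).
    """

    def __init__(self, text):
        self.next = [{}]    # outgoing transitions of each state
        self.link = [-1]    # suffix link of each state
        self.length = [0]   # length of the longest string reaching each state
        self.last = 0       # state that represents the whole text read so far
        for ch in text:
            self._extend(ch)

    def _new_state(self, length, transitions, link):
        self.next.append(dict(transitions))
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def _extend(self, ch):
        """Online construction step: append one character to the indexed text."""
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q so that the shorter strings get their own state.
                clone = self._new_state(self.length[p] + 1, self.next[q], self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def longest_common_substring(self, other):
        """Return the longest substring of `other` that occurs in the indexed text."""
        state, matched, best_len, best_end = 0, 0, 0, 0
        for i, ch in enumerate(other):
            # Follow suffix links until the current match can be extended by ch.
            while state != 0 and ch not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                matched += 1
            else:
                matched = 0  # only reachable in the initial state
            if matched > best_len:
                best_len, best_end = matched, i + 1
        return other[best_end - best_len:best_end]


index = SuffixAutomaton("in the beginning was the word")
shared = index.longest_common_substring("and the word was with god")
```

The index is built once, and each query text is then scanned a single time against it, which is the sense in which the repository does not need to be rescanned. Text reuse detection over many documents would require the generalized, compacted structure the abstract describes; this sketch covers only the two-text case.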
Pages: 21