Using an Advanced Text Index Structure for Corpus Exploration in Digital Humanities

Cited by: 0
Authors
Englmeier, Tobias [1]
Buechler, Marco [2]
Gerdjikov, Stefan [3]
Schulz, Klaus U. [1]
Affiliations
[1] Ludwig Maximilians Univ Munchen, CIS, Munich, Germany
[2] Univ Gottingen, Inst Comp Sci, Gottingen, Germany
[3] Univ Sofia St Kliment Ohridski, FMI, Sofia, Bulgaria
Source
DIGITAL HUMANITIES QUARTERLY | 2021 / Volume 15 / Issue 01
Keywords
ONLINE CONSTRUCTION;
DOI
Not available
Chinese Library Classification
C [Social Sciences, General];
Subject classification code
03 ; 0303 ;
Abstract
With suitable index structures, many corpus exploration tasks can be solved efficiently without rescanning the text repository in an online manner. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs), a refinement of suffix trees, offer an ideal basis for corpus exploration, helping to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rely on two properties. First, needing linear computation time, the index offers a joint view on the similarities (in terms of common substrings) and differences between all texts. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationships in a language-independent way without using prior linguistic knowledge. As a demonstration of the power of these principles we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, temporal distribution of phrases in diachronic corpora, and related problems.
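As a rough illustration of the first property named in the abstract (an index built in linear time that exposes common substrings), the sketch below uses a plain suffix automaton, i.e. an uncompacted, one-directional DAWG. It is not the SCDAWG of the paper, which is additionally compacted and symmetric, and it is not the authors' implementation. The class name `SuffixAutomaton` and its methods are hypothetical names chosen for this example.

```python
class SuffixAutomaton:
    """Plain suffix automaton (uncompacted, one-directional DAWG) of one text.

    Assumption: this is a simplified stand-in for the SCDAWG in the abstract.
    """

    def __init__(self, text):
        # Per state: outgoing transitions, suffix link, longest string length.
        self.next = [{}]
        self.link = [-1]
        self.length = [0]
        self.last = 0
        for ch in text:
            self._extend(ch)

    def _new_state(self, length, transitions, link):
        self.next.append(dict(transitions))
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def _extend(self, ch):
        # Standard online construction: add one character at the right end.
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q by cloning it with a shorter longest string.
                clone = self._new_state(self.length[p] + 1,
                                        self.next[q], self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        # A pattern is a substring iff it can be read from the start state.
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True

    def longest_common_substring(self, other):
        # Stream `other` through the index, falling back along suffix links.
        state, run = 0, 0
        best_len, best_end = 0, 0
        for i, ch in enumerate(other):
            while state != 0 and ch not in self.next[state]:
                state = self.link[state]
                run = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                run += 1
            else:
                run = 0  # no match even from the start state
            if run > best_len:
                best_len, best_end = run, i + 1
        return other[best_end - best_len:best_end]


index = SuffixAutomaton("the quick brown fox")
shared = index.longest_common_substring("a quick brown dog")
```

Once the index is built, a second text is scanned a single time to find the shared passage, without rescanning the indexed text. This is the kind of query that underlies text reuse detection in the simplest two-text case.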
Pages: 21
Related papers
50 in total
  • [21] Digital Approaches to Text Reuse in the Early Chinese Corpus
    Sturgeon, Donald
    JOURNAL OF CHINESE LITERATURE AND CULTURE, 2018, 5 (02) : 186 - 213
  • [22] DIGITAL HUMANITIES, CORPUS AND LANGUAGE TECHNOLOGIES: A LOOK FROM DIVERSE CASE STUDIES
    Alcaniz, Roque Fernandez
    SIGNA-REVISTA DE LA ASOCIACION ESPANOLA DE SEMIOTICA, 2025, 34 : 649 - 651
  • [23] Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew
    Aynat Rubinstein
    Language Resources and Evaluation, 2019, 53 : 807 - 835
  • [24] The crystal goblet revisited: approaching text as documentation within the digital humanities
    Kosciejew, Marc
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2021, 36 (04) : 934 - 949
  • [25] Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew
    Rubinstein, Aynat
    LANGUAGE RESOURCES AND EVALUATION, 2019, 53 (04) : 807 - 835
  • [26] Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities
    Baraldi, Lorenzo
    Cornia, Marcella
    Grana, Costantino
    Cucchiara, Rita
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 1097 - 1102
  • [27] Text Segmentation Algorithm Focused on Corpus Mining for Oilfield Exploration and Development
    Gong, Xuchao
    Zhang, Miao
    Wang, Zhen
    Liu, He
    Duan, Hongjie
    Yang, Yaozhong
    2024 9TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION SYSTEMS, ICCCS 2024, 2024, : 1375 - 1379
  • [28] Collaborative Perspectives on Translation and the Digital Humanities in the Advanced French Classroom
    Antonioli, Kathleen
    Cro, Melinda A.
    FRENCH REVIEW, 2018, 91 (04): : 130 - 145
  • [29] Methods and Advanced Tools for the Analysis of Film Colors in Digital Humanities
    Flueckiger, Barbara
    Halter, Gaudenz
    DIGITAL HUMANITIES QUARTERLY, 2020, 14 (04):
  • [30] HOW CAN GEOGRAPHIC INFORMATION IN TEXT DOCUMENTS BE VISUALIZED TO SUPPORT INFORMATION EXPLORATION IN THE HUMANITIES?
    Bruggmann, Andre
    Fabrikant, Sara, I
    Purves, Ross S.
    INTERNATIONAL JOURNAL OF HUMANITIES AND ARTS COMPUTING-A JOURNAL OF DIGITAL HUMANITIES, 2020, 14 (1-2): : 98 - 118