A Versatile Hypergraph Model for Document Collections

Cited by: 1
Authors
Spitz, Andreas [1]
Aumiller, Dennis [2]
Soproni, Balint [2]
Gertz, Michael [2]
Affiliations
[1] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[2] Heidelberg Univ, Heidelberg, Germany
Keywords
COOCCURRENCE DATA; CENTRALITY
DOI
10.1145/3400903.3400919
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models. To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
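The chaining of selection, transformation, and projection operations mentioned in the abstract could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Hyperedge`, `select`, `restrict`, and `project`, the sample data, and the choice of edge weight (the number of hyperedges in which two nodes occur together) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical node encoding: (node type, label), e.g. ("entity", "Heidelberg").
Node = Tuple[str, str]


@dataclass(frozen=True)
class Hyperedge:
    """One hyperedge: all annotated nodes found in one segment of a document."""
    level: str               # assumed granularity label, e.g. "sentence"
    doc_id: str
    nodes: FrozenSet[Node]


def select(edges: Iterable[Hyperedge],
           predicate: Callable[[Hyperedge], bool]) -> List[Hyperedge]:
    """Selection: keep only the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def restrict(edges: Iterable[Hyperedge], node_type: str) -> List[Hyperedge]:
    """Transformation: drop all nodes that are not of the given type."""
    return [
        Hyperedge(e.level, e.doc_id,
                  frozenset(n for n in e.nodes if n[0] == node_type))
        for e in edges
    ]


def project(edges: Iterable[Hyperedge]) -> Counter:
    """Projection: build a dyadic cooccurrence graph as weighted node pairs.

    The weight of a pair is the number of hyperedges containing both nodes
    (an assumed weighting, picked for simplicity).
    """
    weights: Counter = Counter()
    for e in edges:
        for u, v in combinations(sorted(e.nodes), 2):
            weights[(u, v)] += 1
    return weights


# Hypothetical sample collection with two granularity levels.
edges = [
    Hyperedge("sentence", "d1", frozenset({
        ("entity", "Heidelberg"), ("entity", "Germany"), ("term", "university"),
    })),
    Hyperedge("sentence", "d1", frozenset({
        ("entity", "Heidelberg"), ("entity", "Germany"),
    })),
    Hyperedge("document", "d2", frozenset({
        ("entity", "Lausanne"), ("entity", "Switzerland"),
    })),
]

# Chain the operations: sentence-level edges, entities only, then project.
sentence_edges = select(edges, lambda e: e.level == "sentence")
entity_edges = restrict(sentence_edges, "entity")
graph = project(entity_edges)

for (u, v), w in graph.items():
    print(u, v, w)
```

Because each operation takes and returns plain collections of hyperedges, the steps can be reordered or repeated before the final projection, which is the property the abstract refers to as chaining.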
Pages: 12