A Versatile Hypergraph Model for Document Collections

Cited by: 1
Authors
Spitz, Andreas [1]
Aumiller, Dennis [2]
Soproni, Balint [2]
Gertz, Michael [2]
Affiliations
[1] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[2] Heidelberg Univ, Heidelberg, Germany
Keywords
COOCCURRENCE DATA; CENTRALITY
DOI
10.1145/3400903.3400919
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models. To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
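The chaining of selection, transformation, and projection operations mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The toy data, the `(type, value)` node encoding, and the function names `select`, `restrict`, and `project` are all assumptions made for illustration, and the pair-counting projection is only one plausible way to derive a dyadic cooccurrence graph from hyperedges.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each hyperedge stands for one sentence and holds
# the typed annotations (type, value) that occur together in it.
hyperedges = [
    {"doc": "d1", "nodes": {("entity", "Heidelberg"), ("entity", "Neckar"), ("term", "river")}},
    {"doc": "d1", "nodes": {("entity", "Heidelberg"), ("term", "university")}},
    {"doc": "d2", "nodes": {("entity", "Heidelberg"), ("entity", "Neckar")}},
]


def select(edges, predicate):
    """Selection: keep only the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def restrict(edges, node_type):
    """Transformation: keep only nodes of one annotation type in each hyperedge."""
    return [{**e, "nodes": {n for n in e["nodes"] if n[0] == node_type}} for e in edges]


def project(edges):
    """Projection: weight each unordered node pair by the number of hyperedges containing both."""
    weights = Counter()
    for e in edges:
        for u, v in combinations(sorted(e["nodes"]), 2):
            weights[(u, v)] += 1
    return weights


# Chain the operations: hyperedges mentioning Heidelberg -> entities only -> dyadic graph.
mentions_heidelberg = lambda e: ("entity", "Heidelberg") in e["nodes"]
graph = project(restrict(select(hyperedges, mentions_heidelberg), "entity"))
```

Because each operation takes and returns a plain list of hyperedges, the steps compose in any order. Only the final projection leaves the hypergraph representation, yielding edge weights that a conventional graph-based method could consume.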
Pages: 12