A Versatile Hypergraph Model for Document Collections

Cited by: 1

Authors:

Spitz, Andreas [1]

Aumiller, Dennis [2]

Soproni, Balint [2]

Gertz, Michael [2]

Affiliations:

[1] Ecole Polytech Fed Lausanne, Lausanne, Switzerland

[2] Heidelberg Univ, Heidelberg, Germany

Source:

PROCEEDINGS OF THE 32ND INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, SSDBM 2020 | 2020

Keywords:

COOCCURRENCE DATA; CENTRALITY;

DOI:

10.1145/3400903.3400919

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models. To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
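The selection and projection operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the abstract reports PostgreSQL and Neo4j backends). The toy data, the dictionary layout, and the function names `select` and `project` are all hypothetical, and it assumes each hyperedge is one sentence containing a set of annotated terms or entities.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each hyperedge stands for one sentence and holds the
# set of annotated nodes (terms or entities) in it, plus its document id.
hyperedges = [
    {"doc": "d1", "nodes": {"Heidelberg", "university", "Germany"}},
    {"doc": "d1", "nodes": {"Heidelberg", "Germany"}},
    {"doc": "d2", "nodes": {"Lausanne", "Switzerland", "university"}},
]


def select(edges, predicate):
    """Selection (assumed form): keep the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def project(edges):
    """Projection (assumed form): count, for every unordered node pair,
    the number of hyperedges that contain both nodes."""
    weights = Counter()
    for e in edges:
        # Sorting gives each pair a canonical (u, v) order.
        for u, v in combinations(sorted(e["nodes"]), 2):
            weights[(u, v)] += 1
    return weights


# Chain the two operations: restrict to document d1, then project
# the remaining hyperedges to a weighted dyadic cooccurrence graph.
d1_edges = select(hyperedges, lambda e: e["doc"] == "d1")
graph = project(d1_edges)
```

Chaining works here because `select` returns a list of hyperedges in the same shape it received, so any number of selections can precede the final projection.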

Pages: 12
