Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Cited by: 12
Authors
Fan, Grace [1]
Wang, Jin [2]
Li, Yuliang [2]
Zhang, Dan [2]
Miller, Renee J. [1]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Megagon Labs, Mountain View, CA USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Volume 16 / Issue 07
Keywords
WEB TABLES; SEARCH;
DOI
10.14778/3587136.3587146
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
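The abstract's scoring idea can be illustrated with a minimal sketch. It uses cosine similarity between column embedding vectors as the column unionability score, as the abstract states. The table-level aggregation (a greedy one-to-one column matching, averaged over the query columns) is only one of the possible design choices and is an assumption made here for illustration, not necessarily the method used in Starmie. The function names `cosine` and `table_unionability`, the `threshold` parameter, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def table_unionability(query_cols, cand_cols, threshold=0.0):
    """Illustrative table-level score (an assumption, not the paper's method).

    Greedily pairs query and candidate columns in descending order of
    cosine similarity, using each column at most once, ignores pairs
    scoring below `threshold`, and averages over the query columns.
    """
    if not query_cols:
        return 0.0
    # Score every (query column, candidate column) pair, best first.
    pairs = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(cand_cols)
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, i, j in pairs:
        if score < threshold:
            break  # all remaining pairs score even lower
        if i in used_q or j in used_c:
            continue  # one of the two columns is already matched
        used_q.add(i)
        used_c.add(j)
        total += score
    return total / len(query_cols)


# Toy 2-dimensional "embeddings"; real column encoders produce far
# higher-dimensional vectors.
query_table = [[1.0, 0.0], [0.0, 1.0]]
candidate_table = [[0.0, 2.0], [3.0, 0.0]]
score = table_unionability(query_table, candidate_table)
```

A linear scan would apply such a score to every table in the data lake. The HNSW index mentioned in the abstract replaces that scan with an approximate nearest-neighbor lookup over the column vectors.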
Pages: 1726 - 1739
Page count: 14