Nested Dolls: Towards Unsupervised Clustering of Web Tables

Authors
Khan, Rituparna [1]
Gubanov, Michael [1]
Affiliation
[1] Florida State Univ, Dept Comp Sci, Tallahassee, FL 32306 USA
Funding
U.S. National Science Foundation
Keywords
Web-search; Large-scale Data Management; Big Data; Data Fusion; Data Integration; Data Cleaning; Summarization; Human-Computer Interaction; Machine Learning; Natural Language Processing (NLP)
DOI
Not available
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We discuss our initial efforts towards unsupervised clustering of a large-scale Web tables dataset. We improve on our previous weakly-supervised clustering approach, in which an operator provided a few descriptive keywords to generate an entity-identifying classifier, which was then applied to the corpus to form a cohesive entity-centric cluster [1]. Here, we take the next step towards a fully unsupervised algorithm by generating these descriptive keywords automatically. The keywords can then be used to generate high-precision training data and train a classifier that forms a cluster. We describe and evaluate this new unsupervised keyword generation algorithm and apply it to a large-scale Web tables corpus to form initial small, high-precision clusters.
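The pipeline outlined in the abstract (generate keywords, then use them to pick high-precision training tables) can be illustrated with a minimal sketch. The abstract does not say how the keywords are generated, so the header co-occurrence heuristic below is an assumed stand-in, not the authors' algorithm. The function names `generate_keywords` and `select_seed_tables`, the `top_k` and `min_hits` parameters, and the representation of a table as a list of header strings are all hypothetical.

```python
from collections import Counter


def _rank(counter):
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return [term for term, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]


def _header_set(headers):
    # Normalize header cells so that "Price " and "price" compare equal.
    return {h.strip().lower() for h in headers}


def generate_keywords(tables, top_k=2):
    """Assumed heuristic (not from the paper): take the most frequent header
    term as an anchor, then add the top_k terms that co-occur with it most."""
    header_sets = [_header_set(headers) for headers in tables]
    freq = Counter(term for s in header_sets for term in s)
    if not freq:
        return []
    anchor = _rank(freq)[0]
    co_occurring = Counter()
    for s in header_sets:
        if anchor in s:
            co_occurring.update(s - {anchor})
    return [anchor] + _rank(co_occurring)[:top_k]


def select_seed_tables(tables, keywords, min_hits=2):
    """Return indices of tables whose headers match at least min_hits keywords.
    A higher min_hits trades recall for precision of the seed set."""
    kw = set(keywords)
    return [
        i for i, headers in enumerate(tables)
        if len(kw & _header_set(headers)) >= min_hits
    ]


if __name__ == "__main__":
    # Toy corpus: each table is represented only by its header row.
    tables = [
        ["name", "price", "brand"],
        ["name", "price", "color"],
        ["name", "price", "brand", "weight"],
        ["title", "author", "year"],
        ["city", "population"],
    ]
    keywords = generate_keywords(tables)
    print(keywords)
    print(select_seed_tables(tables, keywords))
```

The selected seed tables would then serve as positive training examples for a classifier that expands the small initial cluster; that training step is omitted here because the abstract gives no detail about the classifier.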
Pages: 5357-5359
Page count: 3