Characteristic sets of strings common to semi-structured documents

Author
Ikeda, D [1 ]
Affiliation
[1] Kyushu Univ, Ctr Comp, Fukuoka 8128581, Japan
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", "<title> characteristic set") is a characteristic set extracted from an HTML file. An algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those made only of HTML tags, since such sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
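The definitions in the abstract can be illustrated with a small brute-force sketch. It is not the algorithm of the paper, which works on a suffix tree. It only checks the two stated conditions (the suffix chain, and occurrences without overlaps) and counts agreement. Two points are assumptions made here, since the abstract does not fix them: the strings may occur in any order in the document, and agreement is the number of matched positive documents plus the number of rejected negative documents. The function names `is_suffix_chain`, `occurrences`, `matches`, and `agreement` are hypothetical.

```python
from typing import List, Sequence, Tuple


def is_suffix_chain(xs: Sequence[str]) -> bool:
    """Check that each x_i is a suffix of x_{i+1}."""
    return all(longer.endswith(shorter) for shorter, longer in zip(xs, xs[1:]))


def occurrences(text: str, pattern: str) -> List[int]:
    """Return every start position of pattern in text."""
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def matches(xs: Sequence[str], doc: str) -> bool:
    """Assumed matching rule: every string of the chain can be placed in doc
    on pairwise disjoint intervals, in any order (backtracking search)."""
    if not is_suffix_chain(xs):
        return False

    def place(k: int, used: List[Tuple[int, int]]) -> bool:
        if k < 0:
            return True  # all strings placed
        for start in occurrences(doc, xs[k]):
            end = start + len(xs[k])
            # The new interval must not overlap any interval already used.
            if all(end <= s or e <= start for s, e in used):
                if place(k - 1, used + [(start, end)]):
                    return True
        return False

    # Place the longest string first; it has the fewest occurrences.
    return place(len(xs) - 1, [])


def agreement(xs: Sequence[str], positives: Sequence[str],
              negatives: Sequence[str]) -> int:
    """Assumed score: matched positives plus rejected negatives."""
    hit = sum(1 for d in positives if matches(xs, d))
    rejected = sum(1 for d in negatives if not matches(xs, d))
    return hit + rejected


if __name__ == "__main__":
    xs = ("set", "characteristic set", "<title> characteristic set")
    pos = "<title> characteristic set</title> a characteristic set is a set"
    neg = "<title> characteristic set</title>"
    print(matches(xs, pos), matches(xs, neg), agreement(xs, [pos], [neg]))
```

In the negative document of this toy example, the only occurrence of "characteristic set" lies inside the occurrence of "<title> characteristic set", so the no-overlap condition fails and the document is rejected.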
Pages: 139-147
Page count: 9