Characteristic sets of strings common to semi-structured documents

Author
Ikeda, D [1 ]
Affiliation
[1] Kyushu Univ, Ctr Comp, Fukuoka 8128581, Japan
Source
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_(i+1) and all the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", "<title> characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those consisting only of HTML tags, because such sets may match most of the positive documents but also most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
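The definitions in the abstract can be made concrete with a small brute-force sketch. It is not the paper's O(n^2 h^d) suffix-tree algorithm. It only checks whether a given characteristic set matches a document and scores it against labeled documents. The function names (`is_suffix_chain`, `occurrences`, `matches`, `agreement`) are hypothetical. The sketch also rests on two assumptions the abstract does not spell out: the strings may occur in any order in the document, and agreement is counted as matched positive documents plus rejected negative documents.

```python
def is_suffix_chain(xs):
    """True if each string is a suffix of the next one in the sequence."""
    return all(longer.endswith(shorter) for shorter, longer in zip(xs, xs[1:]))


def occurrences(text, s):
    """Return every (start, end) span at which s occurs in text."""
    spans = []
    i = text.find(s)
    while i != -1:
        spans.append((i, i + len(s)))
        i = text.find(s, i + 1)
    return spans


def matches(xs, text):
    """Check that all strings of xs occur in text at pairwise disjoint spans.

    Assumption: occurrences may appear in any order. This is exhaustive
    backtracking, fine for small d but not the paper's algorithm.
    """
    if not is_suffix_chain(xs):
        raise ValueError("not a characteristic set: suffix condition fails")
    candidates = [occurrences(text, s) for s in xs]

    def place(k, chosen):
        if k == len(candidates):
            return True
        for a, b in candidates[k]:
            # Keep this span only if it is disjoint from every chosen span.
            if all(b <= c or d <= a for c, d in chosen):
                if place(k + 1, chosen + [(a, b)]):
                    return True
        return False

    return place(0, [])


def agreement(xs, positives, negatives):
    """Assumed score: matched positives plus rejected negatives."""
    hit = sum(1 for t in positives if matches(xs, t))
    rejected = sum(1 for t in negatives if not matches(xs, t))
    return hit + rejected


# Example data (made up for illustration).
xs = ("set", "characteristic set", "<title> characteristic set")
pos = "<title> characteristic set</title> A characteristic set is a set of strings"
neg = "<title> characteristic set</title>"
score_full = agreement(xs, [pos], [neg])
score_tags = agreement(("<title>",), [pos], [neg])
```

In this toy data the tag-only set `("<title>",)` matches both documents, so it scores lower than the three-string set, which is the behavior the abstract describes for characteristic sets made only of HTML tags.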
Pages: 139-147
Number of pages: 9