Characteristic sets of strings common to semi-structured documents

Cited by: 0
Authors
Ikeda, D [1]
Affiliation
[1] Kyushu Univ, Ctr Comp, Fukuoka 8128581, Japan
Source
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", "<title> characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those consisting only of HTML tags, since such sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
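The definition in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm (which works on a suffix tree); it is a brute-force check written under assumptions the abstract does not settle: the non-overlapping occurrences may appear in any order in the document, and "agreement" is counted as the number of positive documents matched plus the number of negative documents rejected. The function names `is_suffix_chain`, `occurrences`, `matches`, and `agreement` are hypothetical.

```python
from typing import List, Sequence, Tuple


def is_suffix_chain(xs: Sequence[str]) -> bool:
    """True if each string is a suffix of the next one in the sequence."""
    return all(longer.endswith(shorter) for shorter, longer in zip(xs, xs[1:]))


def occurrences(text: str, pattern: str) -> List[Tuple[int, int]]:
    """All (start, end) spans of pattern in text, including overlapping ones."""
    spans = []
    start = text.find(pattern)
    while start != -1:
        spans.append((start, start + len(pattern)))
        start = text.find(pattern, start + 1)
    return spans


def matches(xs: Sequence[str], text: str) -> bool:
    """Assumed matching rule: xs is a suffix chain of non-empty strings and
    every string can be given its own occurrence in text so that no two
    chosen occurrences overlap (order of occurrences is not constrained)."""
    if not xs or any(not x for x in xs) or not is_suffix_chain(xs):
        return False

    def place(i: int, used: List[Tuple[int, int]]) -> bool:
        # Place the longest strings first; they have the fewest occurrences.
        if i < 0:
            return True
        for start, end in occurrences(text, xs[i]):
            if all(end <= u_start or start >= u_end for u_start, u_end in used):
                if place(i - 1, used + [(start, end)]):
                    return True
        return False

    return place(len(xs) - 1, [])


def agreement(xs: Sequence[str], positives: Sequence[str],
              negatives: Sequence[str]) -> int:
    """Assumed score: positives matched plus negatives rejected."""
    matched = sum(1 for doc in positives if matches(xs, doc))
    rejected = sum(1 for doc in negatives if not matches(xs, doc))
    return matched + rejected
```

For instance, with `xs = ("set", "characteristic set", "<title> characteristic set")`, a document needs three disjoint occurrences: one of the titled phrase, a separate one of "characteristic set", and a further separate "set". A candidate made only of HTML tags would score poorly under `agreement` because it would also match the negative documents.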
Pages: 139-147
Number of pages: 9
Related papers
50 in total
  • [41] Rationale in Semi-structured Processes
    Kannengiesser, Udo
    Zhu, Liming
    BUSINESS PROCESS MANAGEMENT WORKSHOPS, 2011, 66 : 634 - +
  • [42] Autonomous vehicles in structured and semi-structured environments
    Ozguner, U
    Redmill, K
    Ogras, U
    Dagci, O
    Launsbach, M
    PROCEEDINGS OF THE 41ST IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-4, 2002, : 124 - 129
  • [43] Inferring structure and meaning of semi-structured documents by using a Gibbs sampling based approach
    Ravindranath, Vinodh Kumar
    Deshpande, Devashish
    Girish, K. Venkata Vijay
    Patel, Darshan
    Jambhekar, Neel
    2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION WORKSHOPS (ICDARW), VOL 5, 2019, : 169 - 174
  • [44] RETRACTED: Extracting Information from Semi-structured Web Documents: A Framework (Retracted Article)
    Memon, Nasrullah
    Qureshi, Abdul Rasool
    Hicks, David
    Harkiolakis, Nicholas
    ADVANCED WEB AND NETWORK TECHNOLOGIES, AND APPLICATIONS, 2008, 4977 : 54 - +
  • [45] Selection Fusion in Semi-Structured Retrieval
    Norozi, Muhammad Ali
    Arvola, Paavo
    PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 1291 - 1300
  • [46] Multigrid Methods on Semi-Structured Grids
    Carmen Rodrigo
    Francisco J. Gaspar
    Francisco J. Lisbona
    Archives of Computational Methods in Engineering, 2012, 19 : 499 - 538
  • [47] Query optimization for semi-structured data
    Li, GY
    Bian, S
    Zhang, J
    Xie, Y
    PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MANAGEMENT SCIENCE & ENGINEERING, VOLS 1 AND 2, 2004, : 97 - 100
  • [48] Partial merging of semi-structured knowledgebases
    Bölöni, L
    Turgut, D
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 2, PROCEEDINGS, 2004, 3214 : 1121 - 1127
  • [49] Designing good semi-structured databases
    Lee, SY
    Lee, ML
    Ling, TW
    Kalinichenko, LA
    CONCEPTUAL MODELING - ER'99, 1999, 1728 : 131 - 145
  • [50] Bayesian Semi-structured Subspace Inference
    Dold, Daniel
    Ruegamer, David
    Sick, Beate
    Duerr, Oliver
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238