Characteristic sets of strings common to semi-structured documents

Cited by: 0

Authors:

Ikeda, D [1]

Affiliations:

[1] Kyushu Univ, Ctr Comp, Fukuoka 8128581, Japan

Source:

DISCOVERY SCIENCE, PROCEEDINGS | 1999 / Vol. 1721

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_(i+1) and all of the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", "<title> characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those made only of HTML tags, since such characteristic sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
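
The definition in the abstract can be illustrated with a minimal Python sketch. It is not the paper's suffix-tree algorithm and does not achieve the stated time bound; it is a brute-force check written only to make the definition concrete. The function names (`is_suffix_chain`, `occurrences`, `matches`, `agreement`) are hypothetical, and two points are assumptions not stated in the abstract: the occurrences may appear in any order in the document, and "agreement" is counted as matched positives plus rejected negatives.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open interval [start, end)


def is_suffix_chain(xs: Sequence[str]) -> bool:
    """True if each string is a suffix of the next one in the sequence."""
    return all(xs[i + 1].endswith(xs[i]) for i in range(len(xs) - 1))


def occurrences(doc: str, s: str) -> List[Span]:
    """All spans where s occurs in doc (occurrences of s may overlap)."""
    spans = []
    start = doc.find(s)
    while start != -1:
        spans.append((start, start + len(s)))
        start = doc.find(s, start + 1)
    return spans


def matches(xs: Sequence[str], doc: str) -> bool:
    """True if every string in xs occurs in doc at pairwise disjoint spans.

    Assumption: occurrence order in the document is unconstrained.
    """
    if not xs or any(not x for x in xs) or not is_suffix_chain(xs):
        return False
    candidates = [occurrences(doc, x) for x in xs]

    def place(i: int, used: List[Span]) -> bool:
        # Backtracking: pick one occurrence for string i that avoids all used spans.
        if i == len(candidates):
            return True
        for start, end in candidates[i]:
            if all(end <= u_start or u_end <= start for u_start, u_end in used):
                if place(i + 1, used + [(start, end)]):
                    return True
        return False

    return place(0, [])


def agreement(xs: Sequence[str], positives: Sequence[str], negatives: Sequence[str]) -> int:
    """Assumed score: positives matched plus negatives rejected."""
    hits = sum(matches(xs, d) for d in positives)
    rejects = sum(not matches(xs, d) for d in negatives)
    return hits + rejects


if __name__ == "__main__":
    xs = ("set", "characteristic set", "<title> characteristic set")
    pos = "<title> characteristic set</title> a characteristic set is a set"
    neg = "<title> characteristic set</title>"
    print(matches(xs, pos), matches(xs, neg))
    print(agreement(xs, [pos], [neg, "no tags here"]))
```

In the negative example every string occurs in the document, but only inside the single title occurrence, so no non-overlapping placement exists and the set is rejected.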


Pages: 139-147

Page count: 9
