Extracting Information Networks from the Blogosphere
Cited by: 9
Authors:
Merhav, Yuval [1]
Mesquita, Filipe [2]
Barbosa, Denilson [2]
Yee, Wai Gen [3]
Frieder, Ophir [4]
Affiliations:
[1] IIT, Dept Comp Sci, Informat Retrieval Lab, Chicago, IL 60616 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2M7, Canada
[3] Orbitz Worldwide, Chicago, IL 60661 USA
[4] Georgetown Univ, Washington, DC 20057 USA
Keywords:
Algorithms;
Experimentation;
Performance;
open information extraction;
relation extraction;
named entities;
domain frequency;
clustering;
DOI:
10.1145/2344416.2344418
Chinese Library Classification:
TP [Automation Technology, Computer Technology];
Discipline classification code:
0812;
Abstract:
We study the problem of automatically extracting information networks formed by recognizable entities as well as relations among them from social media sites. Our approach consists of using state-of-the-art natural language processing tools to identify entities and extract sentences that relate such entities, followed by using text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in the task of relation extraction, both when used in conjunction with the standard tf-idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database and that is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset that mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
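The abstract does not define the proposed term-weighting scheme, so the following is only a minimal sketch of the general idea of combining tf-idf with a domain-dependent weight. It assumes, as a hypothetical reading of the "domain frequency" keyword, that a term's weight in a domain (an entity-type pair such as PER-ORG) is the share of its occurrences that fall in that domain. The function names, the domain labels, and the sample data are all illustrative and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def tf_idf(contexts):
    """Plain tf-idf over the context tokens observed for each entity pair."""
    n = len(contexts)
    doc_freq = Counter()
    for tokens in contexts.values():
        doc_freq.update(set(tokens))
    return {
        pair: {t: c * math.log(n / doc_freq[t]) for t, c in Counter(tokens).items()}
        for pair, tokens in contexts.items()
    }


def domain_frequency(contexts, domains):
    """ASSUMED definition: fraction of a term's occurrences seen in each domain."""
    counts = defaultdict(Counter)
    for pair, tokens in contexts.items():
        counts[domains[pair]].update(tokens)
    totals = Counter()
    for c in counts.values():
        totals.update(c)
    return {d: {t: c[t] / totals[t] for t in c} for d, c in counts.items()}


def combined_weights(contexts, domains):
    """Scale each tf-idf weight by the term's frequency share in the pair's domain."""
    base = tf_idf(contexts)
    df = domain_frequency(contexts, domains)
    return {
        pair: {t: w * df[domains[pair]][t] for t, w in vec.items()}
        for pair, vec in base.items()
    }


# Illustrative data only: entity pairs, the words linking them, and their domains.
contexts = {
    ("Gates", "Microsoft"): ["founder", "of"],
    ("Jobs", "Apple"): ["founder", "of", "ceo"],
    ("Obama", "Chicago"): ["born", "in"],
    ("Daley", "Chicago"): ["mayor", "of"],
}
domains = {
    ("Gates", "Microsoft"): "PER-ORG",
    ("Jobs", "Apple"): "PER-ORG",
    ("Obama", "Chicago"): "PER-LOC",
    ("Daley", "Chicago"): "PER-LOC",
}

if __name__ == "__main__":
    for pair, vec in combined_weights(contexts, domains).items():
        print(pair, vec)
```

The resulting weight vectors could then be passed to any text-clustering algorithm to group entity pairs by relation. Using the domain weight as a pruning filter instead would amount to dropping terms whose share in the pair's domain falls below a chosen threshold.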
Pages: 33