DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT

Cited by: 1
Authors
Dlugolinsky, Stefan [1]
Seleng, Martin [1]
Laclavik, Michal [1]
Hluchy, Ladislav [1]
Affiliations
[1] Slovak Acad Sci, Inst Informat, Bratislava, Slovakia
Source
COMPUTER SCIENCE-AGH | 2012 / Vol. 13 / No. 4
Keywords
distributed web crawling; information extraction; information retrieval; semantic search; geocoding; spatial search
DOI
10.7494/csci.2012.13.4.5
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases: the first is search in job offers retrieved from the LinkedIn portal, and the second is search in BBC news feeds. We also discuss several problems we had to face during the implementation, as well as spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
Pages: 5 - 19
Page count: 15
Related papers
50 in total
  • [21] Automatic Web Image Annotation via Web-Scale Image Semantic Space Learning
    Xu, Hongtao
    Zhou, Xiangdong
    Lin, Lan
    Xiang, Yu
    Shi, Baile
    ADVANCES IN DATA AND WEB MANAGEMENT, PROCEEDINGS, 2009, 5446 : 211 - +
  • [22] Indexing scheme for keyword search over semantic web documents
    Kim, YounHee
    Shin, HyeYeon
    Chong, KyunRak
    Lim, HaeChull
    9TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY: TOWARD NETWORK INNOVATION BEYOND EVOLUTION, VOLS 1-3, 2007, : 1205 - +
  • [23] Modeling Search Assistance Mechanisms within Web-Scale Discovery Systems
    Mischo, William H.
    Schlembach, Mary C.
    Norman, Michael A.
    JCDL'13: PROCEEDINGS OF THE 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, 2013, : 407 - 408
  • [24] Search Query Quality and Web-Scale Discovery: A Qualitative and Quantitative Analysis
    Meadow, Kelly
    Meadow, James
    COLLEGE & UNDERGRADUATE LIBRARIES, 2012, 19 (2-4) : 163 - 175
  • [25] Using Web-Scale Graph Analytics to Counter Technical Support Scams
    Larson, Jonathan
    Tower, Bryan
    Hadfield, Duane
    Edge, Darren
    White, Christopher
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 3968 - 3971
  • [26] Large Scale Semantic Annotation, Indexing, and Search at The National Archives
    Maynard, Diana
    Greenwood, Mark A.
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 3487 - 3494
  • [27] Vantage Point Latent Semantic Indexing for multimedia web document search
    D. Srikanth
    S. Sakthivel
    Cluster Computing, 2019, 22 : 10587 - 10594
  • [28] Using Inverted Indexing to Semantic WEB Service Discovery Search Model
    Zhou, Bo
    Huang, Tinglei
    Liu, Jie
    Shen, Meizhou
    2009 5TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-8, 2009, : 4872 - 4875
  • [29] Web Page Indexing through Page Ranking for Effective Semantic Search
    Sharma, Robin
    Kandpal, Ankita
    Bhakuni, Priyanka
    Chauhan, Rashmi
    Goudar, R. H.
    Tyagi, Asit
    7TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND CONTROL (ISCO 2013), 2013, : 389 - 392
  • [30] Vantage Point Latent Semantic Indexing for multimedia web document search
    Srikanth, D.
    Sakthivel, S.
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 5) : 10587 - 10594