DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT

Cited by: 1
Authors
Dlugolinsky, Stefan [1]
Seleng, Martin [1]
Laclavik, Michal [1]
Hluchy, Ladislav [1]
Affiliation
[1] Slovak Acad Sci, Inst Informat, Bratislava, Slovakia
Source
COMPUTER SCIENCE-AGH | 2012 / Volume 13 / Issue 04
Keywords
distributed web crawling; information extraction; information retrieval; semantic search; geocoding; spatial search
DOI
10.7494/csci.2012.13.4.5
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing, and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, extracting information, lightweight semantic annotation of the extracted information, indexing the extracted information, and finally indexing documents based on the geo-spatial information found in them. We demonstrate the architecture on two use cases: the first is search in job offers retrieved from the LinkedIn portal, and the second is search in BBC news feeds. We discuss several problems we faced during the implementation. We also discuss spatial search applications for both use cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
Pages: 5-19
Number of pages: 15