DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT

Cited by: 1
Authors
Dlugolinsky, Stefan [1]
Seleng, Martin [1]
Laclavik, Michal [1]
Hluchy, Ladislav [1]
Affiliation
[1] Slovak Acad Sci, Inst Informat, Bratislava, Slovakia
Source
COMPUTER SCIENCE-AGH | 2012 / Vol. 13 / No. 04
Keywords
distributed web crawling; information extraction; information retrieval; semantic search; geocoding; spatial search
DOI
10.7494/csci.2012.13.4.5
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, extracting information, annotating the extracted information with lightweight semantics, indexing the extracted information and, finally, indexing documents by the geo-spatial information found in them. We demonstrate the architecture on two use cases: search in job offers retrieved from the LinkedIn portal, and search in BBC news feeds. We discuss several problems we had to face during the implementation. We also discuss spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
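As a rough illustration of the geo-spatial indexing step the abstract mentions, the following minimal sketch runs a map step, a grouping step and a reduce step in a single process. It is not the authors' implementation: the gazetteer, its coordinate values, the sample documents and the function names `map_document`, `reduce_place` and `build_geo_index` are all assumptions made up for this example, and a real deployment would run the same two functions on a MapReduce framework rather than in one loop.

```python
from collections import defaultdict

# Hypothetical gazetteer: place name -> (latitude, longitude).
# Values are illustrative only and are not taken from the paper.
GAZETTEER = {
    "London": (51.5074, -0.1278),
    "Bratislava": (48.1486, 17.1077),
}


def map_document(doc_id, text):
    """Map step: emit (place, doc_id) for each gazetteer place found in the text."""
    for place in GAZETTEER:
        if place in text:
            yield place, doc_id


def reduce_place(place, doc_ids):
    """Reduce step: merge all documents that mention a place into one index entry."""
    return place, {"coords": GAZETTEER[place], "docs": sorted(set(doc_ids))}


def build_geo_index(documents):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for place, emitted_id in map_document(doc_id, text):
            grouped[place].append(emitted_id)
    return dict(reduce_place(place, ids) for place, ids in grouped.items())


# Made-up sample documents standing in for crawled job offers and news items.
documents = {
    "job-1": "Software engineer wanted in London",
    "news-1": "Conference opens in Bratislava",
    "news-2": "London hosts a technology summit",
}
index = build_geo_index(documents)
```

Keying the reduce step on the place name means every document mentioning the same location ends up in one index entry, which is the property a spatial search over the indexed documents would rely on.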
Pages: 5-19
Page count: 15