Clustering and searching WWW images using link and page layout analysis

Cited by: 31
Authors
He, Xiaofei
Cai, Deng
Wen, Ji-Rong
Ma, Wei-Ying
Zhang, Hong-Jiang
Affiliations
[1] Yahoo Res Labs, Burbank, CA 91504 USA
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[3] Microsoft Res Asia, Beijing, Peoples R China
Keywords
algorithms; management; performance; experimentation; web mining; image search; image clustering; link analysis
DOI
10.1145/1230812.1230816
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Due to the rapid growth in the number of digital images on the Web, there is an increasing demand for an effective and efficient method for organizing and retrieving the available images. This article describes iFind, a system for clustering and searching WWW images. Using a vision-based page segmentation algorithm, a Web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. The textual information is used for image indexing. By extracting the page-to-block, block-to-image, and block-to-page relationships through link structure and page layout analysis, we construct an image graph. Our method is less sensitive to noisy links than previous methods such as PageRank, HITS, and PicASHOW, and hence the image graph can better reflect the semantic relationships between images. Using the notion of a Markov chain, we can compute the limiting probability distribution of the images, the ImageRanks, which characterize the importance of the images. The ImageRanks are combined with the relevance scores to produce the final ranking for image search. With the graph models, we can also use techniques from spectral graph theory for image clustering and embedding, or 2-D visualization. Experimental results on 11.6 million images downloaded from the Web are provided in the article.
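
The ranking idea in the abstract can be illustrated with a small sketch. This is not the iFind implementation: the toy matrices, the way the three relationships are composed into an image graph, the damping factor, and the linear mix with relevance scores are all assumptions made for illustration, and every function name below is hypothetical.

```python
# Illustrative sketch only; matrices, composition, damping and the
# linear score mix are assumptions, not details taken from the abstract.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def row_normalize(m):
    """Turn each row into a probability distribution.
    Rows with no outgoing weight become uniform."""
    n = len(m[0])
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    return out

def image_rank(w_image, damping=0.85, iters=200, tol=1e-12):
    """Approximate the limiting distribution of a random walk on the
    image graph by power iteration. The damping (teleport) term is an
    assumption that keeps the chain well behaved on this toy graph."""
    p = row_normalize(w_image)
    n = len(p)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - damping) / n
               + damping * sum(rank[i] * p[i][j] for i in range(n))
               for j in range(n)]
        done = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if done:
            break
    return rank

def combine_scores(relevance, ranks, alpha=0.5):
    """Hypothetical linear mix of text relevance and ImageRank."""
    return [alpha * r + (1 - alpha) * k for r, k in zip(relevance, ranks)]

# Toy data: 2 pages, 3 blocks, 3 images.
page_to_block = [[0.5, 0.5, 0.0],   # page 0 contains blocks 0 and 1
                 [0.0, 0.0, 1.0]]   # page 1 contains block 2
block_to_page = [[0.0, 1.0],        # block 0 links to page 1
                 [0.0, 1.0],        # block 1 links to page 1
                 [1.0, 0.0]]        # block 2 links to page 0
block_to_image = [[1.0, 0.0, 0.0],  # block i contains image i
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]]

# Assumed composition: block graph from block->page->block,
# then image graph from image->block->block->image.
block_graph = matmul(block_to_page, page_to_block)
image_graph = matmul(matmul(transpose(block_to_image), block_graph),
                     block_to_image)

ranks = image_rank(image_graph)
final = combine_scores([0.9, 0.2, 0.4], ranks)  # relevance scores are made up
```

In this toy graph, image 2 sits in the only block that both other blocks reach, so it receives the largest rank, while images 0 and 1 are symmetric and receive equal ranks.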
Pages: 25