HTML text segmentation for Web page summarization by a key sentence extraction method

Cited by: 0
Authors
Sunayama, Wataru [1, 3]
Iyama, Akihiro [2, 4]
Yachida, Masahiko [2, 5, 6, 7]
Affiliations
[1] Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
[2] Graduate School of Engineering Science, Osaka University, Toyonaka, 560-8531, Japan
[3] Department of Information Sciences, Hiroshima City University
[4] TIS, Inc.
[5] Graduate School of Engineering Science
[6] IPSJ
[7] RSJ
Source
Systems and Computers in Japan, 2006, Vol. 37, No. 7
Abstract
The information that search engines display in their results is important for quickly finding the desired information. In particular, the summary of each Web page in the search results is important for determining the Web page content, as well as for determining how the input search term is used in each Web page, namely, the relation between the search term and the Web page. However, the summaries in conventional search engines have problems: they may extract only the opening text and omit the search term, or they may contain the search term but truncate the sentence in the middle, so that neither the context of the term nor the content of the Web page can be determined. A summary in sentence units is therefore desirable, but HTML text includes many nonsentence items that do not contain punctuation; if these are left unprocessed, it is difficult for a key sentence extraction system that treats sentences as units to provide a summary. In this paper, we propose an HTML text segmentation system that divides the source text of each Web page into meaningfully connected groups of text corresponding to sentences. We also verify experimentally that the text generated by this system can be used effectively in Web page summarization. © 2006 Wiley Periodicals, Inc.
DOI
Not available
Document type
Journal article (JA)
Pages: 26-36
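
The abstract describes dividing HTML source text into sentence-like groups, but this record does not give the segmentation rules. The following is therefore only a rough illustration of the general idea, not the authors' system. It assumes that block-level tags end a run of connected text and that sentence-final punctuation ends a sentence, so that unpunctuated items such as headings and menu entries become units of their own. The names `BLOCK_TAGS`, `SKIP_TAGS`, `SegmentCollector`, `segment_html`, and `units_containing` are hypothetical and chosen for this sketch.

```python
import re
from html.parser import HTMLParser

# Assumption for this sketch: these tags end a run of connected text.
BLOCK_TAGS = {"p", "div", "br", "hr", "li", "ul", "ol", "table", "tr", "td",
              "th", "title", "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption for this sketch: text inside these tags is never shown to readers.
SKIP_TAGS = {"script", "style"}


class SegmentCollector(HTMLParser):
    """Collect visible text runs separated by block-level tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty runs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def segment_html(html):
    """Return sentence-like units extracted from an HTML string."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    units = []
    for segment in collector.segments:
        # Split after ., ! or ?; segments without punctuation stay whole.
        for piece in re.split(r"(?<=[.!?])\s+", segment):
            if piece:
                units.append(piece)
    return units


def units_containing(units, term):
    """Return the units that mention the search term (case-insensitive)."""
    term = term.lower()
    return [unit for unit in units if term in unit.lower()]


if __name__ == "__main__":
    page = ("<html><head><title>Shop</title><style>p{}</style></head><body>"
            "<h1>News</h1><ul><li>Home</li><li>About us</li></ul>"
            "<p>We sell <b>books</b>. Orders ship fast! Contact us</p>"
            "</body></html>")
    for unit in segment_html(page):
        print(unit)
```

Selecting whole units that contain the search term, as `units_containing` does, avoids the mid-sentence truncation the abstract criticizes. A real system would need rules for many more tags and for languages whose sentence punctuation differs from the pattern assumed here.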