A French Human Reference Corpus for Multi-Document Summarization and Sentence Compression

Citations: 0
Authors
de Loupy, Claude [1]
Guegan, Marie [1]
Ayache, Christelle [1]
Seng, Somara [1]
Torres-Moreno, Juan-Manuel [2, 3]
Affiliations
[1] Syllabs, F-75013 Paris, France
[2] Lab Informat Avignon UAPV, F-84911 Avignon, France
[3] Ecole Polytech, Montreal, PQ H3C 3A7, Canada
DOI
Not available
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
This paper presents two French corpora produced within the RPM2 project: a multi-document summarization corpus and a sentence compression corpus. To our knowledge, the first is the only corpus of its kind in French. It contains 20 topics with 20 documents each. For each topic, a first set of 10 documents is summarized, and the second set is then used to produce an update summary (new information). Four annotators were involved and produced a total of 160 abstracts. The second corpus contains all the sentences of the first: four annotators were asked to compress its 8432 sentences. To our knowledge, this is the largest corpus of compressed sentences in any language. The paper reports figures for comparing the annotators: compression rates, number of tokens per sentence, percentage of tokens kept by part of speech (POS), position of dropped tokens in the sentence compression phase, etc. These figures show substantial differences from one annotator to another. Another finding is that the compression strategies used differ according to sentence length.
Pages: 3113-3118
Number of pages: 6