PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Cited by: 0
Authors
Urlana, Ashok [1]
Chen, Pinzhen [2]
Zhao, Zheng [2]
Cohen, Shay B. [2]
Shrivastava, Manish [1]
Haddow, Barry [2]
Affiliations
[1] IIIT Hyderabad, Hyderabad, India
[2] Univ Edinburgh, Edinburgh, Midlothian, Scotland
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023
Funding
UK Research and Innovation (UKRI);
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground covering four language families, 14 languages, and 196 language pairs, the largest number to date. We detail our construction workflow, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization using fine-tuning, prompting, and translate-and-summarize. Experimental results confirm the crucial role of our data in aiding summarization between Indian languages. Our dataset is publicly available and can be freely modified and re-distributed.
Pages: 11606-11628
Number of pages: 23