MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Cited by: 1
Authors
Brugger, Tobias [1]
Sturmer, Matthias [1, 2]
Niklaus, Joel [1, 2, 3]
Affiliations
[1] Univ Bern, Bern, Switzerland
[2] Bern Univ Appl Sci, Bern, Switzerland
[3] Stanford Univ, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND LAW, ICAIL 2023 | 2023
Keywords
Sentence Boundary Detection; Natural Language Processing; Legal Document Analysis; Text Annotation; Multilingual
DOI
10.1145/3594536.3595132
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
Pages: 42-51
Page count: 10