MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Cited by: 1
Authors
Brugger, Tobias [1]
Stürmer, Matthias [1, 2]
Niklaus, Joel [1, 2, 3]
Affiliations
[1] Univ Bern, Bern, Switzerland
[2] Bern Univ Appl Sci, Bern, Switzerland
[3] Stanford Univ, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND LAW, ICAIL 2023 | 2023
Keywords
Sentence Boundary Detection; Natural Language Processing; Legal Document Analysis; Text Annotation; Multilingual;
DOI
10.1145/3594536.3595132
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
Pages: 42-51
Number of pages: 10