MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Cited by: 1
Authors
Brugger, Tobias [1]
Sturmer, Matthias [1, 2]
Niklaus, Joel [1, 2, 3]
Affiliations
[1] Univ Bern, Bern, Switzerland
[2] Bern Univ Appl Sci, Bern, Switzerland
[3] Stanford Univ, Stanford, CA 94305 USA
Keywords
Sentence Boundary Detection; Natural Language Processing; Legal Document Analysis; Text Annotation; Multilingual;
DOI
10.1145/3594536.3595132
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
Pages: 42-51
Number of pages: 10