Studying the history of the Arabic language: language technology and a large-scale historical corpus

Cited by: 11
Authors
Belinkov, Yonatan [1 ,2 ]
Magidow, Alexander [3 ]
Barron-Cedeno, Alberto [4 ]
Shmidman, Avi [5 ,6 ]
Romanov, Maxim [7 ]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[2] Harvard John A Paulson Sch Engn & Appl Sci, Cambridge, MA 02138 USA
[3] Univ Rhode Isl, Dept Modern & Class Languages & Literatures, Kingston, RI 02881 USA
[4] HBKU, Qatar Comp Res Inst, Doha, Qatar
[5] Bar Ilan Univ, Dept Hebrew Literature, IL-5290002 Ramat Gan, Israel
[6] Dicta Israel Ctr Text Anal, Ve Olamo 8, IL-9546306 Jerusalem, Israel
[7] Univ Vienna, Dept Hist, Vienna, Austria
Funding
Israel Science Foundation;
Keywords
Arabic; Corpus; Periodization; Text reuse; Historical linguistics;
DOI
10.1007/s10579-019-09460-w
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.
Pages: 771-805
Page count: 35