Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

Authors
Rasooli, Mohammad Sadegh [1]
Kashefi, Omid [1]
Minaei-Bidgoli, Behrouz [1]
Affiliations
[1] Iran Univ Sci & Technol, Dept Comp Engn, Tehran, Iran
Keywords
Sentence Alignment; Paragraph Alignment; Parallel Corpus; Bilingual Corpus; Persian; English; Machine Translation; ALIGNMENT;
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The task of sentence and paragraph alignment is essential for preparing parallel texts that are needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian is a challenging issue. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents based on simple linguistic features as well as length similarity between sentences and paragraphs of the source and target languages. We use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
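The abstract does not give the feature definitions or the genetic algorithm's settings, so the following is only a minimal sketch of the general idea: three similarity features combined in a linear score, with the weights searched by a small genetic algorithm. Every name here (`length_similarity`, `punctuation_similarity`, `dictionary_overlap`, `evolve`), the ratio-based feature definitions, the fitness function, and the selection, crossover, and mutation choices are assumptions made for illustration, not the authors' method.

```python
import random

MARKS = ".,;:!?،؛؟"  # assumed punctuation set (Latin and Persian marks)

def length_similarity(src, tgt):
    """Ratio of the shorter to the longer character length (1.0 = equal)."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def punctuation_similarity(src, tgt):
    """Ratio of the smaller to the larger punctuation-mark count."""
    a = sum(src.count(m) for m in MARKS)
    b = sum(tgt.count(m) for m in MARKS)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def dictionary_overlap(src, tgt, lexicon):
    """Fraction of source words found in the lexicon whose translation occurs in tgt.

    `lexicon` is a hypothetical dict mapping an English word to a set of Persian words.
    """
    tgt_words = {w.strip(MARKS) for w in tgt.split()}
    known = [w for w in (t.strip(MARKS) for t in src.lower().split()) if w in lexicon]
    if not known:
        return 0.0
    return sum(1 for w in known if lexicon[w] & tgt_words) / len(known)

def score(weights, feats):
    """Linear model: weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, feats))

def fitness(weights, examples):
    """Number of examples in which the gold candidate gets the highest score."""
    correct = 0
    for cand_feats, gold in examples:
        scores = [score(weights, f) for f in cand_feats]
        if scores.index(max(scores)) == gold:
            correct += 1
    return correct

def evolve(examples, n_weights=3, pop_size=20, generations=30, seed=0):
    """Search for weights with truncation selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, examples), reverse=True)
        parents = pop[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_weights)
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.1))  # mutate one weight
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, examples))

# Toy training data: each example is (feature vectors of the candidates, gold index).
# In the first example, length alone prefers the wrong candidate.
examples = [
    ([[1.0, 1.0, 0.0], [0.8, 1.0, 1.0]], 1),
    ([[0.9, 0.5, 0.0], [0.9, 0.5, 0.5]], 1),
]
best = evolve(examples)
print(best, fitness(best, examples))
```

In the toy data a length-only scorer would pick the wrong candidate in the first example, which is the kind of case where a dictionary feature can help a length-based baseline.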
Pages: 574 - 583
Page count: 10
Related papers
26 in total
  • [1] Extracting an English-Persian Parallel Corpus from Comparable Corpora
    Karimi, Akbar
    Ansari, Ebrahim
    Bigham, Bahram Sadeghi
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3477 - 3482
  • [2] TEP: Tehran English-Persian Parallel Corpus
    Pilevar, Mohammad Taher
    Faili, Heshaam
    Pilevar, Abdol Hamid
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT II, 2011, 6609 : 68 - +
  • [3] TPC: An Automatically Generated Comprehensive English-Persian Parallel Corpus
    Farzi, Saeed
    Faili, Heshaam
    2017 5TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL AND BUSINESS INTELLIGENCE (ISCBI), 2017, : 91 - 95
  • [4] Constructing a Large-Scale English-Persian Parallel Corpus
    Miangah, Tayebeh Mosavi
    META, 2009, 54 (01) : 181 - 188
  • [5] Shihab al-Din Yahya ibn Habash Suhrawardi, The 'Book of Radiance'. A parallel English-Persian text
    Khismatulin, AA
    ISLAM-ZEITSCHRIFT FUR GESCHICHTE UND KULTUR DES ISLAMISCHEN ORIENTS, 2000, 77 (02) : 350 - 351
  • [6] Extracting entity relationship diagram (ERD) from english sentences
    Al-Btoush, Amani Abdel-Salam
    International Journal of Database Theory and Application, 2015, 8 (02) : 235 - 244
  • [7] An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora
    Cuong Hoang
    Anh-Cuong Le
    Phuong-Thai Nguyen
    Son Bao Pham
    Tu Bao Ho
    FUNDAMENTA INFORMATICAE, 2014, 130 (02) : 179 - 199
  • [8] Extracting Parallel Sentences from Nonparallel Corpora Using Parallel Hierarchical Attention Network
    Zhu, Shaolin
    Yang, Yong
    Xu, Chun
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2020, 2020
  • [9] EXTRACTING KNOWLEDGE FROM ENGLISH TRANSLATED QURAN USING NLP PATTERN
    Ismail, Rohana
    Abu Bakar, Zainab
    Abd Rahman, Nurazzah
    JURNAL TEKNOLOGI-SCIENCES & ENGINEERING, 2015, 77 (19) : 67 - 73
  • [10] Extraction of Indonesian and English Parallel Sentences from Movie Subtitles
    Yeo, Boon Hong
    Aw, Ai Ti
    Wang, Xuancong
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 298 - 301