Automatic word spacing using probabilistic models based on character n-grams

Cited by: 13
Authors
Lee, Do-Gil [1]
Rim, Hae-Chang [1]
Yook, Dongsuk [1]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
Keywords
Probabilistic logics;
DOI
10.1109/MIS.2007.4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Probabilistic models based on hidden Markov models (HMMs) for automatic word spacing are discussed. The models use character n-grams, where a character n-gram is a subsequence of n characters in a given character sequence. Automatic word spacing is a preprocessing technique used for correcting boundaries between words in a sentence containing spacing errors. These models can be effectively applied to a natural language with a small character set, such as English, using character n-grams that are larger than trigrams. The models are language independent: they can be used effectively for languages that have word spacing, and also for word segmentation in languages without explicit word spacing. By generalizing HMMs, the models can consider a broad context and estimate accurate probabilities.
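To make the abstract's terms concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `char_ngrams` and `spacing_tags` are hypothetical, and the 0/1 tag scheme (1 means a space follows the character) is an assumption made for illustration; the abstract does not specify how spacing is encoded.

```python
def char_ngrams(text, n):
    """Return every contiguous substring of length n in text, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def spacing_tags(sentence):
    """Split a spaced sentence into characters and per-character tags.

    Assumed encoding (not taken from the abstract): tag 1 means a space
    follows the character, tag 0 means no space follows it.
    """
    chars, tags = [], []
    for ch in sentence.strip():
        if ch == " ":
            if tags:
                tags[-1] = 1  # mark the previous character as followed by a space
        else:
            chars.append(ch)
            tags.append(0)
    return chars, tags


print(char_ngrams("spacing", 3))  # ['spa', 'pac', 'aci', 'cin', 'ing']
print(spacing_tags("a cat"))      # (['a', 'c', 'a', 't'], [1, 0, 0, 0])
```

Under this assumed encoding, word spacing becomes a sequence labeling task: a model sees the characters without spaces and predicts the tag for each one, using the surrounding character n-grams as context.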
Pages: 28-35
Number of pages: 8