Language independent unsupervised learning of short message service dialect

Cited by: 2
Authors
Acharyya, Sreangsu [2]
Negi, Sumit [1]
Subramaniam, L. Venkata [1]
Roy, Shourya [3]
Affiliations
[1] IBM Res, New Delhi, India
[2] Univ Texas Austin, Dept Elect Engn, Austin, TX 78712 USA
[3] Xerox India Innovat Hub, Madras, Tamil Nadu, India
Keywords
Noisy text; Unsupervised learning; Clustering;
DOI
10.1007/s10032-009-0093-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, non-standard transliteration, etc., poses considerable problems for text mining. Such corruptions are very common in instant messenger and short message service data, and they adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods that rely on hand-labeled corrections. However, such labels and corrections are very expensive and time consuming to obtain because of the multilinguality and complexity of the corruptions. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate a parallel labeled corpus. A generative-model-based unsupervised technique is presented that maps non-standard words to their corresponding conventional frequent form. A hidden Markov model (HMM) over a "subsequencized" representation of words is used, where a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference algorithm is such that the training phase involves clustering over vectors rather than the customary and expensive dynamic programming (Baum-Welch algorithm) over sequences that HMMs require. A principled transformation of the maximum-likelihood-based "central clustering" cost function of Baum-Welch into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods that model delete and insert corruptions well.
The novelty of this approach lies in avoiding the expensive (Baum-Welch) iterations required for HMMs through an approximation of the log-likelihood function and by establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
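The abstract's idea of representing a word as a bag of weighted subsequences and comparing words pairwise can be illustrated with a small sketch. This is not the authors' implementation: it uses a generic gap-weighted subsequence feature map, and the function names `subsequence_bag` and `similarity`, the subsequence length `k`, and the `decay` factor are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def subsequence_bag(word, k=2, decay=0.5):
    """Map a word to a bag of weighted length-k subsequences.

    Every (possibly non-contiguous) subsequence is weighted by
    decay ** span, where span is the stretch of the word it covers,
    so gappy occurrences count less than contiguous ones.
    Illustrative feature map only; k and decay are assumed values.
    """
    bag = defaultdict(float)
    for idx in combinations(range(len(word)), k):
        span = idx[-1] - idx[0] + 1
        bag["".join(word[i] for i in idx)] += decay ** span
    return dict(bag)


def similarity(a, b, k=2, decay=0.5):
    """Cosine-normalized dot product of two subsequence bags."""
    fa = subsequence_bag(a, k, decay)
    fb = subsequence_bag(b, k, decay)
    dot = sum(w * fb.get(u, 0.0) for u, w in fa.items())
    norm_a = sqrt(sum(w * w for w in fa.values()))
    norm_b = sqrt(sum(w * w for w in fb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a word shorter than k has an empty bag
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Compare an SMS-style abbreviation with its standard form
    # and with an unrelated word.
    for other in ("tmrw", "today"):
        print(other, similarity("tomorrow", other))
```

Because subsequences may skip characters, a form with deleted letters (such as "tmrw" for "tomorrow") still shares features with the full word, which is the kind of delete and insert corruption the abstract says subsequence kernels model well. A pairwise similarity matrix built this way could then be fed to any similarity-based clustering routine.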
Pages: 175-184
Page count: 10
Published in: International Journal on Document Analysis and Recognition (IJDAR), 2009, 12: 175-184