A study in machine learning from imbalanced data for sentence boundary detection in speech

Cited by: 77
Authors
Liu, Yang
Chawla, Nitesh V.
Harper, Mary R.
Shriberg, Elizabeth
Stolcke, Andreas
Affiliations
[1] Int Comp Sci Inst, Speech Grp, Berkeley, CA 94704 USA
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46530 USA
[3] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47907 USA
[4] SRI Int, Menlo Pk, CA 94025 USA
Source
COMPUTER SPEECH AND LANGUAGE | 2006 / Volume 20 / Issue 04
Funding
U.S. National Science Foundation; U.S. National Aeronautics and Space Administration
DOI
10.1016/j.csl.2005.06.002
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
(c) 2005 Elsevier Ltd. All rights reserved.
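The abstract's idea of an ensemble of classifiers trained on different downsampled training sets, scored by AUC, can be illustrated with a minimal sketch. This is not the authors' system: the single "pause duration" feature, the synthetic data, the class sizes, and all function names are assumptions made for illustration, and a one-threshold stump stands in for the paper's decision tree prosody model.

```python
import random

def make_toy_data(seed=0, n_boundary=100, n_nonboundary=900):
    """Synthetic, hypothetical pause durations: boundaries (label 1) tend to be longer."""
    rng = random.Random(seed)
    features = [rng.gauss(0.6, 0.2) for _ in range(n_boundary)]
    features += [rng.gauss(0.2, 0.15) for _ in range(n_nonboundary)]
    labels = [1] * n_boundary + [0] * n_nonboundary
    return features, labels

def downsample(features, labels, rng):
    """Randomly drop majority-class examples until both classes are the same size."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [features[i] for i in keep], [labels[i] for i in keep]

def train_stump(features, labels):
    """Fit a one-feature threshold classifier (a depth-1 stand-in for a decision tree)."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(features)):
        # Predict "boundary" when the feature is at or above the threshold.
        errors = sum((x >= threshold) != (y == 1) for x, y in zip(features, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_ensemble(features, labels, n_models=10, seed=0):
    """Train one stump per independently downsampled copy of the training set."""
    rng = random.Random(seed)
    return [train_stump(*downsample(features, labels, rng)) for _ in range(n_models)]

def ensemble_score(thresholds, x):
    """Fraction of stumps voting 'boundary'; a rough posterior estimate."""
    return sum(x >= t for t in thresholds) / len(thresholds)

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    feats, labs = make_toy_data()
    stumps = train_ensemble(feats, labs)
    scores = [ensemble_score(stumps, x) for x in feats]
    print("ensemble AUC on toy data:", round(auc(scores, labs), 3))
```

Averaging votes across the downsampled models gives a graded score rather than a hard decision, which is what makes a ranking measure such as AUC applicable. Each model also sees only a balanced subset, which is one plausible reading of the reduced computational effort the abstract mentions.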
Pages: 468 - 494
Page count: 27