A study in machine learning from imbalanced data for sentence boundary detection in speech

Cited by: 77
Authors
Liu, Yang
Chawla, Nitesh V.
Harper, Mary R.
Shriberg, Elizabeth
Stolcke, Andreas
Affiliations
[1] Int Comp Sci Inst, Speech Grp, Berkeley, CA 94704 USA
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46530 USA
[3] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47907 USA
[4] SRI Int, Menlo Pk, CA 94025 USA
Source
COMPUTER SPEECH AND LANGUAGE | 2006, Vol. 20, No. 4
Funding
National Science Foundation (USA); National Aeronautics and Space Administration (NASA)
DOI
10.1016/j.csl.2005.06.002
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
(c) 2005 Elsevier Ltd. All rights reserved.
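The abstract's two main ingredients, an ensemble of classifiers trained on different downsampled training sets and evaluation by AUC rather than error rate, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the single scalar feature (imagined here as a pause duration), the one-split "stump" standing in for the full decision tree prosody model, and all function names.

```python
import random


def downsample(examples, rng):
    """Randomly drop majority-class examples until both classes are equal in size.

    `examples` is a list of (feature, label) pairs with label 1 for a sentence
    boundary and 0 for a non-boundary (hypothetical encoding).
    """
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def train_stump(examples):
    """Fit a one-split threshold classifier (a stand-in for a decision tree).

    Returns a function mapping a feature value to the fraction of boundary
    examples in the leaf that the value falls into.
    """
    best = None
    for t in sorted({x for x, _ in examples}):
        left = [y for x, y in examples if x < t]
        right = [y for x, y in examples if x >= t]
        if not left or not right:
            continue  # a split must leave examples on both sides
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right)
        # Squared error of the leaf estimates, used to pick the split.
        err = sum((y - p_left) ** 2 for y in left)
        err += sum((y - p_right) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, p_left, p_right)
    if best is None:  # no valid split: fall back to the class prior
        prior = sum(y for _, y in examples) / len(examples)
        return lambda x: prior
    _, t, p_left, p_right = best
    return lambda x: p_right if x >= t else p_left


def train_downsampled_ensemble(examples, n_models=5, seed=0):
    """Train one stump per downsampled set and average their scores."""
    rng = random.Random(seed)
    models = [train_stump(downsample(examples, rng)) for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / len(models)


def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive outscores a randomly chosen negative, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data: 18 non-boundaries with short pauses, 2 boundaries.
toy = [(i * 0.05, 0) for i in range(18)] + [(1.0, 1), (1.2, 1)]
ensemble = train_downsampled_ensemble(toy)
toy_auc = auc([ensemble(x) for x, _ in toy], [y for _, y in toy])
```

Because AUC depends only on how the scores rank boundaries against non-boundaries, it is insensitive to the shift in class priors that downsampling introduces, whereas error rate at a fixed threshold is not. That is one plausible reading of why the two measures can rank the sampling methods differently.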
Pages: 468-494
Number of pages: 27