A study in machine learning from imbalanced data for sentence boundary detection in speech

Cited by: 77
Authors
Liu, Yang
Chawla, Nitesh V.
Harper, Mary R.
Shriberg, Elizabeth
Stolcke, Andreas
Affiliations
[1] Int Comp Sci Inst, Speech Grp, Berkeley, CA 94704 USA
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46530 USA
[3] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47907 USA
[4] SRI Int, Menlo Pk, CA 94025 USA
Source
COMPUTER SPEECH AND LANGUAGE | 2006 / Vol. 20 / Issue 04
Funding
National Science Foundation (USA); National Aeronautics and Space Administration (NASA);
Keywords
DOI
10.1016/j.csl.2005.06.002
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
(c) 2005 Elsevier Ltd. All rights reserved.
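The abstract's idea of combining classifiers trained on different downsampled training sets, and scoring them by AUC rather than error rate, can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' system: the data are synthetic, a one-feature threshold stump stands in for the decision tree prosody model, and the helper names (`downsample`, `train_stump`, `ensemble_score`, `auc`) are hypothetical.

```python
import random

def downsample(data, rng):
    """Balance the classes: keep all positives, draw an equal number of negatives."""
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    return pos + rng.sample(neg, min(len(pos), len(neg)))

def train_stump(data):
    """Hypothetical stand-in for a decision tree: threshold at the midpoint
    of the two class means (assumes positives have larger feature values)."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > threshold else 0.0

def ensemble_score(models, x):
    """Average the votes of all ensemble members to get a score in [0, 1]."""
    return sum(m(x) for m in models) / len(models)

def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
# Synthetic imbalanced data: 50 "boundary" events versus 450 "non-boundary" events.
data = [(rng.gauss(1.0, 1.0), 1) for _ in range(50)]
data += [(rng.gauss(-1.0, 1.0), 0) for _ in range(450)]

# Each ensemble member sees a different random downsampled training set.
models = [train_stump(downsample(data, rng)) for _ in range(11)]
scores = [ensemble_score(models, x) for x, _ in data]
labels = [y for _, y in data]
print("ensemble AUC on the synthetic data:", auc(scores, labels))
```

Because each member trains on only a balanced subset, the individual models are cheap to fit, which is one way to read the abstract's remark about reduced computational effort. The score here is computed on the training data only to keep the sketch short.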
Pages: 468 - 494
Number of pages: 27
Related papers
50 in total
  • [1] Machine Learning for Prediction of Imbalanced Data: Credit Fraud Detection
    Thanh Cong Tran
    Tran Khanh Dang
    PROCEEDINGS OF THE 2021 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS INFORMATION MANAGEMENT AND COMMUNICATION (IMCOM 2021), 2021,
  • [2] Reranking for sentence boundary detection in conversational speech
    Roark, Brian
    Liu, Yang
    Harper, Mary
    Stewart, Robin
    Lease, Matthew
    Snover, Matthew
    Shafran, Izhak
    Dorr, Bonnie
    Hale, John
    Krasnyanskaya, Anna
    Yung, Lisa
    2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 545 - 548
  • [3] Evolutionary Online Machine Learning from Imbalanced Data
    Stein, Anthony
    2016 IEEE 1ST INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W), 2016, : 281 - 286
  • [4] Improved Practices in Machine Learning Algorithms for NTL Detection with Imbalanced Data
    Figueroa, Gerardo
    Chen, Yi-Shin
    Avila, Nelson
    Chu, Chia-Chi
    2017 IEEE POWER & ENERGY SOCIETY GENERAL MEETING, 2017,
  • [5] Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data
    Morillo, Paulina
    Bahamonde, Diego
    Tapia, Wilian
    INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 1, INTELLISYS 2023, 2024, 822 : 496 - 507
  • [6] Using Imbalanced Triangle Synthetic Data for Machine Learning Anomaly Detection
    Luo, Menghua
    Wang, Ke
    Cai, Zhiping
    Liu, Anfeng
    Li, Yangyang
    Cheang, Chak Fong
    CMC-COMPUTERS MATERIALS & CONTINUA, 2019, 58 (01): : 15 - 26
  • [7] Machine learning for mining imbalanced data
    Arafat, Md. Yasir
    Hoque, Sabera
    Xu, Shuxiang
    Farid, Dewan Md
    IAENG International Journal of Computer Science, 2019, 46 (02) : 332 - 348
  • [8] Imbalanced data issues in machine learning classifiers: a case study
    Gong, Mingxing
    JOURNAL OF OPERATIONAL RISK, 2022, 17 (04): : 17 - 36
  • [9] Neural Whispered Speech Detection with Imbalanced Learning
    Ashihara, Takanori
    Shinohara, Yusuke
    Sato, Hiroshi
    Moriya, Takafumi
    Matsui, Kiyoaki
    Fukutomi, Takaaki
    Yamaguchi, Yoshikazu
    Aono, Yushi
    INTERSPEECH 2019, 2019, : 3352 - 3356
  • [10] Smartwatch-Based Eating Detection: Data Selection for Machine Learning from Imbalanced Data with Imperfect Labels
    Stankoski, Simon
    Jordan, Marko
    Gjoreski, Hristijan
    Lustrek, Mitja
    SENSORS, 2021, 21 (05) : 1 - 25