A study in machine learning from imbalanced data for sentence boundary detection in speech

Cited by: 77
Authors
Liu, Yang
Chawla, Nitesh V.
Harper, Mary R.
Shriberg, Elizabeth
Stolcke, Andreas
Affiliations
[1] Int Comp Sci Inst, Speech Grp, Berkeley, CA 94704 USA
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46530 USA
[3] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47907 USA
[4] SRI Int, Menlo Pk, CA 94025 USA
Source
COMPUTER SPEECH AND LANGUAGE | 2006 / Vol. 20 / No. 4
Funding
National Science Foundation (USA); National Aeronautics and Space Administration (NASA);
Keywords
DOI
10.1016/j.csl.2005.06.002
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
(c) 2005 Elsevier Ltd. All rights reserved.
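
The abstract mentions an ensemble of classifiers trained on different downsampled training sets and evaluation by AUC. The sketch below illustrates that general idea only; it is not the authors' system. It assumes a single hypothetical prosodic feature (for example, pause duration), uses a depth-1 decision stump as a stand-in for the paper's decision tree, and generates synthetic toy data. All function names are illustrative. Bagging in the usual sense would additionally draw bootstrap samples with replacement.

```python
import random

def make_toy_data(seed=0, n_neg=180, n_pos=20):
    """Synthetic (feature, label) pairs; label 1 = sentence boundary (the rare class)."""
    rng = random.Random(seed)
    neg = [(rng.gauss(0.2, 0.15), 0) for _ in range(n_neg)]
    pos = [(rng.gauss(0.6, 0.15), 1) for _ in range(n_pos)]
    return neg + pos

def downsample(examples, rng):
    """Keep every boundary example plus an equal-sized random subset of non-boundaries."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    return pos + rng.sample(neg, min(len(pos), len(neg)))

def train_stump(examples):
    """Fit a one-split tree; returns (threshold, p_below, p_above) posteriors."""
    best = None
    for threshold in sorted({x for x, _ in examples}):
        below = [y for x, y in examples if x < threshold]
        above = [y for x, y in examples if x >= threshold]
        if not below or not above:
            continue
        # Training errors if each side predicts its majority class.
        errors = (min(sum(below), len(below) - sum(below))
                  + min(sum(above), len(above) - sum(above)))
        if best is None or errors < best[0]:
            best = (errors, threshold,
                    sum(below) / len(below), sum(above) / len(above))
    if best is None:
        # No usable split (all feature values identical): fall back to the prior.
        prior = sum(y for _, y in examples) / len(examples)
        return (float("-inf"), prior, prior)
    return best[1:]

def stump_posterior(stump, x):
    threshold, p_below, p_above = stump
    return p_above if x >= threshold else p_below

def train_ensemble(examples, n_models, seed=0):
    """Train one stump per independently downsampled training set."""
    rng = random.Random(seed)
    return [train_stump(downsample(examples, rng)) for _ in range(n_models)]

def ensemble_posterior(stumps, x):
    """Average the boundary posteriors of all ensemble members."""
    return sum(stump_posterior(s, x) for s in stumps) / len(stumps)

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A typical use would be to call `train_ensemble(make_toy_data(), n_models=10)`, score held-out examples with `ensemble_posterior`, and compare the resulting `auc` against a single stump trained on the full imbalanced set.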
Pages: 468-494
Page count: 27