A Comprehensive Review on Phoneme Classification in ML Models
Keywords:
Phoneme Classification, Filter-Bank, Acoustic Features, Machine Learning, SVM, DNN, LSTM, Computing Methodologies, Artificial Intelligence, Speech Recognition, Feature Selection, Information Extraction, Supervised Learning, Classification.

Abstract
This paper presents a comparative performance analysis of shallow and deep machine learning classifiers for speech recognition tasks using frame-level phoneme classification. Phoneme recognition remains a fundamental and equally important initial step toward automatic speech recognition (ASR) systems. Conventional classifiers often perform remarkably well on domain-specific ASR systems with a limited vocabulary and limited training data, compared to deep learning approaches. It is therefore essential to evaluate how well deep artificial neural networks recognize atomic speech units, i.e., phonemes, against conventional state-of-the-art machine learning classifiers. Two deep learning models, a DNN and an LSTM, with multiple configurations obtained by varying the number of layers and the number of neurons per layer, are studied on the OLLO speech corpus, alongside six shallow machine learning classifiers, all using filter-bank acoustic features. In addition, features with three and ten frames of temporal context are computed and compared with context-free features across the different models. Classifier performance is evaluated in terms of precision, recall, and F1 score over 14 consonant and 10 vowel classes for 10 speakers with 4 different dialects. A high classification accuracy of 93% and an F1 score of 95% are obtained with the DNN and LSTM networks, respectively, on context-dependent features using 3 hidden layers of 1024 nodes each. Surprisingly, the SVM obtained an even higher classification score of 96.13%, with a misclassification error below 5% for consonants and below 4% for vowels.
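The context-dependent features described in the abstract (filter-bank features with 3- and 10-frame temporal context) are typically built by stacking neighbouring frames onto each frame before classification. A minimal numpy sketch of that stacking step, assuming a symmetric ±N window with edge padding (the paper's exact windowing scheme is not specified here and may differ):

```python
import numpy as np

def stack_context(features, context):
    """Stack +/- `context` neighbouring frames onto each frame.

    features: (T, D) array of per-frame filter-bank features.
    Returns a (T, D * (2*context + 1)) array; sequence edges are
    handled by repeating the first/last frame.
    """
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # one (T, D) slice per offset in the window, concatenated feature-wise
    stacked = [padded[i : i + T] for i in range(2 * context + 1)]
    return np.concatenate(stacked, axis=1)

# toy example: 100 frames of 40-dimensional filter-bank features
feats = np.random.randn(100, 40)
ctx3 = stack_context(feats, 3)    # 3-frame context on each side
ctx10 = stack_context(feats, 10)  # 10-frame context on each side
print(ctx3.shape)   # (100, 280)
print(ctx10.shape)  # (100, 840)
```

The stacked vectors can then be fed to any frame-level classifier (SVM, DNN); an LSTM can instead consume the unstacked frame sequence directly and learn temporal context itself.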
License
Copyright (c) IJSRCSEIT

This work is licensed under a Creative Commons Attribution 4.0 International License.