THE AUTOMATIC SPEAKER RECOGNITION SYSTEM OF CRITICAL USE CLASSIFIER OPTIMIZATION

Authors

  • O. V. Bisikalo, Vinnytsia National Technical University, Vinnytsia, Ukraine
  • T. V. Grischuk, Vinnytsia National Technical University, Vinnytsia, Ukraine
  • V. V. Kovtun, Vinnytsia National Technical University, Vinnytsia, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2018-2-4

Keywords:

automated speaker recognition system of critical use, signal processing, neural network, feature analysis

Abstract

Context. The problem of adapting a convolutional neural network classifier for use in an automatic speaker recognition system of critical use (ASRSCU) is considered. The research object is the individual features of the human speech process.
Objective. To develop means for extracting individual features from the speaker's speech signal, increasing their informativeness by factor analysis, representing them visually for a convolutional neural network classifier, and optimizing the classifier's architecture for the needs of the ASRSCU.
Method. Measures are proposed to optimize the speaker recognition procedure of the ASRSCU: the optimal representation of informative features and a method of increasing their informativeness are theoretically justified, as are the classifier topology and measures for increasing the efficiency of the speaker recognition process. In particular, the use of power-normalized cepstral coefficients (PNCC) is justified for describing phonograms recorded in noisy environments. We propose to use Gabor filters to represent the information that will be analyzed by the convolutional neural network; a sparse principal component analysis method, as an optimal factor analysis technique, to reduce the length of the feature vector while preserving its informativeness; and an improved convolutional neural network topology in which the Gabor filters are integrated into the convolution layer, which allows their parameters to be optimized during network training, and in which the fully connected layer is replaced by a deep neural network with a bottleneck layer whose weights after training are used as inputs for the GMM/HMM control classifier.
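
The abstract describes the improved topology only at a high level. The sketch below, written in Python with PyTorch (an assumed framework, not named in the paper), illustrates the general idea: a convolution layer whose kernels are initialized with 2D Gabor filters and remain trainable, followed by a bottleneck layer whose outputs could be passed to a GMM/HMM back-end. All sizes (16 filters, 11x11 kernels, a 40-band by 100-frame input, a 64-dimensional bottleneck, 50 speakers) are illustrative assumptions, and using bottleneck activations as back-end features is one common reading of the bottleneck idea rather than the authors' exact procedure.

# Hypothetical sketch, not the authors' code: a Gabor-initialized, trainable
# convolution layer followed by a bottleneck whose outputs could feed a GMM/HMM back-end.
import math
import torch
import torch.nn as nn

def gabor_kernel(size: int, theta: float, lam: float, sigma: float, gamma: float = 0.5) -> torch.Tensor:
    """Real part of a 2D Gabor filter sampled on a size x size grid."""
    half = size // 2
    ys, xs = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing="ij",
    )
    x_rot = xs * math.cos(theta) + ys * math.sin(theta)
    y_rot = -xs * math.sin(theta) + ys * math.cos(theta)
    envelope = torch.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * x_rot / lam)
    return envelope * carrier

class GaborCNN(nn.Module):
    """CNN whose first convolution is Gabor-initialized; ends in a bottleneck layer."""

    def __init__(self, n_filters: int = 16, kernel_size: int = 11,
                 n_speakers: int = 50, bottleneck_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(1, n_filters, kernel_size, padding=kernel_size // 2)
        # Initialize each kernel with a Gabor filter of a different orientation and
        # wavelength; the parameters stay trainable, so they are refined during training.
        with torch.no_grad():
            for i in range(n_filters):
                theta = math.pi * i / n_filters
                lam = 3.0 + 2.0 * (i % 4)
                self.conv.weight[i, 0] = gabor_kernel(kernel_size, theta, lam, sigma=lam / 2)
        self.pool = nn.AdaptiveAvgPool2d((8, 8))
        self.bottleneck = nn.Linear(n_filters * 8 * 8, bottleneck_dim)
        self.output = nn.Linear(bottleneck_dim, n_speakers)

    def forward(self, x, return_bottleneck: bool = False):
        h = torch.relu(self.conv(x))
        h = self.pool(h).flatten(1)
        b = torch.tanh(self.bottleneck(h))                    # bottleneck features
        return b if return_bottleneck else self.output(b)

# Usage: a batch of 4 "images" of 40 PNCC bands x 100 frames (assumed shape).
model = GaborCNN()
features = torch.randn(4, 1, 40, 100)
logits = model(features)                                      # speaker posteriors for CNN training
embeddings = model(features, return_bottleneck=True)          # candidate inputs for a GMM/HMM back-end
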
Results. Methods for representing and optimizing the speaker's individual features, methods for their visual presentation, and an improved convolutional neural network topology for speaker recognition on their basis have been developed.
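
The sparse principal component analysis named in the Method can be illustrated with a minimal Python sketch. Scikit-learn's SparsePCA is used here as an assumed off-the-shelf implementation, and the matrix sizes are arbitrary, so this is not the authors' procedure.

# Illustrative only: shorten a feature matrix (utterances x PNCC-derived features)
# with sparse PCA, which keeps components interpretable by zeroing many loadings.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 120))            # 200 utterances, 120-dimensional features (assumed sizes)

spca = SparsePCA(n_components=16, alpha=1.0, random_state=0)
X_reduced = spca.fit_transform(X)              # shorter, 16-dimensional feature vectors
sparsity = np.mean(spca.components_ == 0)      # fraction of loadings that are exactly zero
print(X_reduced.shape, f"{sparsity:.0%} of loadings are zero")
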
Conclusions. The obtained theoretical results have found empirical confirmation. In particular, the improved convolutional neural network proved more robust to noisy input phonograms than an ordinary convolutional neural network and a deep neural network. As the SNR increases to 10 dB, the GMM/HMM classifier becomes more efficient than the neural network, which can be explained by the efficiency of the UBM models used, although it is much more resource-intensive. The parameters of the Gabor filter bank frames that provide the most variable individual features of the speech signal for speaker recognition were also determined empirically.
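
For context on the SNR comparison, the hypothetical helper below (plain NumPy, not from the paper) shows one common way to prepare noisy test phonograms at a controlled SNR so that classifiers can be compared at, for example, 0, 5 and 10 dB.

# Hypothetical helper: mix additive noise into a clean phonogram at a target SNR.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12      # guard against silent noise segments
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: degrade the same 1 s, 16 kHz stand-in phonogram at several SNRs.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
for snr in (0, 5, 10):
    noisy = mix_at_snr(speech, noise, snr)
    # score = recognize_speaker(noisy)          # classifier-specific step, omitted here
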

References

Kalinli O., Seltzer M. L., Acero A. Noise adaptive training using a vector Taylor series approach for noise robust automatic speech recognition, [Electronic resource]. Access mode: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Ozlem_ICASSP09_final.pdf

Kovtun V. V., Bykov M. M. Otsiniuvannia nadiinosti avtomatyzovanykh system rozpiznavannia movtsiv krytychnoho zastosuvannia [Evaluation of the reliability of automated speaker recognition systems of critical use], Visnyk Vinnytskoho politekhnichnoho instytutu, Vinnytsia, 2017, No. 2, pp. 70–76.

Kim C., Stern R. M. Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring, [Electronic resource]. Access mode: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184.9018&rep=rep1&type=pdf

Mitra V., Franco H., Graciarena M., Mandal A. Normalized amplitude modulation features for large vocabulary noise-robust speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 25–30 March 2012: proceedings. Kyoto, Japan, IEEE, 2012, pp. 4117–4120. DOI: 10.1109/ICASSP.2012.6288824.

Speech Processing, Transmission and Quality Aspects (STQ), [Electronic resource]. Access mode: http://www.etsi.org/deliver/etsi_es/201100_201199/201108/01.01.03_60/es_201108v010103p.pdf

Graves A., Mohamed A. R., Hinton G. Speech recognition with deep recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 26–31 May 2013, proceedings, Vancouver, BC, Canada, IEEE, 2013, pp. 6645–6649. DOI: 10.1109/ICASSP.2013.6638947

Mohamed A., Dahl G., Hinton G. Acoustic modeling using deep belief networks, IEEE Transactions on Audio, Speech, and Language Processing, 2012, Vol. 20, No. 1, pp. 14–22. DOI: 10.1109/TASL.2011.2109382.

Davis S., Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, [Electronic resource]. Access mode: http://www.cs.northwestern.edu/~pardo/courses/eecs352/papers/Davis1980-MFCC.pdf

Hermansky H., Cohen J., Stern R. Perceptual Properties of Current Speech Recognition Technology, Proceedings of the IEEE, 2013, Vol. 101, No. 9, pp. 1968–1985. DOI: 10.1109/JPROC.2013.2252316.

Virtanen T., Singh R., Raj B. Techniques for Noise Robustness in Automatic Speech Recognition, Chichester, UK, John Wiley & Sons, Ltd, 2012. DOI: 10.1002/9781118392683.ch1.

Stern R., Morgan N. Hearing is Believing: Biologically inspired methods for robust automatic speech recognition, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/d4a9/a6aa42dcb2011e45a99b0174da6a47777b7a.pdf

Kim C., Stern R. Power-normalized cepstral coefficients (PNCC) for robust speech recognition, [Electronic resource]. Access mode: http://www.cs.cmu.edu/~robust/Papers/OnlinePNCC_V25.pdf

Movellan J. Tutorial on Gabor Filters, [Electronic resource]. Access mode: http://mplab.ucsd.edu/tutorials/gabor.pdf

Mesgarani N., Shamma S. Speech Processing with a Cortical Representation of Audio, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/f1d8/f93cdb64390b3a65f930cee4346c30bd86e4.pdf

Morgan N., Ravuri S. Using spectro-temporal features to improve AFE feature extraction for automatic speech recognition, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/c7c5/f2107f0ea9a3cedeeaf5cc0c48c0c92.pdf

Berthet Q., Rigollet P. Optimal Detection of Sparse Principal Components in High Dimension, [Electronic resource]. Access mode: https://arxiv.org/pdf/1202.5070.pdf

Bereza A. O., Bykov M. M., Hafurova A. D., Kovtun V. V. Optymizatsiia alfavitu informatyvnykh oznak dlia avtomatyzovanoi systemy rozpiznavannia movtsiv krytychnoho zastosuvannia [Optimization of the alphabet of informative features for an automated speaker recognition system of critical use], Visnyk Khmelnytskoho natsionalnoho universytetu, seriia: Tekhnichni nauky, Khmelnytskyi, 2017, No. 3(249), pp. 222–228.

Mak M. W., Yu H. B. A study of voice activity detection techniques for NIST speaker recognition evaluations, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/541f/9cfacdac000aadd57cd33b6d86dc96bc3308.pdf

How to Cite

Bisikalo, O. V., Grischuk, T. V., & Kovtun, V. V. (2018). THE AUTOMATIC SPEAKER RECOGNITION SYSTEM OF CRITICAL USE CLASSIFIER OPTIMIZATION. Radio Electronics, Computer Science, Control, (2). https://doi.org/10.15588/1607-3274-2018-2-4

Issue

Section

Neuroinformatics and intelligent systems