DOI: https://doi.org/10.15588/1607-3274-2018-2-4

THE AUTOMATIC SPEAKER RECOGNITION SYSTEM OF CRITICAL USE CLASSIFIER OPTIMIZATION

O. V. Bisikalo, T. V. Grischuk, V. V. Kovtun

Abstract


Context. The adaptation of a convolutional neural network classifier for use in an automatic speaker recognition system of critical use (ASRSCU) is considered. The research object is the individual features of the human speech process.
Objective. Development of means for extracting individual features from the speaker's speech signal, increasing their informativeness through factor analysis, representing them visually for a convolutional neural network classifier, and optimizing the classifier architecture for the needs of the ASRSCU.
Method. Measures are proposed to optimize the speaker recognition procedure of the ASRSCU: the optimal representation of informative features and a method for increasing their informativeness are theoretically justified, as are the classifier topology and measures for increasing the efficiency of the speaker recognition process. In particular, the use of power-normalized cepstral coefficients (PNCC) is justified for describing phonograms recorded in noisy environments. We propose to use Gabor filters to represent the information that will be analyzed by the convolutional neural network; a sparse principal component analysis method, as an optimal factor analysis technique, to reduce the feature vector length while preserving its informativeness; and an improved convolutional neural network topology in which the Gabor filters are integrated into the convolutional layer, which allows their parameters to be optimized during neural network training, while in the fully connected stage a deep neural network with a bottleneck layer is used, whose weights after training are used as inputs for the GMM/HMM control classifier.
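The spectro-temporal Gabor filters mentioned above can be generated as 2-D convolution kernels. The sketch below is purely illustrative: the kernel size and the orientation, wavelength, aspect-ratio, and phase values are demonstration assumptions, not the parameters tuned in the paper.

```python
import numpy as np

def gabor_kernel(size=11, theta=0.0, lam=4.0, sigma=2.0, gamma=0.5, psi=0.0):
    """Real 2-D Gabor kernel: a Gaussian envelope modulated by a cosine carrier.
    theta is the orientation, lam the carrier wavelength, sigma the envelope
    width, gamma the aspect ratio and psi the phase (all illustrative)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / lam + psi)
    return envelope * carrier

# A small bank over several orientations, e.g. to initialize convolutional kernels
bank = np.stack([gabor_kernel(theta=t)
                 for t in np.linspace(0.0, np.pi, 4, endpoint=False)])
```

In the improved topology described above, such kernels would sit in the convolutional layer and have their parameters refined during training; the fixed bank here only demonstrates the filter shape.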
Results. Methods for representing and optimizing the speaker's individual features, methods for their visual presentation, and an improved convolutional neural network topology for speaker recognition on their basis have been developed.
Conclusions. The obtained theoretical results have found empirical confirmation. In particular, the robustness of the improved convolutional neural network to noisy input phonograms proved higher than that of an ordinary convolutional neural network and of a deep neural network. When the SNR increases to 10 dB, the GMM/HMM classifier is more efficient than the neural network, which can be explained by the efficiency of the UBM models used, although it is much more resource-intensive. The parameters of the Gabor filter bank frames that extract the most variable individual features from the speech signal for speaker recognition were also determined empirically.
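The sparse principal component analysis mentioned above can be illustrated with a truncated power iteration, which hard-thresholds each loading vector to a fixed number of nonzero entries. This is a generic sketch of the technique, not the authors' implementation; the cardinality, iteration count, and seed are arbitrary demonstration values.

```python
import numpy as np

def sparse_pc(X, card=5, n_iter=100, seed=0):
    """First sparse principal component of X (samples x features) via
    truncated power iteration: after each power step, keep only the `card`
    largest-magnitude loadings and renormalize."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)          # center each feature
    C = Xc.T @ Xc                    # scatter matrix (covariance up to scale)
    v = rng.standard_normal(C.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = C @ v                              # power step
        keep = np.argsort(np.abs(v))[-card:]   # indices of the largest loadings
        mask = np.zeros_like(v)
        mask[keep] = 1.0
        v *= mask                              # hard-threshold to `card` nonzeros
        v /= np.linalg.norm(v)
    return v
```

Applied to a matrix of speaker feature vectors, the surviving nonzero loadings indicate which features carry the retained variance, which is how a feature vector can be shortened while preserving informativeness.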

Keywords


automated speaker recognition system of critical use; signal processing; neural network; feature analysis

References


Kalinli O., Seltzer M. L., Acero A. Noise adaptive training using a vector Taylor series approach for noise robust automatic speech recognition, [Electronic resource]. Access mode: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Ozlem_ICASSP09_final.pdf

Kovtun V. V., Bykov M. M. Otsiniuvannia nadiinosti avtomatyzovanykh system rozpiznavannia movtsiv krytychnoho zastosuvannia, Visnyk Vinnytskoho politekhnichnoho instytutu, Vinnytsia, 2017, No. 2, pp. 70–76.

Kim C., Stern R. M. Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring, [Electronic resource]. Access mode: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184.9018&rep=rep1&type=pdf

Mitra V., Franco H., Graciarena M., Mandal A. Normalized amplitude modulation features for large vocabulary noise-robust speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 25–30 March 2012, proceedings, Kyoto, Japan, IEEE, 2012, pp. 4117–4120. DOI: 10.1109/ICASSP.2012.6288824.

Speech Processing, Transmission and Quality Aspects (STQ), [Electronic resource]. Access mode: http://www.etsi.org/deliver/etsi_es/201100_201199/201108/01.01.03_60/es_201108v010103p.pdf

Graves A., Mohamed A. R., Hinton G. Speech recognition with deep recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 26–31 May 2013, proceedings, Vancouver, BC, Canada, IEEE, 2013, pp. 6645–6649. DOI: 10.1109/ICASSP.2013.6638947

Mohamed A., Dahl G., Hinton G. Acoustic modeling using deep belief networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31 January 2011, proceedings, IEEE, 2011, pp. 14–22. DOI: 10.1109/TASL.2011.2109382

Davis S., Mermelstein P. Comparison of parametric representation of monosyllabic word recognition in continuously spoken sentences, [Electronic resource]. Access mode: http://www.cs.northwestern.edu/~pardo/courses/eecs352/papers/Davis1980-MFCC.pdf

Hermansky H., Cohen J., Stern R. Perceptual Properties of Current Speech Recognition Technology, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 23 July 2013, proceedings, IEEE, 2013, pp. 1968–1985. DOI: 10.1109/JPROC.2013.2252316.

Virtanen T., Singh R., Raj B. Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons, Ltd, Chichester, UK, 2012. DOI: 10.1002/9781118392683.ch1.

Stern R., Morgan N. Hearing is Believing. Biologically inspired methods for robust automatic speech recognition, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/d4a9/a6aa42dcb2011e45a99b0174da6a47777b7a.pdf

Kim C., Stern R. Power-normalized cepstral coefficients (PNCC) for robust speech recognition, [Electronic resource]. Access mode: http://www.cs.cmu.edu/~robust/Papers/OnlinePNCC_V25.pdf

Movellan J. Tutorial on Gabor Filters, [Electronic resource]. Access mode: http://mplab.ucsd.edu/tutorials/gabor.pdf

Mesgarani N., Shamma S. Speech Processing with a Cortical Representation of Audio, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/f1d8/f93cdb64390b3a65f930cee4346c30bd86e4.pdf

Morgan N., Ravuri S. Using spectro-temporal features to improve AFE feature extraction for automatic speech recognition, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/c7c5/04087f2107f0ea9a3cedeeaf5cc0c48c0c92.pdf

Berthet Q., Rigollet P. Optimal Detection of Sparse Principal Components in High Dimension, [Electronic resource]. Access mode: https://arxiv.org/pdf/1202.5070.pdf

Bereza A. O., Bykov M. M., Hafurova A. D., Kovtun V. V. Optymizatsiia alfavitu informatyvnykh oznak dlia avtomatyzovanoi systemy rozpiznavannia movtsiv krytychnoho zastosuvannia, Visnyk Khmelnytskoho natsionalnoho universytetu, seriia: Tekhnichni nauky, Khmelnytskyi, 2017, No. 3(249), pp. 222–228.

Mak M. W., Yu H. B. A study of voice activity detection techniques for NIST speaker recognition evaluations, [Electronic resource]. Access mode: https://pdfs.semanticscholar.org/541f/9cfacdac000aadd57cd33b6d86dc96bc3308.pdf


Bykov M. M., Kovtun V. V., Smolarz A. et al. Research of neural network classifier in speaker recognition module for automated system of critical use, Proc. SPIE 10445, Photonics Applications in Astronomy, Communications, Industry, and High Energy Physics Experiments 2017, 1044521. DOI: 10.1117/12.2280930.






Copyright (c) 2018 O. V. Bisikalo, T. V. Grischuk, V. V. Kovtun

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Address of the journal editorial office:
Editorial office of the journal «Radio Electronics, Computer Science, Control»,
Zaporizhzhya National Technical University, 
Zhukovskiy street, 64, Zaporizhzhya, 69063, Ukraine. 
Telephone: +38-061-769-82-96 – the Editing and Publishing Department.
E-mail: rvv@zntu.edu.ua

The reference to the journal is obligatory in the cases of complete or partial use of its materials.