ANALYSIS OF THE OPERATION RESULTS OF AN AUTOMATED SPEAKER RECOGNITION SYSTEM OF CRITICAL USE

Context. The article generalizes statistical learning theory to evaluate the long-term operation results of an automated speaker recognition system of critical use (ASRSCU), taking into account the features of the system's object of operation and the structural specificity of this class of recognition systems.
Objective. The goal of the work is to develop a set of methods for stabilizing the quality parameters of the ASRSCU during its long-term operation.
Method. The article formulates a set of methods for estimating the operational risks of the ASRSCU during long-term operation. In particular, the dependence of the risk of incorrect speaker recognition on the dimension of the feature space is described. Based on the formulated measure of informativity, a set of methods is obtained for analyzing the training sample to identify examples that increase risk. The influence of the drift of speech signal parameters on the quality indicators of the ASRSCU is described analytically. The operation duration of the ASRSCU during which re-training of its classifier is impractical is estimated. Recommendations for choosing an optimal ASRSCU classifier are formulated from the standpoint of minimizing its complexity, taking into account the risks of long-term operation and the possibility of re-training.
Results. The theoretical results presented in the article are verified by experimental DET curves, which summarize long-term experiments with the ASRSCU. The feature space in these experiments was configured from features based on power-normalized cepstral coefficients and features based on the theory of spectral-temporal receptive fields. Within the framework of the created theoretical concept, the influence of the feature space configuration and of the type and complexity of the classifier on the stability of the ASRSCU's quality parameters during long-term operation is estimated.
Conclusions. For the first time, the problem of minimizing the average risk from the empirical operation results of an ASRSCU is analyzed theoretically. Unlike existing approaches, the analysis takes into account non-stationary input data with drift of individual speech signal features, as well as the characteristic parameters of the recognition system classifier, which made it possible to estimate the confidence interval of the risk under the condition that re-training sessions are carried out.

$\theta$ is a phase of spectral impulse filter feedback;
$\kappa_1$ is a weight constant;
$\kappa_2$ is a time constant;
$\kappa_3$, $\kappa_4$, $\kappa_5$, $\kappa_6$ are constants that set the CNN parameters;
$\kappa_7$ is a training rank;
$\xi_n$ is a set of $n$ independent, identically distributed random variables;
$\rho$ is the probability that, for at least one of the $N$ functions …, the upper limit of the risk $R_{\max}$ exceeds $R_m$;
$\tau$ is the mathematical expectation of the interval between re-trainings;
$\phi$ is a phase of spectral impulse filter feedback;
… is a function estimating the degree of drift of the input data;
$\omega$ is a density parameter of the spectral impulse response filters;
… is an element of the training sample, where $X$ and $Y$ are the sets of empirical input and output data of the system;
$1 - \delta$ is the reliability of the classifier's training;
… is the impulse response of each filter in the bank;
$H$ is a class of hypotheses of indicator functions $g$;
$i$ is an iterator;
… is an informative measure of the deviation of empirical data from the training data;
$j$ is an iterator;
$L_q(h)$ is a loss function, which describes the average difference between a random variable $Y$ and …;
$y_A(t, f)$ is a model of the hair cells' work;
$y_C(t, f)$ is the affine wavelet transform of the speech signal frame $s(t)$;
$y_{LIN}(t, f)$ is a model of the lateral inhibitory network;
$y_n^u$ is the output feature map $n$ of layer $u$;
… is a set of empirical data.

INTRODUCTION
The automated speaker recognition system of critical use (ASRSCU) [1], like all speaker recognition systems, recognizes a speaker's identity by analyzing individual attributes extracted from a phonogram containing a recording of the speech signal. A speaker's pronunciation is naturally variable because of both internal and external factors. The internal factors of speech variability include the style, tempo and volume of speech. The external factors are characterized by the type and level of noise in the acoustic and hardware channels through which the speech signal propagates, as well as by distorted perception of the speech signal caused by reverberation in the speaker's surroundings. High-level individual characteristics of speech are also distinguished, such as dialect and speech style, which manifest themselves in the acoustic characteristics of the speech signal and in the tempo of speech. The ability of the ASRSCU to distinguish internal variability factors, taking into account the high-level individual characteristics, and its resistance to external variability factors can be established through a systematic approach to forming the feature space, selecting and parameterizing the classifier, forming the training sample and regulating the training process. There are other, "extreme" sources of variability in the amplitude-frequency characteristics of the speech signal, caused by the state of the speaker's health, the acoustic parameters and geometry of the room where the system operates, and the parameters and location of the microphones. However, these types of variability distort the values of the informative features so strongly that they are reasonably easy to identify and to take into account when deciding on the result of a recognition session, given the degree of distortion, the scope of use and the type of the recognition system.
However, the study [2] showed that during prolonged use of a speaker recognition system the speech signal parameters drift because of normal physiological processes in the human articulatory apparatus. As a result, the time elapsed between the training session of the ASRSCU classifier and the recognition session can significantly affect the quality of the system's performance. Consequently, critical use of a speaker recognition system requires studying how operating time influences the quality indicators of the system in order to stabilize them.
The object of study is the individual features of human speech activity, of the auditory perception of speech signals by a human being, and of their analysis by the auditory cortex.
The subject of study comprises the methods of pattern recognition theory for modeling the recognition system, the methods of statistical learning theory for analyzing the risks arising from long-term use of the recognition system, the methods of neural network theory for implementing the optimal classifier of the recognition system, and the methods of spectral-temporal receptive fields for describing how the human auditory cortex perceives speech signals.
The purpose of the work is to estimate the risks of long-term operation of the ASRSCU and to propose measures to reduce them.

PROBLEM STATEMENT
Given a training sample … of examples randomly selected in accordance with an unknown probability distribution over $X$, it is necessary to train a classifier with a given accuracy $\varepsilon > 0$ on the examples of the training sample, for any target function $g$ and any probability distribution $P$ on $X$, such that with probability greater than $1 - \delta$ the loss of the obtained hypothesis exceeds the loss of the optimal hypothesis in $H$ by no more than $\varepsilon$. This assumption suggests that the empirical risk can be minimized by a classifier training algorithm whose result $\hat{h}$ is the choice of $h \in H$ with the minimum value of $L_q(h)$, and that the purpose of the training procedure can be formulated as the choice of an element of $H$ that minimizes the generalization error ….
In the rest of the article, we develop this concept toward identifying the relationship between the performance indicators of the ASRSCU, the parameters of the training process and the parameters of the recognition system classifier during long-term operation.

REVIEW OF THE LITERATURE
In the theory of pattern recognition, of which speaker recognition systems are one application, a central problem is minimizing the average risk on the basis of empirical data; this problem developed into the theory of statistical learning [3]. The theory investigates many complex problems, in particular the restoration of dependencies and distribution densities (pattern recognition) and the interpretation of the results of indirect experiments. As already noted, speech signal parameters are subject to drift, which over time degrades the quality characteristics of the ASRSCU. In studies [4, 5] a hypothesis is formulated that the drift of speech signal parameters has an insignificant effect on the quality of speaker recognition, and it is proposed to compensate for it by introducing a number of corrective coefficients. In [6], on the assumption that the drift of the characteristic parameters of the speech signal is constant in time but small in absolute value, the drift is estimated probabilistically from a sequence of recognition session results, and an upper limit on the degree of drift is estimated for a given error probability, taking into account the training algorithm of the recognition system classifier; however, the influence of the amount of evaluated data on the reliability of the estimates is not investigated. The paper [7] describes a method for determining the maximum drift rate admissible for a given recognition system classifier; the classifiers considered, however, do not include neural networks. Papers [8, 9] acknowledge the phenomenon of drift and formulate requirements for optimizing the parameters of the classifier re-training process, stating in particular requirements on the phonetic composition of the language material used for re-training, which reduces the total size of the training sample.
In [10] the influence of "extreme" variations of speech signals on the quality of the speaker recognition system is estimated, together with the permissible limits of variation of individual spectral parameters. However, all the aforementioned works make a priori assumptions about the nature and parameters of drift in speech signals. An urgent task is therefore to generalize the theory of statistical learning to the problem of speech signal parameter drift during the long-term operation of a speaker recognition system.

MATERIALS AND METHODS
For an unknown distribution …, … is the smallest size of the set for which the … is violated. On the basis of the foregoing, we consider the problem of minimizing the risk functional as the task of minimizing the functional of empirical risk …. In this case, (1) is characterized by the probability of incorrect classification …, and (2) by the frequency of occurrence of such an event. If all empirical data $z$ are drawn from the same distribution, then with probability …
Equation (3) describes the dependence of the risk of incorrect classification on the dimension of the factor space, which can be reduced by applying, in particular, principal component analysis [11], thus reducing the computational complexity of the recognition task. However, in the context of the critical use of the recognition system, an increase in the risk of incorrect classification is unacceptable; it can be prevented by removing from the training sample the examples that increase risk. This operation is proposed to be carried out on the basis of Shannon's informativeness [12]: …
If the parameters of the generating probability distribution are unknown, it is suggested to use the test to identify the distribution point in the context of the speaker's identity, which will be recognized by the ASRSCU [13]: …
Besides the parameters of the training procedure, the quality parameters of the ASRSCU are also affected by the drift of the speech signal parameters caused by physiological changes in the human speech apparatus. If the ASRSCU operates for a long time, the quality of recognition decreases, because the generating distribution of the input data changes. Below, we call this phenomenon the drift of the joint distribution $P(x, y)$. Machine learning theory specifies the sufficient size of the training sample $O$ … for "drifting" data in the form … [13]. However, the relationship of drift to the empirical and true risk remains an open question.
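The removal of risk-increasing examples from the training sample, described above, can be illustrated with a minimal sketch. The measure (4) itself is not reproduced here; as a stand-in, the sketch assumes a diagonal Gaussian fitted to the feature vectors of one speaker and uses the Shannon self-information $-\log_2 p(x)$ of each example. The function names, the Gaussian assumption and the quantile threshold are all hypothetical choices made for illustration only.

```python
import numpy as np


def surprisal_bits(features: np.ndarray) -> np.ndarray:
    """Self-information -log2 p(x) of each row.

    Assumption: p is a diagonal Gaussian fitted to all rows of `features`
    (rows are training examples, columns are feature dimensions).
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-12  # guard against zero variance
    log_pdf = -0.5 * (np.log(2.0 * np.pi * var)
                      + (features - mean) ** 2 / var).sum(axis=1)
    return -log_pdf / np.log(2.0)  # convert nats to bits


def flag_risky_examples(features: np.ndarray, quantile: float = 0.95) -> np.ndarray:
    """Boolean mask of examples whose surprisal exceeds the given quantile."""
    s = surprisal_bits(features)
    return s > np.quantile(s, quantile)
```

Examples flagged by the mask are candidates for removal or manual inspection; the quantile trades the size of the training sample against the tolerated share of atypical examples.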
Assume that recognition is performed by a Bayesian classifier in …. If the example $x_i$ presented for classification is close to the …, then the classification is carried out in accordance with a reliable estimate of the conditional probability, and the error probability is …, which defines the degree of deviation of the input data from the data of the training sample. That is, if $x_i$ differs strongly from the data of the training sample, the error probability can be estimated as …. Under drift, these indicators generally take the form …. Consequently, there is a link between the need to re-train the classifier and the value of the empirical risk, embodied in the value of informativity.
When creating critical systems, risk management is necessary, so we tie the re-training operation to the situation in which the empirical risk exceeds some threshold. That is, we carry out re-training if, with probability $\rho$, for at least one of the $N$ functions $Q(z, \alpha_k)$, $k = 1, \dots, N$, the upper limit of the risk exceeds the threshold according to the results of testing on a sample of no fewer than $m$ elements: …, or, expanding this relation: …
Inequality (7) describes the ratio of the empirical risk to the threshold values for any $l \ge m$. If the recognition system is in use, then by generalizing the results of its work over a certain time one can calculate the empirical risks $R_e(\alpha_k)$ for various classes of system parameters, for example, the length or content of the passphrase, the number of microphones used to record it, the acoustic parameters of the space where the system operates, etc. If for some class the empirical risk exceeds the threshold $R_{tr}$, then re-training is needed for the data of this class.
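The per-class re-training rule described above can be sketched as follows. This is a simplified illustration, not the inequality (7): the empirical risk of each class is taken as the plain error frequency over the logged recognition sessions, and a class is flagged only if it has at least a minimum number of sessions. The function names, the session log format and the parameter names `risk_threshold` and `min_sessions` are hypothetical.

```python
from collections import defaultdict


def empirical_risks(sessions):
    """Map each class to (error frequency, number of sessions).

    `sessions` is an iterable of (condition_class, is_error) pairs,
    e.g. ("short passphrase", True) for a misrecognized session.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [errors, total]
    for cls, is_error in sessions:
        counts[cls][0] += int(is_error)
        counts[cls][1] += 1
    return {cls: (errors / total, total) for cls, (errors, total) in counts.items()}


def classes_to_retrain(sessions, risk_threshold, min_sessions):
    """Classes whose empirical risk exceeds the threshold on enough sessions."""
    risks = empirical_risks(sessions)
    return sorted(cls for cls, (risk, total) in risks.items()
                  if total >= min_sessions and risk > risk_threshold)
```

For example, with a threshold of 0.2 and a minimum of 5 sessions, a class with 3 errors in 10 sessions is flagged, while a class with only 2 logged sessions is not judged at all.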
Let us describe the probability of the situation in which, after testing the system on a set of … elements, the empirical risk exceeds the threshold $R_{tr}$: …, where the unknown true risks $R(\alpha_k)$ are used. Their limit values can be obtained by analyzing the empirical risks, but because of the limited sample size these estimates will be understated. However, it is possible to obtain a reliable lower bound $r$ that provides a monotonic increase in the probability of re-training $P_r$ as $R(\alpha_k)$ grows.
Let $(\xi_n)$, $n \ge 1$, be a sequence of intervals between the re-training procedures of the recognition system, measured in the number of recognition sessions performed. Assume that the $\xi_n$ are independent, identically distributed random variables; then the probability that re-training occurs after … sessions is described by …, and the probability that re-training occurs after no more than … sessions is described by …. The mathematical expectation of the interval between re-trainings is described by ….
Let us determine the effect of re-training on some bounded positive rational function …, taking into account the corrective operator $\Psi$. For a given $P_r$, the corrected estimate (8) …. We simplify inequality (9): …. Using the above considerations, we obtain on the basis of (9), taking into account (11), the expression for determining the true risk: …, and, generalizing (10) and (11), we obtain the constraints on the choice of $\rho_0$ for (12): ….
We now formulate measures for the practical use of the above theory of risk assessment for the ASRSCU, taking into account the procedure of re-training the classifier in response to the drift of the speech signal parameters.
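The reasoning about intervals between re-trainings can be made concrete under one additional assumption that is not stated in the text: every recognition session independently triggers re-training with the same probability $P_r$. The interval then follows a geometric distribution, and the three quantities discussed above take the closed forms computed below. The function names are hypothetical, and the formulas illustrate this assumption rather than reproduce the expressions of the article.

```python
def prob_retrain_at(n: int, p_r: float) -> float:
    """P(interval = n): no trigger in n - 1 sessions, then a trigger."""
    return (1.0 - p_r) ** (n - 1) * p_r


def prob_retrain_within(n: int, p_r: float) -> float:
    """P(interval <= n): at least one trigger in n sessions."""
    return 1.0 - (1.0 - p_r) ** n


def expected_interval(p_r: float) -> float:
    """Mean number of sessions between re-trainings (geometric law)."""
    return 1.0 / p_r
```

Under this assumption, a per-session re-training probability of 0.01 corresponds to an expected interval of 100 recognition sessions.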
In the context of the foregoing, the ASRSCU requires a classifier designed to balance the reduction of the empirical risk against the growth of the gap between the empirical and true risk as the complexity of the classifier increases; by complexity we mean the capacity of the set of input data that the classifier is capable of recognizing. This balance is proposed to be ensured by minimizing the upper limit of the true risk for the specified values of reliability and length of the training sample. Based on [14], we formulate a kind of indicator function that minimizes the empirical risk with probability $1 - \rho$: …, where …. If the re-training procedure is implemented, it is expedient to minimize the true risk and to estimate the limit of the empirical risk for the specified re-training risk. It was shown above that, with reliability $1 - \rho$, re-training of the classifier will not occur if $q_t > 1 - \rho$, which allows (14) to yield an analytical expression for the bound that describes the effect of re-training on the choice of a classifier: …
Thus, a set of measures for assessing the operational risks of long-term use of the ASRSCU is proposed. In particular, (3) describes the dependence of the risk of incorrect classification on the dimension of the factor space. The measure of informativity (4) makes it possible to analyze the training sample for examples that increase risk. Expression (8) describes the influence of the drift of input parameters on the quality indicators of the ASRSCU, and (13) gives an estimate of the operation duration of the ASRSCU during which re-training the classifier is impractical. Finally, (15) makes it possible to choose the optimal classifier from the standpoint of minimizing its complexity, taking into account the risks of long-term use of the ASRSCU and the possibilities of re-training.
In general, the above material for the first time comprehensively describes the problem of minimizing the average operation risk of the ASRSCU from empirical data, generalized to non-stationary input data with drift and to the parameters of the recognition system classifier. The limits of the confidence intervals of the risk are calculated taking into account the classifier re-training procedures.

EXPERIMENTS
The statistical data for the empirical assessment of the adequacy of the above theoretical concepts of operational risk analysis were obtained from the results of long-term use of the ASRSCU at the Department of Computer Control Systems of Vinnytsia National Technical University. This ASRSCU has a classical architecture, which includes a block for preliminary speech signal processing, a block for extracting informative features and a classification block.
In the pre-processing block, intervals of speech activity in the phonograms were detected using a two-channel VAD algorithm [16]. Speech activity intervals lasting 3 seconds were segmented into frames of 30 ms duration with a 15 ms shift. To compensate for the Gibbs effect, the signal was weighted by the Hamming window. Channel distortions at the feature level were offset by cepstral mean subtraction and, given the sufficient duration of the analysis frames, by feature warping [17].
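The framing, windowing and cepstral mean subtraction steps described above can be sketched as follows. The sketch assumes a 16 kHz sampling rate in the usage note, which the text does not specify, and the function names are hypothetical; voice activity detection and feature warping are omitted.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms=30.0, shift_ms=15.0):
    """Split a 1-D signal into overlapping frames weighted by a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // shift
    # Row k holds the sample indices of frame k.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames (rows)."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

At the assumed 16 kHz rate, a 3-second interval contains 48000 samples, each frame has 480 samples, the shift is 240 samples, and the interval yields 199 full frames.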
In the feature extraction block, 19 power-normalized cepstral coefficients [18], their energy and their first and second derivatives were extracted from each frame received from the pre-processing block. In addition, speech signals were represented using the theory of spectral-temporal receptive fields (STRF), which describes the work of the human auditory system in the spectral and temporal domains on the basis of psychoacoustic and neurophysiological studies of the peripheral and central auditory system of mammals [19, 20]. The STRF description of the speech signal included two stages. At the first stage, the auditory spectrum was obtained by simulating the peripheral auditory system. At the second stage, high-level representations of speech were synthesized from the first-stage results by simulating the auditory cortex of the human central nervous system.
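The composition of the cepstral part of the feature vector (static coefficients plus first and second derivatives) can be sketched as below. The derivatives are approximated here by simple finite differences over frames; the article does not say how they were computed, and practical systems often use a regression window instead. The function name is hypothetical, and the computation of the power-normalized cepstral coefficients themselves is outside the scope of the sketch.

```python
import numpy as np


def add_deltas(static):
    """Append first and second time derivatives to static features.

    `static` has shape (n_frames, n_coeffs); the result has shape
    (n_frames, 3 * n_coeffs). Derivatives are finite differences
    along the frame axis (an assumption of this sketch).
    """
    static = np.asarray(static, dtype=float)
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])
```

With 19 coefficients and the energy as static features, each frame yields 20 static values and 60 values after the derivatives are appended.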
To implement the first stage, an affine wavelet transform $y_C(t, f)$ of the speech signal frame $s(t)$ was first carried out by passing the frame through a bank of cochlear filters: …, where $*_t$ is the convolution operation in the time domain. Next, the work of the hair cells was modeled by sequentially performing a high-pass filtering operation, which emulates the conversion of sound pressure into the velocity of the hairs; a nonlinear compression operation $g(u)$; and a low-pass filtering operation $w(t)$, which emulates the phase locking of the auditory nerve: …. Next, the work of the lateral inhibitory network of the cochlear nucleus, $y_{LIN}(t, f)$, was modeled as a frequency selection operation, in which the partial derivative of $y_A(t, f)$ with respect to frequency was passed through a half-wave rectifier: …. The first stage ended with obtaining the auditory spectrogram …. Fig. 1 allows a visual comparison of examples of Fourier and STRF spectrograms. Fig. 2 shows a scalable STRF representation of one of the speech signal frames from Fig. 1 in the $h_{rt}$, $h_{sc}$ space and the MFCC representation of the same frame.
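The hair cell and lateral inhibition steps described above can be sketched on an already computed matrix of cochlear filter outputs. All concrete choices here are assumptions for illustration: the high-pass filter is a first difference in time, the compression $g(u)$ is a centered sigmoid, the low-pass filter $w(t)$ is a one-pole smoother with a hypothetical coefficient `alpha`, and the frequency derivative is a first difference across channels. The cochlear filter bank itself and the final integration that produces the auditory spectrogram are not included.

```python
import numpy as np


def hair_cell_stage(y_c, alpha=0.9):
    """Hair cell model applied to cochlear outputs of shape (time, channels)."""
    y_c = np.asarray(y_c, dtype=float)
    # High-pass filtering: first difference along time (pressure -> velocity).
    velocity = np.diff(y_c, axis=0, prepend=y_c[:1])
    # Nonlinear compression g(u): sigmoid shifted so that g(0) = 0.
    compressed = 1.0 / (1.0 + np.exp(-velocity)) - 0.5
    # Low-pass filtering w(t): one-pole recursive smoother along time.
    y_a = np.empty_like(compressed)
    state = np.zeros(compressed.shape[1])
    for t in range(compressed.shape[0]):
        state = alpha * state + (1.0 - alpha) * compressed[t]
        y_a[t] = state
    return y_a


def lateral_inhibition(y_a):
    """Frequency derivative followed by half-wave rectification."""
    y_a = np.asarray(y_a, dtype=float)
    d_freq = np.diff(y_a, axis=1, prepend=y_a[:, :1])
    return np.maximum(d_freq, 0.0)
```

The output keeps the shape of the input and is non-negative; a constant input produces an all-zero output because both difference operations remove constant components.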
In the investigated ASRSCU, three informative features were extracted from the frames of the speech signal according to the results of the STRF analysis. The first feature …, where $N_\omega$ is the number of $h_{sc}$ elements; the values of the phase parameters $\phi$ and $\theta$, given their low informativity for the speaker recognition task [19], were taken equal to 0 to simplify the calculations. The second feature was obtained by …. Thus, the vector of informative features for one frame of the input speech signal consisted of 79 elements after processing. These vectors are represented visually as a spectrogram-like structure in which frame numbers are plotted along the abscissa, the ordinate values correspond to the indices of the informative features, and the color intensity shows the value of the corresponding feature within the frame, multiplied by the corresponding weighting factor. This way of presenting the informative features is dictated by the type of the ASRSCU classifier. In the classification block of the ASRSCU, a convolutional neural network [21] was implemented. Its architecture follows recommendation (15) on the complexity of the recognition system classifier and takes into account the parameters of the feature space, the operating conditions and the purpose of the ASRSCU. The structure of the network (see Fig. 3) includes two convolution layers for feature extraction, two subsampling layers for reducing the feature dimension, two local normalization layers, three fully connected layers and a final softmax output layer.
The …. The network was trained by the stochastic gradient descent method with step 128. The rule for updating the weight $w_k$ at iteration $k$ has the form …, where … is the derivative. The initial values of the neuron weights in each layer were drawn from a zero-mean Gaussian distribution with a standard deviation of 1. The training error was 0.0002.

RESULTS
The main purpose of the experiments with the ASRSCU described above was to assess the impact of operation duration on the quality of the recognition system, while generalizing data on the informativity of the elements of the feature space. For this purpose, the ASRSCU software was installed on three computers at the Department of Computer Control Systems of Vinnytsia National Technical University and operated for two years. Six speakers (4 male and 2 female) took part in the experiments; each conducted regular recognition sessions at least once every five days (over 2000 recognition sessions per speaker in total during the study period), and the results were recorded. The possible outcomes of a recognition session were correct recognition of the speaker, confusion of the speaker (error of the first kind, Miss) or denial of access (error of the second kind, False Alarm). The results of the experiments are presented as detection error trade-off (DET) curves, which show the dependence of the probability $P_\alpha$ of first-kind errors on the probability $P_\beta$ of second-kind errors for the same decision threshold of the recognition system's classifier. In particular, Fig. 4 shows the DET curves $P_\alpha(P_\beta)$ as a function of the operation duration of the ASRSCU without re-training of the classifier, which makes it possible to evaluate the drift of the individual features that characterized the speakers during recognition. Fig. 5 shows the DET curves $P_\alpha(P_\beta)$ obtained in compliance with the recommendations for the re-training frequency … (8) and (13), respectively. Fig. 6 shows the DET curves $P_\alpha(P_\beta)$ as a function of the configuration of the feature space of the ASRSCU, which regularly underwent re-training procedures with the parameters of the training sample regulated by the above theoretical results.
Figure 3 - Architecture of the ASRSCU's convolutional neural network classifier
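How the points of a DET curve are obtained from recognition scores can be sketched as follows. The sketch assumes that each session yields a scalar score that is compared with a decision threshold, with higher scores favoring acceptance; this scoring convention and the function name are assumptions, since the decision rule of the ASRSCU is not detailed here. In the notation above, `p_miss` corresponds to $P_\alpha$ and `p_fa` to $P_\beta$.

```python
import numpy as np


def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm probabilities for every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    Returns (thresholds, p_miss, p_fa) as arrays of equal length.
    """
    target = np.sort(np.asarray(target_scores, dtype=float))
    nontarget = np.sort(np.asarray(nontarget_scores, dtype=float))
    thresholds = np.unique(np.concatenate([target, nontarget]))
    # Miss: a target trial scores strictly below the threshold.
    p_miss = np.searchsorted(target, thresholds, side="left") / target.size
    # False alarm: a non-target trial scores at or above the threshold.
    p_fa = 1.0 - np.searchsorted(nontarget, thresholds, side="left") / nontarget.size
    return thresholds, p_miss, p_fa
```

Plotting `p_miss` against `p_fa`, conventionally on normal-deviate axes, gives the DET curve; as the threshold grows, the miss probability does not decrease and the false-alarm probability does not increase.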
The obtained DET curves confirm the adequacy of the theoretical estimates of the classifier complexity sufficient for the ASRSCU to decide on the speaker's identity. They also confirm the expediency of re-training the ASRSCU classifier and the correctness of the parameters of this procedure determined from the theoretical estimates.

DISCUSSION
The results of the experiments in Fig. 4 show that the quality indicators of the ASRSCU decrease stochastically during long-term operation, without any identifiable trend. This still allows the speaker recognition system to be used for its intended purpose to some extent, but it makes critical use impossible, since one of the conditions of critical use is the predictability of the results.
The results shown in Fig. 5 clearly confirm the expediency of re-training the classifier with parameters regulated by the theoretical results obtained in part 4 of the article. It should be noted that, in addition to the periodicity of re-training, the obtained results reveal the relationship between the probabilities of the first and second kind errors of the ASRSCU and the composition and size of the training samples used for re-training. The analysis of the long-term operation of the ASRSCU revealed the effect of the drift of informative features on the quality of the system's operation. This provides objective material for optimizing the factor space of the ASRSCU by re-evaluating the weights of the informative features, which are subsequently visualized before being passed to the convolutional neural network classifier.
The results shown in Fig. 6 indicate, on the one hand, that the features based on power-normalized cepstral analysis of speech signals are more informative. On the other hand, the features that result from the practical application of the theory of spectral-temporal receptive fields make up only about 4% of the feature space, yet they not only significantly increase the quality of the ASRSCU but also make the DET curves more linear, that is, they generally stabilize the decision-making process of the system of critical use.

CONCLUSIONS
The article presents a theoretical analysis of the long-term operation of the ASRSCU, on the basis of which practical recommendations for stabilizing the quality indicators of the recognition system are formulated.
The scientific novelty of the obtained results is that, for the first time, the problem of minimizing the average risk from the empirical operation results of a speaker recognition system of critical use has been analyzed theoretically. Unlike existing approaches, the analysis takes into account non-stationary input data with drift and the characteristic features of the recognition system classifier, which made it possible to estimate the limits of the confidence intervals of the risk, provided that re-training sessions are carried out. The practical consequence of the theoretical analysis is the formulated set of measures for assessing the operational risks of long-term use of the ASRSCU. In particular, (3) describes the dependence of the risk of incorrect classification on the dimension of the factor space. Based on the formulated measure of informativity (4), the training sample was analyzed to identify elements that increase risk. Expression (8) describes the influence of the drift of speech signal parameters on the quality indicators of the ASRSCU. Using (13), the operation duration of the ASRSCU during which re-training the classifier is impractical was estimated. Applying (15), the optimal classifier was chosen from the standpoint of minimizing its complexity, taking into account the risks of long-term use of the ASRSCU and the possibility of re-training. In particular, the resulting convolutional neural network classifier of the ASRSCU has a compact structure and confirmed the predicted efficiency. The correctness of the formulated recommendations is confirmed by empirical results presented in the form of DET curves.
Subsequent studies will be devoted to determining the full potential of the theory of spectral-temporal receptive fields for synthesizing informative features for speaker recognition. As the experiments have shown, the use of such features not only significantly increases the quality of the ASRSCU but also makes the DET curves more linear, that is, it generally stabilizes the decision-making process of a system of critical use. It is also planned to investigate the potential of adding the parameters of the human speech source to the list of informative features used in the ASRSCU and to carry out the final optimization of the factor space.