ARCHITECTURE AND TRAINING ALGORITHM FOR NEURAL NETWORK TO RECOGNIZE VOICE SIGNALS

Context. Interaction between a user and a mobile device is typically realized through touch. However, there are many situations in which such interaction is awkward or impossible. For example, some diseases of the musculoskeletal system impair fine motor control, making it impossible to use a device efficiently. In such cases, the task of finding alternative ways of person-device interaction becomes relevant, and the development of a voice interface is one of the most promising directions. Objective. The goal of the study is to design a neural network architecture and internal components for voice-controlled systems. The resulting interface has to be adapted for processing and recognizing Ukrainian speech. Method. An approach based on analyzing the audio signal by its waveform and spectrogram is used to make the data received via a microphone suitable for processing. A neural network makes it possible to classify sounds by the generated audio signal and information about its transcription. The neural network structure is fully adapted to the peculiarities of Ukrainian phonetics: it takes into account the shape of the sound wave generated when a sound is pronounced, as well as the number of sounds in Ukrainian phonetics. Results. Experiments were carried out to choose the optimal neural network architecture and training sample size. The root-mean-square error of the neural network was used as the main criterion for assessing its effectiveness. A comparative analysis of the proposed neural network and speech recognition tools existing on the market showed an improvement in the relative recognition rate of 9.26%. Conclusions. The results obtained in this research can be used to implement a full-featured voice interface. Although the work focuses on recognizing Ukrainian speech, the proposed ideas can be used in developing transcription services for other languages.

NOMENCLATURE A(t) is the dependence of the sound signal amplitude on time over a continuous time period; A'(t_k) are discrete values of the audio signal amplitude; e_i is the experimental error of neural network learning at the i-th iteration; E is the permissible error of neural network learning; f_j is a function converting the sound wave characteristics at a definite time moment into a sound reference that generates a sound wave with the corresponding characteristics; h is a neuron of the hidden layer of the neural network; H is the set of hidden layer neurons; H(t) is the dependence of the sound signal frequency on time over a continuous time period; H'(t_k) are discrete values of the sound wave frequency; i is a neuron of the input layer of the neural network; I is the set of input layer neurons; j is the index of a word, sound, or speech fragment in the predefined alphabet R; L is the size of the training sample; n is the size of the output alphabet; N is the number of neurons in the neural network; N_I is the number of neurons in the input layer of the neural network; N_H is the number of neurons in the hidden layer of the neural network; N_O is the number of neurons in the output layer of the neural network; o is a neuron of the output layer of the neural network; O is the set of output layer neurons; P is the training sample; p_j is the probability of matching the sound wave to an element r_j; R is the alphabet of sounds and words; r_j is a single, predefined sound, word, or speech fragment; t is time; t_k is discrete time; T is the test sample.

INTRODUCTION
Voice recognition methods in conjunction with speech-to-text technologies are a very important tool for creating voice interfaces. A quality voice interface can act as a full-fledged alternative to a traditional interactive one for hands-free systems. Such services are especially useful for users suffering from musculoskeletal disorders. Due to impaired movement coordination, it is difficult and often almost impossible for them to interact with the device by touching its screen. Another use for a voice interface arises in situations when the user's hands are busy with another action.
The object of research is the process of speech-to-text conversion.
The subject of research is the application of neural networks in the development of transcription services adapted to the pronunciation features of Ukrainian and able to recognize separate speech fragments.
The purpose of the paper is to develop a method for recognizing voice commands and to implement, based on it, a service that provides a voice interface for devices running Android. Neural networks are supposed to be the main mathematical tool. This service will allow full use of a phone, tablet, etc. without the necessity of touching the screen. The service will be represented as an additional add-in that can recognize voice commands in natural human language and translate them into device control commands.
To achieve the goal, it is necessary to solve a number of tasks: 1) to analyze the main characteristics of the sound waves generated when sounds and words are pronounced in Ukrainian; 2) to consider existing approaches to recognizing sounds by the nature of their sound waves from the viewpoint of the possibility of adapting them to recognizing voice commands in Ukrainian; 3) to develop the architecture of the neural network and to carry out its training; 4) to test the proposed neural network and to compare its efficiency in recognizing voice commands in Ukrainian with the efficiency of existing voice transcription tools.

PROBLEM STATEMENT
An alphabet of sounds and words R = {r_1, r_2, ..., r_n} is given; each of its elements r_j can correspond to a separate sound, word, or speech fragment.
As a sound signal arrives, a sound wave is generated. According to [1-3], the dependences of the sound wave amplitude and frequency on time are the main characteristics of a sound wave that uniquely identify the nature of the sound. This makes it possible to identify sounds based on sound wave data and/or a spectrogram for a certain time period (1), (2), as well as to perform the inverse transformation of the digital data of the sound wave into an analog sound:

p_j = f_1(A'(t_k)),   (1)

p_j = f_2(A'(t_k), H'(t_k)).   (2)

The functions f_1 and f_2 determine the probability of correspondence between the signal that generated the investigated sound wave and each sound from the alphabet R. The first one uses only the wave amplitude during a certain short time period; the second one has an extra parameter, which allows additional information about the signal frequency to be used. The sound classification process is implemented by artificial neural networks.
The final decision on whether the sound signal corresponds to some element r_j is made based on the maximum of the corresponding probabilities p_j (3):

j* = arg max_j p_j.   (3)

When designing the set R, it is appropriate to include one element corresponding to an empty result, i.e., to the situation when no sound from the set has been identified. The restriction p_max − p_max−1 >> 0 in (3), where p_max−1 is the second largest probability, allows avoiding false positives when the found probability maximum is not large enough to assign the sound wave to the corresponding class.
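As an illustration, the decision rule (3) with the margin restriction can be sketched as follows (the function name and the margin value 0.2 are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def classify(probabilities, margin=0.2):
    """Decision rule (3): pick the alphabet element with the highest
    matching probability p_j.  Returns None (the 'empty' element of R)
    when the gap between the two largest probabilities is below
    `margin`, guarding against false positives.  The margin value is
    an illustrative assumption."""
    p = np.asarray(probabilities, dtype=float)
    order = np.argsort(p)[::-1]               # indices by descending p_j
    if p[order[0]] - p[order[1]] < margin:    # maximum not distinct enough
        return None
    return int(order[0])
```

A distinct winner such as `[0.1, 0.7, 0.2]` yields index 1, while a near-tie such as `[0.4, 0.35, 0.25]` is rejected as the empty result.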

REVIEW OF THE LITERATURE
The idea of developing a service capable of recognizing voice commands and speech in a natural language is not essentially new. Developments in speech recognition and voice control have been ongoing for over 20 years.
The Dragon Dictate Naturally Speaking system [4] was one of the first software systems capable of performing human speech recognition. In some cases, its recognition accuracy reached 95%. However, high recognition rates were achieved only for speech in English at a certain pronunciation rate. In 1997, an attempt was made to adapt this system to recognizing speech in Russian; thus, the Gorynych system was developed [5]. The Gorynych system supported the possibility of dictating text and controlling some Windows functions by voice. Meanwhile, its speech recognition quality rarely exceeded 30%, which is not an acceptable result. No attempt to adapt this system to speech recognition in Ukrainian has been made.
This direction of research became especially popular when mobile devices running the Android and iOS operating systems appeared. Nowadays, the following voice assistants are the most popular: Siri (Apple) [6], Alexa (Amazon) [7, 8], Google Voice Assistant [8], etc. However, they are focused on voice input of text messages, which are afterwards usually sent via standard messengers, and of keywords used for searching information on the Internet. These tools provide satisfactory results within their declared capabilities. Meanwhile, they are sensitive to pronunciation quality as well as to speech timbre. An essential disadvantage of these systems is the necessity of a permanent Internet connection, caused by voice processing on the server side of the application. In addition, in this case, whether the confidentiality of the transmitted information will be retained, and the possible further ways of using it, remain unclear. Another essential disadvantage of these services is the limited set of available languages. For example, none of the above-mentioned voice assistants is able to recognize voice commands in Ukrainian. In addition, these systems are only voice assistants; none of them is a fully-featured voice interface.
The technologies for converting speech into a command or text used in developing voice assistants can also be used for creating full-fledged voice interfaces. Moreover, developers provide complete APIs for this: Google Speech API, YandexToolkit API, etc. In addition, there are specialized platforms whose main task is to convert voice to text, for example, PocketSphinx. Development of an original service based on these technologies would provide a full-fledged voice interface for devices managed by Android. However, it would not fix the remaining disadvantages of existing voice recognition services. Therefore, it is useful to develop an original application whose core is focused on the linguistic features of Ukrainian.
As a result of reviewing literature sources devoted to the problem of converting speech into text [9-12], a number of tasks that have to be solved sequentially were identified, as well as the results that have to be obtained in the course of solving each of them (Fig. 1).
At this stage, the task of recognizing voice signals and converting them to text is of the most interest.
There are three main approaches to implementing speech recognition algorithms: hidden Markov models, dynamic programming, and artificial neural networks.
The approach based on hidden Markov models [13] needs long-term system debugging on large sets of test samples. This approach is quite simple from the implementation viewpoint. In addition, enlarging the set of recognizable words has only a small effect on the increase in computational complexity. However, it does not guarantee a highly accurate result, because it is not always possible to estimate the error value reliably.

The approach based on dynamic programming [14] presupposes comparing two speech segments and determining a difference indicator between them. A pattern known in advance is used as the first segment, and the identifiable one as the second. Dynamic programming allows performing the optimization and determining the template that most accurately matches the recognized one. This approach gives good results at low time and computational cost for small data samples, on the condition that the recognizable pattern matches one element of the set. However, even a slight increase in the test data sample or in the output variants leads to significant complication of the calculation model.
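A minimal sketch of the dynamic-programming comparison of two speech segments (classical dynamic time warping over one-dimensional frame features; the implementation details are assumptions for illustration, not the paper's):

```python
def dtw_distance(template, sample):
    """Dynamic time warping: the minimal cumulative distance between
    two sequences of frame features, allowing local time stretching,
    so segments pronounced at different rates can still be matched."""
    n, m = len(template), len(sample)
    INF = float("inf")
    # d[i][j] = best cumulative cost aligning template[:i] with sample[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - sample[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch template
                                 d[i][j - 1],      # stretch sample
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]
```

The recognized template is the one with the smallest `dtw_distance` to the input; a stretched repetition of a value, e.g. `[1, 2, 3]` vs. `[1, 2, 2, 3]`, costs nothing.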
The most powerful tool for solving the speech recognition problem is artificial neural networks [15]. This approach provides recognition not only of individual words and sounds, but of continuous speech. Neural networks are of the most interest for developing speaker-independent speech recognition systems. Nevertheless, because of the complexity of determining the neural network structure and of its proper training, using neural networks would be recommended only if the two previous approaches proved ineffective.
Taking into account the specifics of the project being developed (the necessity of quick adaptation to each person's pronunciation and to the possible minor diction disturbances characteristic of disabled users), the approach based on neural networks is the most promising.

MATERIALS AND METHODS
The sound signal received from a microphone is a sound wave with continuously changing frequency H(t) and amplitude A(t). The amplitude determines the sound volume, and the frequency determines its tone. At the same time, digital sound processing and recognition can be performed only on discrete data sets A'(t_k), H'(t_k). Thus, for further full-fledged signal processing, it is necessary to perform the transformations (4), (5):

A(t) → A'(t_k),   (4)

H(t) → H'(t_k).   (5)

The procedure of replacing continuous dependences of the sound characteristics with discrete ones is called sampling; the sampling frequency determines the quality of the resulting discrete signal. Usually the sampling rate is in the range of 8-48 kHz.
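The sampling transformations (4), (5) can be sketched as follows (the function name and the 16 kHz rate are illustrative assumptions; 16 kHz merely sits inside the 8-48 kHz range quoted above):

```python
import numpy as np

def sample_wave(amplitude_fn, duration_s, rate_hz=16000):
    """Discretize a continuous amplitude curve A(t) into A'(t_k) at a
    fixed sampling rate; the frequency curve H(t) is treated the same
    way in (5)."""
    n = int(round(duration_s * rate_hz))   # number of discrete instants
    t_k = np.arange(n) / rate_hz           # t_k = k / rate
    return t_k, amplitude_fn(t_k)

# Example: a 440 Hz tone sampled for 10 ms at 16 kHz yields 160 samples.
t_k, a_k = sample_wave(lambda t: np.sin(2 * np.pi * 440.0 * t), 0.01)
```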
As patterns, discretized sound signals were used, obtained by pronouncing all kinds of sounds and their typical combinations characteristic of Ukrainian phonetics (Table 1). Fig. 2 shows typical fragments of the sound waves generated while some sounds of Ukrainian phonetics are pronounced.
The shapes of the sound waves, as well as the spectrograms, make it possible to obtain information about the signal amplitude. To solve this task, it is assumed to use a three-layer neural network, which contains an input layer, one hidden layer, and an output layer. The number of neurons in the input layer is denoted N_I, in the output layer N_O, and in the hidden layer N_H. Denote by i, o, h the elements of the input, output, and hidden layers, respectively. The number of all neurons in the neural network is labeled N, and the limit permissible learning error value E.
Speech recognition will be performed based on the sound waveform data. The sound wave frequency at each discrete time instant is provided to the input of the neural network.
Implementing this approach directly would require input data vectors with a length of 1000 or more elements. The algorithms used for training the corresponding neural network would be demanding of computing resources, and the resulting neural network would not always be able to provide the required accuracy.
To optimize the neural network structure, the assumption was made that it is appropriate to take into account only the extreme values of the sound wave amplitudes A(t) and the time moments t at which they are detected (Fig. 3). Afterwards, information about the sound wave amplitudes between the extreme values can be obtained by linear approximation.
Consequently, pairs (t, A(t)) will be supplied to the inputs of the neural network, and their number will be reduced to 20-30.
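The extrema-only compression and the linear restoration described above can be sketched as follows (function names are illustrative assumptions):

```python
import numpy as np

def extreme_points(t, a):
    """Keep only the local extrema of the sampled amplitude A'(t_k)
    (plus both endpoints), shrinking the input vector as proposed
    above."""
    keep = [0]
    for k in range(1, len(a) - 1):
        if (a[k] - a[k - 1]) * (a[k + 1] - a[k]) < 0:  # slope changes sign
            keep.append(k)
    keep.append(len(a) - 1)
    return t[keep], a[keep]

def restore(t_query, t_ext, a_ext):
    """Recover intermediate amplitudes by linear approximation."""
    return np.interp(t_query, t_ext, a_ext)
```

For the toy wave 0, 1, 0, −1, 0 sampled at t = 0..4, only the peaks at t = 1 and t = 3 (and the endpoints) survive, and the dropped sample at t = 2 is restored exactly by the linear approximation.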
The number of output layer neurons corresponds to the number of sounds that have to be recognized. Ukrainian phonetics involves 38 different sounds. Thus, the number of neurons in the neural network output layer needed to recognize every individual sound of Ukrainian phonetics is 38. A linear function is chosen as the activation function of the output layer. The initial number of hidden layer neurons (6) can be defined as the average of the numbers of input and output layer neurons:

N_H = (N_I + N_O) / 2.   (6)

Afterwards, a learning error e_i is calculated. Further optimization of the neural network structure is performed by reducing (or increasing) the number of hidden layer neurons and constructing a learning curve. The solution considered optimal is the one that provides a learning error closest to the acceptable neural network learning error.
The hyperbolic tangent is chosen as the activation function of the hidden layer neurons.
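A forward pass of the described three-layer network (tanh hidden layer, linear output layer of 38 neurons) can be sketched as follows; the input width of 50 (25 flattened (t, A(t)) pairs) and the random untrained weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 25 (t, A(t)) pairs flattened -> 50 inputs (the paper states 20-30 pairs);
# 38 outputs, one per sound of Ukrainian phonetics.
N_I, N_H, N_O = 50, 29, 38

W_ih = rng.normal(0.0, 0.1, (N_H, N_I))  # input -> hidden weights
W_ho = rng.normal(0.0, 0.1, (N_O, N_H))  # hidden -> output weights

def forward(x):
    """Hidden layer uses the hyperbolic tangent; output layer is linear."""
    h = np.tanh(W_ih @ x)   # hidden activations
    return W_ho @ h         # linear output: one score per sound
```

Feeding the zero vector yields a zero 38-element output, since tanh(0) = 0 and the output layer is linear with no bias in this sketch.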
The created neural network was trained by the error backpropagation method. Each element in the training sample P_L contains a vector whose dimension corresponds to the number of neural network inputs, and a single integer value, which determines the corresponding output of the neural network. Each element of the training sample vector is represented as a pair of values (A(t), t).
For the three-layer neural network, the training sample size L is determined by relationship (7):

L = k × N, k ∈ [2, 10].   (7)

Thus, the training sample P_L is formalized by expression (8):

P_L = {(x_l, y_l)}, l = 1, ..., L,   (8)

where x_l is the input vector of pairs (A(t), t) and y_l is the integer index of the corresponding sound. The final assessment of the developed neural network's efficiency is executed on test samples T, which have not been used in training. The test sample for the neural network proposed in this paper is formalized by expression (9), of the same structure as (8). It should be noted that during the test process, the neural network uses the ready output values only to estimate the error, not to improve the result.
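Reading relationship (7) as a multiple of the total neuron count N, with the factor between 2 and 10 as quoted in the RESULTS section (the exact original form of (7) was lost in this copy, so this reading is an assumption), the sample sizing can be sketched as:

```python
def training_sample_size(n_neurons, k):
    """Training sample size per relationship (7), read here as
    L = k * N with 2 <= k <= 10 (an assumption matching the ranges
    quoted in RESULTS)."""
    if not 2 <= k <= 10:
        raise ValueError("k is expected in the range [2, 10]")
    return k * n_neurons
```

For instance, with N = 50 + 29 + 38 = 117 neurons and k = 4, the sample holds 468 labeled sound fragments.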

EXPERIMENTS
Experiments on the developed neural network were carried out using the original application Voicer, which implements the neural network proposed in the paper. The application is adapted to run on any device managed by Android. Android Studio was chosen as the development environment because nowadays it is the most popular tool for developing Android applications. The neural network is implemented with the TensorFlow Mobile library.
The application Voicer has a friendly user interface and allows recognizing the voice command received from the device's microphone without the necessity of delving into the structure and principles of the neural network.
While the experiment was being carried out, various neural network architectures were analyzed, as well as options for the number of samples in the training set. The mean square error (MSE) was adopted as the determining criterion for choosing the neural network structure: the module of the difference between its value and the permissible neural network error has to be minimal. It should be noted that we should not try to minimize the MSE to 0; in that case the effect of overtraining the neural network is possible and, as a consequence, incorrect results will be obtained on test samples.

RESULTS
Preliminary calculations have allowed limiting the number of neurons in the hidden layer (6) to the range from 20 to 38 and the training sample dimension (7) to the range from 2×N to 10×N. However, not every possible architecture allows creating a neural network efficient at solving the assigned task. The mean square error was used as an indicator of the neural network's efficiency. We tested each of the available architectures and estimated its efficiency. The results are assembled in Table 2: its columns contain the dimension of the training sample set, its rows the number of hidden layer neurons, and its cells the mean square error values for the respective neural network and training sample set. The experimental results in Table 2 show that the best quality of recognizing sounds in Ukrainian was obtained when the training sample set dimension was in the range from 4×N to 8×N and the number of hidden layer neurons was in the range from 24 to 31. For illustrative purposes, this area is highlighted in the table, and the data contained in it are presented in a diagram (Fig. 4). Separately, the minimum mean square error values for each number of elements in the training sample set are highlighted.

We examined alternative tools for speech recognition and compared their efficiency with that of the developed one. For the testing process, some commonly known voice assistants, such as Siri, Google Voice Assistant, and Alexa, were used. The initial sample of words was constructed in such a way that it uses all sounds characteristic of pronunciation in Ukrainian. Each word was pronounced 100 times by different voices, with different intonations and timbres. Information about the number of correct recognitions is summarized in Table 3.

DISCUSSION
According to Table 2, we can conclude that there are several cases when the best result is achieved. For example, for a neural network whose number of hidden layer neurons is in the range from 24 to 27, the optimal dimension of the training sample is 4×N. If the number of hidden layer neurons is 28 or 29, the optimal dimension of the training sample set is 6×N; if it is 30 or 31, the optimal dimension is 7×N. In addition, the minimum values of the neural network mean square error, 0.08, 0.15, and 0.24, are highlighted in Table 2.
Further experiments come down to comparing the implemented neural network's efficiency with that of ready-made tools available on the market. It should be noted that no instruments able to recognize Ukrainian phonetics are currently available on the market. However, an attempt was made to adapt tools used for recognizing sounds in Russian to recognition in Ukrainian. Ready-made tools usually recognize not individual sounds but whole words. These words are formed by a sequence of sounds; therefore, the neural network proposed in the paper is able to cope with this task successfully. A spoken word is considered recognized incorrectly if its transcription differs from the sequence of sounds obtained by the neural network.
Comparing the efficiency of the attempts to adapt ready-made speech recognition tools to the phonetic features of Ukrainian with that of the tool proposed in the paper, we can conclude that the latter has advantages. In the considered examples, the relative improvement in recognition quality is 12.3%. This is due to the initial orientation of the developed neural network towards recognizing speech in Ukrainian.

CONCLUSIONS
The obtained result is satisfactory, so the neural network proposed for recognizing sounds in Ukrainian can be used to develop a full-fledged system for voice control on Android devices.
The scientific novelty of the obtained results is that a method for optimizing the data of the sound waves formed by the pronunciation of sounds and their combinations in Ukrainian was proposed for the first time. The proposed method is based on using only the extreme values of the sound wave characteristics and obtaining intermediate data by linear approximation. It reduced the training sample dimension and increased the data processing speed without losing quality of the result.
The practical significance of the obtained results is that the digitized sound waves generated during the pronunciation of separate sounds in Ukrainian can be subjected to further intellectual analysis in order to search for elements of certain device commands and their parameters. This is undoubtedly a very important element of building a full-fledged voice-controlled interface. However, considering this problem is not within the scope of this study.
Prospects for further research are to train the proposed neural network to recognize whole words and collocations in Ukrainian and in other national languages.