THE INSTANCE INDIVIDUAL INFORMATIVITY EVALUATION FOR THE SAMPLING IN NEURAL NETWORK MODEL SYNTHESIS

The problem of developing mathematical support to automate sampling in the construction of diagnostic and recognition models by precedents is solved. The object of study is the process of building diagnostic and recognition neural network models by precedents. The subject of study is the sampling methods used in building neural network models by precedents. The purpose of the work is to increase the speed and quality of forming training subsamples for building neural network models by precedents. A method of training sample selection is proposed that, for a given initial sample of precedents and a given partition of the feature space, determines weights characterizing term and feature usefulness. Based on these weight values, it characterizes the individual absolute and relative informativity of instances relative to the centers and boundaries of feature intervals. This makes it possible to automate sample analysis and its division into subsamples and, as a consequence, to reduce the dimensionality of the training data, which in turn reduces training time while providing acceptable accuracy of neural model training. Software implementing the proposed indicators has been developed, and experiments studying their properties have been conducted. The experimental results allow the proposed indicators to be recommended for use in practice and make it possible to determine effective conditions for their application.


NOMENCLATURE
K_jk is the number of classes whose instances hit the k-th interval of the j-th feature values; N is the number of features characterizing the original sample; N′ is the number of features in a subsample; opt is the optimal (desired or acceptable) value of the functional f() for the problem being solved; S is the number of instances in the original sample; S′ is the number of instances in a subsample; S_jk is the number of instances in the k-th term of the j-th feature; t_tr. is the training time of the neural network model; w is the set of controlled (adjusted) parameters of the neural network model; x_j^s is the value of the j-th input feature x_j characterizing the instance x^s; y^s is the output feature value associated with the instance x^s; y^s* is the calculated output feature value for the s-th instance at the neural model output; x^s is the s-th instance of the sample.

INTRODUCTION
To automate decision making in problems of technical and medical diagnosis, as well as in pattern recognition problems, it is necessary to have a model of the dependence of the decision on the descriptive features characterizing an instance to be recognized (an observation of the state of an object or process at a certain time). As a rule, due to the lack or inadequacy of expert knowledge, in practice such a model is constructed on the basis of observations, or precedents (instances).
One of the most popular and powerful tools for model building by precedents is artificial neural and neuro-fuzzy networks [1], which can learn from precedents, generalizing them and extracting knowledge from the data.
The object of study is the process of diagnostic and recognizing neural network model building by precedents.
The process of neural model building is typically time-consuming and highly iterative. This is because the training time and accuracy of the neural network model depend essentially on the dimensionality and quality of the training sample used. Therefore, to improve the speed and quality of neural model construction, it is necessary to reduce the dimensionality of the sample while preserving its basic properties.
The subject of study is the sampling methods for neural network model building by precedents.
The known sampling methods are highly iterative and slow, and are also characterized by uncertainty in the quality criteria of the formed subsample.
The purpose of the work is to increase the speed and quality of the formation process of selected training samples for neural network model building by precedents.
For a given sample of precedents <x, y>, the problem of neural model synthesis can be presented as the problem of finding <F(), w>: y^s* = F(w, x^s), f(F(), w, <x, y>) → opt, where the model structure F() is usually specified by the user in practice, and the set of controlled parameters w is adjusted on the basis of the training sample.
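As a toy illustration of this formulation (using, purely as an assumption for the sketch, a linear structure F() and a sum-of-squared-errors functional f; the paper itself uses neural network structures):

```python
# Toy illustration of the synthesis problem <F(), w>: the structure F() is
# fixed by the user (here, as an assumption, a linear model), and the
# controlled parameters w are adjusted so that the quality functional f
# over the precedents <x, y> reaches its optimum (here a minimum).
import numpy as np

def F(w, x):                       # fixed model structure chosen by the user
    return w[0] + w[1] * x

def f(w, X, Y):                    # quality functional to be optimized
    return sum((F(w, x) - y) ** 2 for x, y in zip(X, Y))

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])          # generated by y = 1 + 2x

# closed-form least-squares adjustment of w for this particular structure
A = np.vstack([np.ones_like(X), X]).T
w, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

For a neural network structure F(), the same adjustment of w is performed iteratively by a training method instead of in closed form.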

REVIEW OF THE LITERATURE
The sampling methods for building decision-making models by precedents are divided in [2, 3] into prototype selection methods and prototype construction methods. Here, a prototype means the subsample selected from the original sample.
The prototype selection methods [4-15] do not modify instances, but only select the most important ones from the original sample. Depending on the solution-forming strategy, these methods are divided into incremental methods [4, 5] (they successively add instances from the original sample to the subsample) and decremental methods [4-8] (they successively remove instances from the original sample, obtaining a subsample as a result). Also distinguished are noise filtering methods [6, 8, 9-11] (they remove instances whose class labels disagree with those of most of their neighbors), condensation methods [4-7, 12, 13] (these methods add instances from the original sample to the formed subsample if they bring new information, but do not add them if they have the same class labels as their neighbors), and methods based on stochastic search [12, 14, 15] (they randomly form a subsample from the original sample, considering a set of candidate solutions and selecting the best of them). The common disadvantages of these methods are their high iterativity and long search time, as well as uncertainty in selecting quality criteria for the formed subsample.
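A minimal sketch of a condensation-style selector in the spirit of the description above (an instance is added only if it brings new information); the names and the Euclidean metric are illustrative assumptions, not the paper's method:

```python
# Condensation sketch: an instance joins the subsample only if the current
# subsample's 1-NN rule misclassifies it, i.e. only if it carries new
# information about the class boundaries.

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def nn_label(x, store):
    # class label of the nearest stored instance
    return min(store, key=lambda inst: euclidean(x, inst[0]))[1]

def condense(X, y):
    """Return indices of a condensed subsample of (X, y)."""
    keep = [0]                        # seed with the first instance
    store = [(X[0], y[0])]
    changed = True
    while changed:                    # repeat until a full stable pass
        changed = False
        for i in range(1, len(X)):
            if i in keep:
                continue
            if nn_label(X[i], store) != y[i]:   # misclassified -> informative
                keep.append(i)
                store.append((X[i], y[i]))
                changed = True
    return sorted(keep)
```

The repeated passes illustrate the iterativity that the review names as the common disadvantage of this method family.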
The prototype construction methods [12, 15-23] build, on the basis of the original sample, artificial instances that allow it to be described. Among these methods one can distinguish cluster-analysis-based methods [18, 19, 23] (they replace the original sample by the centers of its clusters), data squashing methods [17] (they replace the original sample instances by artificial prototypes with weights obtained on their basis) and neural network methods [16, 20-22] (they train a neural network on the original sample, which is then used to extract cluster centers as the instances of the formed subsample). The common disadvantages of these methods are their high iterativity, long operating time and uncertainty in setting the initial parameters. The cluster-analysis-based methods additionally suffer from uncertainty in choosing the number of clusters, the initial parameters and the metric for the clustering and training methods. The data squashing methods form prototypes that are difficult to interpret. The neural network methods have such disadvantages as the difficulty of extracting prototypes from the neural network model, the lack of a guarantee that training will yield an acceptable neural network model, the variability of the neural network model, entailing nonstationarity of the constructed prototypes, the orientation toward a specific model, and uncertainty in setting the initial parameters of the model and training methods.
Additionally, combined methods are distinguished [3]. They combine the selection and construction of prototypes and have the same disadvantages as both prototype selection methods and prototype construction methods.
Since the prototype construction methods and the hybrid methods related to them are slower than the prototype selection methods, it is advisable to choose the latter as the basis for solving the sampling problem.
To eliminate the disadvantages of these methods, it is advisable to form a sample without iterative enumeration of instance combinations, by selecting a certain percentage of instances from the original sample. This will significantly reduce the time. Here we also need to define indicators that evaluate the individual informativity of instances with regard to their position relative to the interclass boundaries and the centers of pseudo-clusters. This makes it possible to generate a non-random sample and to estimate and guarantee the high quality of the selected subsamples.

MATERIALS AND METHODS
Let us break the feature space into rectangular regions, limiting the range of values of each feature by its minimum and maximum values. The projections of this partition onto the feature axes then allocate intervals of values of each feature for each rectangular block. The intervals can be formed as cluster projections, as a regular grid, or on the basis of class boundaries in the one-dimensional projections of the sample onto the feature axes [24].
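The regular-grid variant of this partition can be sketched as follows; the function and variable names are illustrative assumptions:

```python
# Regular-grid partition of the feature space: each feature axis is split
# into n_int equal intervals between its minimum and maximum over the
# sample, and every value is mapped to its interval (term) index.
def grid_term_indices(X, n_int):
    """X: list of instances (lists of feature values). Returns, per instance,
    the interval index (0 .. n_int-1) along each feature axis."""
    n_feat = len(X[0])
    mins = [min(x[j] for x in X) for j in range(n_feat)]
    maxs = [max(x[j] for x in X) for j in range(n_feat)]
    out = []
    for x in X:
        idx = []
        for j in range(n_feat):
            width = (maxs[j] - mins[j]) / n_int or 1.0  # guard constant feature
            k = int((x[j] - mins[j]) / width)
            idx.append(min(k, n_int - 1))               # clamp the maximum value
        out.append(idx)
    return out
```

The irregular, class-boundary-based partition of [24] would differ only in how the interval edges are chosen; the mapping of values to terms stays the same.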
Each such interval can then be considered as a term, and its importance for deciding whether an instance belongs to a cluster can be evaluated: the weight of the k-th term of the j-th feature of the s-th instance x^s based on the description of the corresponding interval center is given by formula (1), and the weight of the k-th term of the j-th feature of the s-th instance x^s relative to the description of the intercluster boundaries is determined by formula (2). The overall significance of the k-th term of the j-th feature of the s-th instance x^s relative to the description of the intercluster boundaries can then be estimated using a combined weight. Knowing the term significances, we can define the feature informativity evaluations by formula (3) or by formula (4). It is also possible to use an individual evaluation of feature informativity in the range [0, 1] defined by the indicators of [24].

Based on the evaluations of term and feature significance, we can determine informativity evaluations for each s-th sample instance by formula (5) or by formula (6). The suggested indicators (5) and (6) evaluate the individual informativity of the instance x^s relative to the initial sample in the range [0, 1]. The greater the value of the corresponding indicator, the more valuable the instance, and vice versa.

If necessary, the estimates (5) and (6) can be further normalized so that they give not an absolute but a relative value of instance significance in the sample (7). In this case, the instance with the maximum individual informativity receives an evaluation equal to one, and the instance with the minimum informativity receives an evaluation equal to zero. Applying (7) can be useful when it is necessary to simplify the choice of the threshold for separating the sample by the corresponding informativity indicator.

The proposed indicators of individual instance significance can be used to form a subsample from a given original sample by one of the following methods: 1) form a training subsample from those instances of the original sample whose normalized individual informativity evaluations (7) are greater than some specified threshold; 2) form a training subsample from not more than S′ = δS instances of the original sample with the greatest individual informativity evaluation values; 3) form a training subsample from not more than S′/K instances of each class of the original sample with the greatest individual informativity evaluation values.
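A minimal sketch of the normalization and the selection methods above, assuming the individual informativity scores I[s] have already been computed by (5) or (6) (their formulas are not reproduced here); all names are illustrative, and method 3 is shown with a per-class share rather than the fixed count S′/K:

```python
# Normalization in the spirit of (7): min-max scaling so that the most
# informative instance scores 1 and the least informative scores 0.
def normalize(I):
    lo, hi = min(I), max(I)
    return [(v - lo) / (hi - lo) for v in I] if hi > lo else [0.0] * len(I)

def select_by_threshold(I, threshold):            # method 1
    In = normalize(I)
    return [s for s, v in enumerate(In) if v > threshold]

def select_top_share(I, delta):                   # method 2: top delta*S overall
    n = int(delta * len(I))
    return sorted(sorted(range(len(I)), key=lambda s: -I[s])[:n])

def select_top_share_per_class(I, y, delta):      # method 3: top share per class
    chosen = []
    for c in set(y):
        members = [s for s in range(len(y)) if y[s] == c]
        n = int(delta * len(members))
        chosen += sorted(members, key=lambda s: -I[s])[:n]
    return sorted(chosen)
```

Method 1 leaves the subsample size implicit (it depends on the threshold), while methods 2 and 3 fix it in advance, which is why the latter are easier to control in practice.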

EXPERIMENTS
A computer program implementing the proposed method, which extends the «Automated system of neural network and neuro-fuzzy model synthesis for non-destructive diagnosis and pattern classification by features» (certificate of copyright registration № 35431 of 21.10.2010), was developed to conduct the experiments.
The developed software was studied by solving the Fisher Iris classification problem [25]. The initial data sample contains 150 instances characterized by four input features. The output feature determines whether an instance belongs to one of three classes.
On the basis of the original sample, the instance informativity evaluations were obtained, and subsets of instances were selected as training samples by the second and third methods.
To study the second method, the 25%, 50%, 75% and 100% (for control) of instances with the greatest individual significance values were selected from the whole original sample and included in the training set. To study the third method, the 25%, 50%, 75% and 100% (for control) of instances with the greatest individual significance values in each class were selected from the original sample and included in the training set.
Then, for each sample, a model based on a two-layer feed-forward neural network was built and trained using the Levenberg-Marquardt method [1]. The number of network inputs was determined by N, the number of features in the corresponding problem. The number of neurons in the second (output) layer corresponds to the number of classes K. The number of neurons in the hidden (first) layer was set to 2K. All neurons of the network used the weighted sum as the postsynaptic function and the logistic sigmoid as the transfer function. The training method parameters were set as follows: the learning rate was 0.01, the allowable number of iterations (epochs) was 1000, and the target value of the error function was 10^-6.
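The topology described above can be sketched as follows; weight initialization and the Levenberg-Marquardt training itself are omitted, and all names are illustrative assumptions:

```python
# Two-layer feed-forward network with N inputs, 2K logistic-sigmoid neurons
# in the hidden layer and K in the output layer, each neuron using a
# weighted sum as its postsynaptic function.
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

class TwoLayerNet:
    def __init__(self, n_features, n_classes, rng=np.random.default_rng(0)):
        h = 2 * n_classes                          # hidden layer size = 2K
        self.W1 = rng.normal(size=(h, n_features))
        self.b1 = np.zeros(h)
        self.W2 = rng.normal(size=(n_classes, h))
        self.b2 = np.zeros(n_classes)

    def forward(self, x):
        hidden = logistic(self.W1 @ x + self.b1)   # weighted sum + sigmoid
        return logistic(self.W2 @ hidden + self.b2)

net = TwoLayerNet(n_features=4, n_classes=3)       # Iris: N = 4, K = 3
out = net.forward(np.ones(4))
```

For the Iris problem this gives a 4-6-3 topology; the predicted class is the output neuron with the largest activation.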
After completion of the neural model training process, its final characteristics were recorded: the training time t_tr. and the number of training iterations spent, ep_tr.. After training, each model was tested separately on the training sample and on the whole original sample, and for each of these the error, E_tr. and E_all respectively, was determined. Here each error is the number of instances of the corresponding sample for which the calculated value did not match the actual value of the output feature.

A fourth possible method of subsample formation is to use a stochastic search based on evolutionary or multi-agent methods, selecting the combination of instances that is best in some sense and using information about the individual informativity of instances in the search operators to accelerate the search and focus it on the most promising solutions. However, the first method does not explicitly determine the number of instances that will fall into the formed sample, and the fourth method is iterative and requires the specification and use of quality indicators, the calculation of which can also be time-consuming. Therefore, the second and third methods are the simplest to apply in practice and relatively cheap from a computational point of view, and it is appropriate to examine them together with the proposed indicators.
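The error measure used in these experiments, the count of instances whose calculated output does not match the actual one, can be sketched as (the function name is an illustrative assumption):

```python
# E is simply the number of instances whose predicted class label does not
# match the actual value of the output feature.
def count_errors(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)
```

Applied to the training sample this gives E_tr., and applied to the whole original sample it gives E_all.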

RESULTS
A fragment of the results of the conducted experiments is presented in Table 1. The following notation is used to code the sampling method: G is a regular grid partition, N is an irregular partition based on class boundaries in the one-dimensional projections of the sample onto the feature axes, K is instance selection in each class separately, and A is instance selection in the whole sample. The calculated instance informativity indicators are encoded as follows: the first digit codes the method of calculating I (1 means formula (5), 2 means formula (6)); the second digit codes the method of calculating w_j (1 means formula (3), 2 means formula (4)). For each of the experimentally obtained indicators, the volume of the formed training sample as a percentage of the original sample volume is also listed. The markers «min», «average» and «max» designate, respectively, the minimum, average and maximum values.
Table 1 shows that using the proposed method of determining instance significance makes it possible in practice to select from the original sample a subsample of smaller volume that is sufficient to construct neural network models with the required accuracy, reducing the time needed to build the models.
Fig. 1 graphically illustrates the placement of the instances of the original and formed samples in the space of the first two features (sepal length in cm on the abscissa and sepal width in cm on the ordinate). The markers «.», «x» and «+» denote the instances of the different classes of the initial sample, and the marker «o» indicates the instances selected for the training set.
It can be seen from Fig. 1 that the proposed method selects the most significant instances of the original sample. The obtained results depend essentially on the subsample formation method, the feature space partitioning method and the method of evaluating individual instance informativity.

DISCUSSION
As is evident from Table 1, as the number of examples in the formed sample increases, the accuracy increases (the errors on the formed training sample and on the original sample decrease), while the training time and the number of training iterations increase, and vice versa. At the same time, a significant reduction of the sample volume to 25% of the original leads to a deterioration of the training process characteristics (the time and number of iterations increase) and also to a decrease in accuracy. This can be explained by the fact that instances critical for describing the class separation may not be included in a sample of small volume.
Even a small reduction of the original sample volume by 25% (to 75% of the original volume) yielded acceptable accuracy and reduced the training time by more than 1.7 times. Reducing the original sample volume by half afforded a speed gain of 2.3 times. This confirms the expediency of applying the proposed mathematical support in neural network model building by precedents.

A method of instance selection in which the subsample is extracted considering instance significance in the whole original sample (Fig. 1a, 1b, 1f) leads to the selection of less informative instances than selection considering instance significance in each class separately (Fig. 1c, 1d, 1e). This is because the frequencies of the instances of each class may differ, and when selecting instances without regard to class labels it is possible to miss locally important instances. Another cause may be that instances describing the external borders of classes, but unimportant for separating adjacent classes, can be individually recognized as significant if their class membership is ignored.

We should also note that the method of calculating the individual instance informativity indicators affects the formed sample not only quantitatively but also qualitatively. It has been established that the indicators I11, defined by formulas (5) and (1), and I12, defined by formulas (5) and (2), mostly lead to similar results, which differ significantly from the results for the indicators I21, defined by formulas (6) and (1), and I22, defined by formulas (6) and (2). At the same time, the indicators I21 and I22 are more robust to the instance selection method, while the indicators I11 and I12 are most effective when instances are selected using their importance in each class separately.

The significant influence of the feature space partitioning method on the results of significance evaluation and instance selection in the conducted experiments can be explained by the fact that the method of irregular partitioning with allocation of class intervals on each feature axis [24] usually yields a better partition than the regular grid method. However, reducing the interval width, and correspondingly increasing the number of intervals on each feature axis, can improve the results of the latter method. The selection of the optimal interval width is a separate problem that should be solved taking into account the complexity characteristics of the particular application.

The closest analogue of the proposed method for determining instance informativity is the set of indicators proposed in [26]. In contrast to those proposed in this paper, the indicators of [26] separately characterize the ability of an instance to be informative relative to the external and internal borders, as well as to the class centers, which is an advantage in data visualization and analysis problems. However, their disadvantages are low speed, due to the need to calculate distances between instances, as well as the necessity, and ambiguity, of integrating the indicators into comprehensive measures of instance informativity.

The advantage of the indicators proposed in this paper is that there is no need to calculate the distances between instances; the disadvantage is the need to partition the feature space. However, this disadvantage can be seen as an advantage in the case of large samples: if a computationally simple partition is used (for example, a regular grid) and the minimum and maximum values of each feature are known, then the computational cost of the proposed indicators will be less than that of the set of indicators in [26].

The practical significance of the obtained results is that software implementing the proposed indicators has been developed and experiments studying their properties have been conducted. The experimental results allow the proposed indicators to be recommended for practical use and make it possible to determine effective conditions for their application.
Having defined the term significances for each s-th instance, we can also determine the term weights for the whole sample.

Table 1 - A fragment of the experimental results on model building by the formed samples