NEUROINFORMATICS AND INTELLIGENT SYSTEMS

ESTIMATION OF THE STABILITY OF AN INDUCTIVE MODEL OF OBJECT CLUSTERING BASED ON THE K-MEANS ALGORITHM FOR DIFFERENT LEVELS OF DATA NOISE

An inductive model of objective clustering of objects based on the k-means algorithm is presented in the paper. An algorithm for dividing the initial dataset into two equal-power subsets is proposed and practically implemented. The difference between the mass centres of corresponding clusters in the two clusterings is proposed as an external balance criterion. The proposed model was validated using the "Compound" and "Aggregation" datasets from the database of the School of Computing of the University of Eastern Finland. The stability of the model with respect to a noise component was investigated using the "Seeds" dataset. The k-means, fuzzy c-means, inductive k-means and agglomerative hierarchical algorithms were used to compare the results of the experiment. Based on the simulation results, ways of further improving the proposed model in order to increase the objectivity of clustering of the investigated data were defined.


NOMENCLATURE
GMDH is the Group Method of Data Handling; n is the number of observed objects; m is the number of attributes that characterize the objects; k is the number of clusters; x_ij is the value of the feature in column j of row i; x'_ij is the normalized value of the feature in column j of row i; med_j is the median of column j; q is the number of clusters in clusterings Q and R respectively.
INTRODUCTION
Nowadays, great attention is devoted to the clustering of complex objects under conditions of various levels of data noise. First of all, this is connected with increasing requirements for the accuracy of detection and identification systems operating under various conditions of information acquisition. Many clustering algorithms exist. Each of them has its advantages and disadvantages and is focused on a specific type of data. A high degree of subjectivity is one of the key disadvantages of existing algorithms: high-quality clustering on one dataset does not guarantee the same results on another, similar dataset. Clustering objectivity can be improved by using inductive methods of complex systems modelling based on the Group Method of Data Handling [1][2][3], where the data are processed as two equal-power subsets and the final decision concerning the partition of the objects into clusters is made on the basis of the complex use of external relevance criteria and internal criteria of clustering quality. Thereby, the development of hybrid models and methods of object clustering based on inductive methods of complex systems modelling is a relevant problem both fundamentally and practically.

PROBLEM STATEMENT
The initial dataset of objects is a matrix X = {x_ij}, i = 1, ..., n, j = 1, ..., m, where each of the n rows corresponds to an object and each of the m columns to an attribute. The aim of clustering is a partition of the objects into non-empty, pairwise non-intersecting clusters in accordance with a criterion of remoteness between object and cluster, taking into account the properties of the objects. Three fundamental principles, taken from different scientific fields, form the basis of the methodology of inductive modelling of complex systems [1][2][3][4][5][6]:
- the principle of heuristic self-organization, i.e., enumeration of a set of models and selection of the best model on the basis of an external balance criterion;
- the principle of external addition, i.e., the necessity of using additional information for objective verification of the models;
- the principle of inconclusive decisions, i.e., generation of a certain set of intermediate results in order to select the best variant.
The implementation of these principles within the inductive model of objective clustering assumes the following steps:
- normalization of the features of the investigated objects, i.e., their reduction to an identical range with the same median of the object attributes;
- division of the initial dataset into two equal-power subsets;
- definition of an external criterion, or a group of relevance criteria, for choosing the optimal clustering for the two equal-power subsets;
- choice or development of the basic clustering algorithm used as a component of the inductive model of objective clustering.

REVIEW OF THE LITERATURE
The basic concepts of the creation of an inductive method of object clustering on the basis of the Group Method of Data Handling are described in [2][3][4]. Further development of this theory is reflected in [5, 6]. The concept of objective cluster analysis is presented in [4] and has been further developed in [7][8][9]. The authors define the basic principles of creating an inductive model of objective clustering, show the ways and perspectives of its implementation, and describe the advantages of the inductive clustering model in comparison with traditional data clustering methods. Theoretical developments for the implementation of biclustering methods in systems of inductive modelling of complex processes are presented in [10]. However, it should be noted that, in spite of the successful results achieved in this area, an objective clustering model based on the analysis of clustering systems has no practical realization at the present time.
The unsolved parts of the general problem are the absence of effective algorithms for dividing the initial dataset into two equal-power subsets and of an integrated criterion-based approach for evaluating clustering efficiency during their successive enumeration.
The aim of the paper is the development of an inductive model of objective clustering of objects based on the k-means clustering algorithm and the evaluation of the stability of the algorithm's operation using data with different noise levels.

MATERIALS AND METHODS
According to the above concept of inductive modelling of complex systems, the first step of data processing is data normalization. Data normalization was carried out for all columns by formula (1):

x'_ij = (x_ij − med_j) / max_i |x_ij − med_j|.   (1)

The choice of this normalization method is determined by the fact that, as a result, the features in all columns have the same median, with the feature variation range bounded by −1 and 1; at the same time, the amount of data in each column that falls into the interquartile range (50%) differs insignificantly.
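As an illustration, the normalization with the properties stated above (identical column medians, variation range within [−1, 1]) can be sketched as follows. The paper's simulation used R; this Python sketch assumes median-centring with maximum-absolute-deviation scaling, which matches the stated properties but is a reconstruction, not the authors' exact code.

```python
import numpy as np

def normalize(X):
    """Median-centre each column and scale by its maximum absolute
    deviation, so every column has median 0 and values in [-1, 1].
    (A reconstruction consistent with the properties stated for
    formula (1); not the authors' original implementation.)"""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)          # med_j for each column j
    centred = X - med                   # x_ij - med_j
    scale = np.max(np.abs(centred), axis=0)
    scale[scale == 0] = 1.0             # guard against constant columns
    return centred / scale
```

After this transformation every column median is exactly zero, so the subsets produced at the next step are compared on a common scale.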
The algorithm for dividing the original set of objects Ω into two equal-power non-intersecting subsets Ω_A and Ω_B consists of the following steps [4, 9]:
1. Calculation of the pairwise distances between the objects in the original data sample.
2. Allocation of the pair of objects X_s, X_p for which the distance is minimal.
3. Distribution of object X_s into subset Ω_A and object X_p into subset Ω_B.
4. Repetition of steps 2 and 3 for the remaining objects. If the number of objects is odd, the last object is distributed into both subsets.
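The splitting procedure above can be sketched as follows; this is a minimal illustration (function and variable names are ours, and the greedy pairwise search is quadratic), not the authors' implementation.

```python
import numpy as np

def split_into_subsets(X):
    """Divide objects into two equal-power subsets A and B:
    repeatedly take the closest remaining pair of objects and
    send one member to A and the other to B; if the number of
    objects is odd, the last object goes into both subsets."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    subset_a, subset_b = [], []
    while len(remaining) >= 2:
        best = None
        # step 1-2: find the closest pair among remaining objects
        for i in range(len(remaining)):
            for j in range(i + 1, len(remaining)):
                d = np.linalg.norm(X[remaining[i]] - X[remaining[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # step 3: one object of the pair to A, the other to B
        subset_a.append(remaining[i])
        subset_b.append(remaining[j])
        remaining.pop(j)   # pop larger index first
        remaining.pop(i)
    if remaining:
        # odd number of objects: last object goes to both subsets
        subset_a.append(remaining[0])
        subset_b.append(remaining[0])
    return subset_a, subset_b
```

Because members of each nearest pair are split between the subsets, Ω_A and Ω_B end up with a similar structure in the feature space, which is exactly what the external balance criterion relies on.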
The approach outlined in [7] was taken as the basis for calculating the external balance criterion. The optimality criterion of regulated clustering was defined as the minimum of the sum of squared deviations between the mass centres of corresponding clusters in the two clusterings (2):

QC = Σ_{k=1..q} ||C_k^A − C_k^B||² → min.   (2)

The mass centre of cluster k in clustering Q was determined as the average of the attribute vectors of the objects in this cluster (3):

C_k = (1 / N_k) · Σ_{X_i ∈ k} X_i,   (3)

where N_k is the number of objects in cluster k. The absolute value of this criterion can be written for an m-dimensional feature space as follows (4):

QC = Σ_{k=1..q} Σ_{j=1..m} (C_kj^A − C_kj^B)².   (4)

In the case of criterion normalization, formula (4) takes the form (5):

QC_norm = Σ_{k=1..q} Σ_{j=1..m} (C_kj^A − C_kj^B)² / Σ_{k=1..q} Σ_{j=1..m} ((C_kj^A)² + (C_kj^B)²).   (5)

To create equal conditions for the subsets Ω_A and Ω_B when using the k-means clustering algorithm, the same values are assigned to the centres of corresponding clusters in the two clusterings at the initialization phase; the initial value of criterion (4) is therefore zero. The experiment has shown that on subsequent iterations the criterion value increases at the first step and then varies monotonically until it reaches saturation, which corresponds to a sustainable clustering for the two equal-power subsets. The relative change of criterion (4) on two successive iterations vanishes in this case. Thereby, the external balance criterion can be represented by the conditions (6), (7):

|QC(t) − QC(t−1)| ≤ ε1,   (6)
|QC(t) − QC(t−1)| / QC(t) ≤ ε2,   (7)

where t is the iteration number and ε1, ε2 are small thresholds. The scheme of the inductive cluster analysis model based on the k-means algorithm is shown in Fig. 1. The implementation of this algorithm assumes the following steps:
Step 1. Formation of the initial set Ω of objects. Data preprocessing (filtration and normalization). Presentation of the data as an n × m matrix.
Step 2. Division of the set Ω into two equal-power subsets Ω_A and Ω_B in accordance with the algorithm described above, so that Ω_A ∪ Ω_B = Ω, Ω_A ∩ Ω_B = ∅, |Ω_A| = |Ω_B|.
Step 3. Setup of the clustering procedure using the k-means algorithm: choice of the number of clusters and setting of the initial cluster centres.
Step 4. Sequential calculation of the Euclidean distances from the objects to the cluster centres for the two clusterings, and distribution of the objects into clusters in accordance with the minimal-distance condition.
Step 5. Calculation of the new cluster centres by formula (3).
Step 6. Calculation of the external balance criteria by formulas (6) and (7).
Step 7. Fixation of the obtained clustering when conditions (6) and (7) are true; otherwise, if the current number of iterations is less than the maximum, go to Step 4.
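The external balance criterion and its stopping rule can be sketched as follows; the function names and the threshold parameter are our assumptions, and the criterion is written as the sum of squared deviations between corresponding cluster centres described in the text.

```python
import numpy as np

def balance_criterion(centers_a, centers_b):
    """Sum of squared deviations between the mass centres of
    corresponding clusters in the two clusterings (the absolute
    form of the criterion, reconstructed from the text)."""
    centers_a = np.asarray(centers_a, dtype=float)
    centers_b = np.asarray(centers_b, dtype=float)
    return float(np.sum((centers_a - centers_b) ** 2))

def is_saturated(qc_prev, qc_cur, eps=1e-6):
    """Stopping rule: the relative change of the criterion on two
    successive iterations vanishes (threshold eps is assumed)."""
    if qc_cur == 0.0:
        return qc_prev == 0.0
    return abs(qc_cur - qc_prev) / qc_cur < eps
```

With identical initial centres for both subsets the criterion starts at zero, grows as the two k-means runs diverge, and the iteration stops once `is_saturated` reports that the relative change has levelled off.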

EXPERIMENTS
Validation of the proposed model was carried out using the "Aggregation" and "Compound" datasets from the database of the School of Computing of the University of Eastern Finland [11]. Estimation of the stability of the algorithm with respect to a noise component was carried out using the "Seeds" data [12], representing measurements of kernels of three kinds of wheat. Each kernel is characterized by seven attributes, and each group includes 70 observations; thus, the initial data matrix has the size 210 × 7. Data normalization was carried out by formula (1). Then "white noise", the amplitude of which varied from 2.5% to 50% of the maximum of the data scatter, was added to the data. The quality of the algorithm's operation was evaluated by counting the number of correctly grouped objects. In order to compare the results, the problem was also solved using the classical k-means algorithm, the fuzzy c-means algorithm and the agglomerative hierarchical algorithm. The simulation was performed in the R software environment.
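The noise-injection step of the stability experiment can be sketched as follows. Treating the "maximum of data scattering" as the per-feature range is an assumption, as is the use of Gaussian noise; the original experiment was performed in R, so this Python fragment is only illustrative.

```python
import numpy as np

def add_white_noise(X, level, seed=None):
    """Add zero-mean Gaussian noise whose amplitude is a given
    fraction of the data scatter (e.g. level = 0.025 .. 0.5, i.e.
    2.5% .. 50%).  Scatter is taken per feature as max - min,
    which is an assumption about the original protocol."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    scatter = X.max(axis=0) - X.min(axis=0)       # per-feature range
    noise = rng.normal(0.0, 1.0, X.shape) * level * scatter
    return X + noise
```

Sweeping `level` over the stated 2.5%–50% grid and re-clustering the perturbed data at each level reproduces the shape of the stability experiment described above.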

RESULTS
The results of the operation of the algorithm for dividing the initial dataset into two equal-power subsets are shown in Fig. 2. Fig. 3 shows the dependence of the external balance criteria, calculated by formulas (6) and (7), on the number of iterations for the investigated data; 4 clusters are assigned for the "Compound" data and 7 for the "Aggregation" data. The results of the division of the studied objects into clusters are presented in Fig. 4. Fig. 5 shows the boxplots of the unnormalized and normalized data. The charts of incorrectly distributed objects depending on the level of the noise component for the different clustering algorithms are shown in Fig. 6.

DISCUSSION
The analysis of Fig. 2 leads to the conclusion that the algorithm operates with high efficiency. The obtained subsets have a similar structure, with a lower density of object distribution in the feature space. Fig. 3 shows that the relative change of the external balance criterion reaches zero at 4 and 11 iterations for the "Aggregation" and "Compound" data respectively.
Therefore, the corresponding clusterings are optimal at these levels in terms of the criteria applied. As can be seen from Fig. 4, the objects of the "Compound" data were divided into clusters adequately. The low percentage of incorrectly assigned data can be explained by the nature of the distribution; nevertheless, the surface separating the clusters is rather distinct. The same conclusion can be drawn from the analysis of the "Aggregation" data. In this case some intersection of clusters can be observed; however, the algorithm has divided the objects into clusters by surface fitting in the feature space. The analysis of Fig. 6 leads to the conclusion that the inductive clustering model based on the k-means algorithm gives better results of dividing objects into clusters than the classical k-means algorithm and the fuzzy c-means algorithm. The inductive clustering algorithm is more stable than the k-means and fuzzy c-means algorithms for noise levels up to 10%. With a further increase of the noise level up to 40%, the number of incorrectly distributed objects varies in either direction; beyond that, the number of falsely identified objects increases monotonically. However, it should be noted that in this case the agglomerative hierarchical clustering algorithm showed the highest operating efficiency and the best stability to noise. This can be explained by the nature of the algorithm: during clustering, the profiles of objects or the centres of clusters are compared using the Euclidean distance, and the presence of noise has no significant effect on the result of the profile comparison when the object profiles in different clusters differ significantly. Moreover, an advantage of this algorithm is its independence of the initial choice of cluster centres, because the number of clusters in the initial state equals the number of objects studied.
In this case the choice of the optimal clustering is the main problem, because the analysis of the dendrogram does not allow one to draw a conclusion about the clustering quality at the chosen level. Therefore, the creation of a hybrid inductive model of objective clustering based on the agglomerative hierarchical clustering algorithm is reasonable.

CONCLUSION
A hybrid model of object clustering based on the methods of inductive modelling of complex systems and the k-means clustering algorithm is presented in the article. The methodology of inductive modelling for choosing the optimal clustering during model operation through the implementation of an objective criterion-based approach has been further developed. The "Compound" and "Aggregation" datasets from the database of the School of Computing of the University of Eastern Finland and the "Seeds" data, representing measurements of kernels of three kinds of wheat, were used as experimental data. The algorithm for dividing the initial dataset into two equal-power subsets, which are then used in the inductive data clustering model, has been further developed and practically implemented. The implementation of the proposed model was carried out in the R software environment. The simulation results showed the high efficiency of the proposed model: the algorithm distributed the objects into the corresponding clusters adequately, and no intersection of clusters was observed in the case of optimal distribution. To estimate the stability of the model, the simulation was performed using the "Seeds" data with noise levels varying from 2.5% to 50% of the maximum data variation. Clustering using the proposed inductive clustering model, the classical k-means algorithm, the fuzzy c-means algorithm and the agglomerative hierarchical clustering algorithm was carried out to compare the results of the experiment. The results of the simulation showed better quality and stability of the inductive k-means algorithm in comparison with the classical k-means and fuzzy c-means algorithms. However, the agglomerative hierarchical clustering algorithm showed the best results in terms of clustering quality and stability to noise.
Therefore, the creation of a hybrid inductive model of objective clustering based on the agglomerative hierarchical clustering algorithm is a promising direction of the authors' further research.