ONLINE FUZZY CLUSTERING OF INCOMPLETE DATA USING CREDIBILISTIC APPROACH AND SIMILARITY MEASURE OF SPECIAL TYPE

Context. In most clustering (classification without a teacher) tasks that arise in real data processing, the initial information is distorted by abnormal outliers (noise) and gaps. "Classical" methods of artificial intelligence, both batch and online, are ineffective in this situation. Objective. The goal of the work is to propose a procedure for credibilistic fuzzy clustering of distorted (noisy and incomplete) data that relies on credibility theory and a similarity measure of a special type. Method. The proposed procedure for fuzzy clustering of incomplete data is based on both robust goal functions of a special type and similarity measures that are insensitive to outliers. It is designed to work both in batch mode and in a recurrent online version intended for Data Stream Mining problems, where data are fed to processing sequentially in real time. Results. The introduced methods are simple in numerical implementation and are free from the drawbacks inherent in traditional methods of probabilistic and possibilistic fuzzy clustering of data distorted by abnormal outliers (noise) and gaps. Conclusions. The conducted experiments have confirmed the effectiveness of the proposed methods of credibilistic fuzzy clustering of distorted data and allow recommending them for practical use in problems of automatic clustering of distorted data. The proposed method is intended for use in hybrid systems of computational intelligence and, above all, in the problems of learning artificial neural networks and neuro-fuzzy systems, as well as in clustering and classification problems.


NOMENCLATURE
$X$ is a data set matrix; $\tilde{X}$ is a distorted data set matrix; $X_F$ is a data set matrix that contains all components; $X_G$ is a data set matrix that contains the components of observation vectors that are absent in $X$; $Cl$ is a cluster; $D$ is a Euclidean distance; $D_P$ is a partial distance; $E$ is a goal function; $w$ is a centroid of a cluster; $\eta(k)$ is a learning step parameter; $Cr_q(k)$ is a fuzzy credibilistic membership level; $\sigma$ is a Cauchy distribution parameter; $\lambda$ is a Lagrange indefinite multiplier; $\beta$ is a fuzzifier.

INTRODUCTION
The problem of clustering (classification without a teacher) is an integral part of the general problem of Data Mining [1], for the solution of which many approaches, methods, and algorithms have been developed. Within this task, a special place is occupied by the problem of fuzzy clustering [2][3][4], which considers the situation when the classes being formed overlap, i.e. each observation can simultaneously belong to several or all classes. Within this subtask, two main approaches have been formed to date: the probabilistic one [2], where the probability of belonging to each of the possible classes is estimated for each observation, and the possibilistic one [5], where the possibility (not the probability) of belonging to some of the classes is estimated. Both approaches are associated with solving an optimization (nonlinear programming) problem for the adopted goal functions and, in the general case, can lead to different final results. Despite their rather serious mathematical basis, these approaches suffer from a number of significant drawbacks. The probabilistic approach is very sensitive to "abnormal" observations, which are practically "blurred", receiving the same levels of membership in all clusters.
The possibilistic approach, in turn, is associated with the so-called coincidence problem, when some clusters merge together, which generally does not allow splitting the processed sample into homogeneous groups (clusters).
Both of these approaches process data in batch mode, i.e. it is assumed that the entire array of observations is given a priori and does not change during the analysis. If the data are fed online (the Data Stream Mining task), the classical probabilistic and possibilistic algorithms of fuzzy clustering become unworkable. In this situation, sequential algorithms based on gradient optimization of the adopted goal functions come to the fore. Such online procedures have been developed within the framework of both the probabilistic [6][7][8][9] and the possibilistic [8,10] approaches and have proven their efficiency.
In clustering problems related to the processing of real data, the initial information is, as a rule, distorted by abnormal outliers (noise) and gaps. The number of these outliers and "holes" can be commensurate with the volume of "clean" data, and a situation is even possible when all data are "dirty". It is clear that "classical" methods (both batch and online) are ineffective in this situation.
To combat anomalous outliers in fuzzy clustering problems, robust methods have been proposed. They are based on both robust goal functions of a special type and similarity measures insensitive to outliers, and are designed to operate in both batch [11,12] and online [8,13] modes.
As for gaps in the observations, a number of techniques have also been developed within the probabilistic and possibilistic approaches, both batch [14,15] and online [16]. Finally, in [17], a robust credibilistic procedure for fuzzy clustering of data distorted by both outliers and gaps, based on a similarity measure of a special type, was introduced.
The object of study is fuzzy clustering of data distorted by both outliers and gaps.
The subject of study is a procedure for fuzzy clustering of data distorted by both outliers and gaps, based on a similarity measure of a special type.
The purpose of the work is to introduce a robust credibilistic procedure for fuzzy clustering of distorted data.

PROBLEM STATEMENT
The initial information for solving the problem of fuzzy clustering using any of the known approaches is a sample of observations $X = \{x(1), x(2), \ldots, x(k), \ldots, x(N)\}$, which in the self-learning mode has to be divided into mutually overlapping classes (clusters). In the process of solving the problem, the fuzzy membership level $U_q(k)$ of each observation $x(k)$ of the sample to each of the possible clusters has to be determined.
It is also usually assumed that the original data are preprocessed (normalized and centered).

REVIEW OF THE LITERATURE
As an alternative to probabilistic and possibilistic procedures, a credibilistic fuzzy clustering approach was introduced in [18,19]. It is based on credibility theory [20] and is largely free of the drawbacks of the known methods.
The most common approach within the framework of probabilistic fuzzy clustering is associated with minimizing the goal function (1) under the constraints (2) [3]. Solving this nonlinear programming problem by the method of Lagrange indefinite multipliers leads to the well-known result (3), which for $\beta = 2$ coincides with the popular Fuzzy C-Means (FCM) method of J. Bezdek [2].
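The batch scheme described above can be illustrated with a minimal sketch of one FCM iteration. It uses the standard FCM update rules (memberships proportional to $D^{2/(1-\beta)}$, centroids as membership-weighted means); the function name `fcm_step`, the small constant `eps` and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def sq_dist(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def fcm_step(data, centroids, beta=2.0, eps=1e-12):
    """One batch FCM iteration: update memberships, then centroids."""
    power = 1.0 / (1.0 - beta)  # exponent 1/(1 - beta), negative for beta > 1
    memberships = []
    for x in data:
        # eps (an assumption) avoids division by zero when x hits a centroid
        weights = [(sq_dist(x, w) + eps) ** power for w in centroids]
        total = sum(weights)
        memberships.append([v / total for v in weights])  # each row sums to 1

    dim = len(data[0])
    new_centroids = []
    for q in range(len(centroids)):
        coeffs = [u[q] ** beta for u in memberships]
        norm = sum(coeffs)
        new_centroids.append(
            [sum(c * x[i] for c, x in zip(coeffs, data)) / norm
             for i in range(dim)]
        )
    return memberships, new_centroids


# Toy data: two well-separated groups of two points each (illustrative only).
data = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
centroids = [[0.2, 0.2], [0.8, 0.8]]
for _ in range(20):
    memberships, centroids = fcm_step(data, centroids)
```

Each row of `memberships` sums to one, which is exactly the rigid probabilistic constraint that the credibilistic approach relaxes.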
If the data are fed to processing sequentially, solving the nonlinear programming problem (1), (2) with the Arrow-Hurwicz-Uzawa algorithm leads to the online procedure (4) [8]. The goal function of credibilistic fuzzy clustering (5) has a form [18,19] close to (1), but with constraints (6) that are "softer" than (2). It should be noted that the goal functions (1) and (5) are similar, whereas the constraints (6), unlike (2), impose no rigid probabilistic requirement on the sum of the memberships.
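A recurrent procedure of this kind can be sketched as follows. This is only an illustration of a gradient-type online FCM step, assuming the common form in which the memberships of each incoming observation are recomputed and every centroid is moved towards the observation with a step $\eta(k)$ weighted by $U_q^{\beta}$. The exact online procedure of [8] may differ in details, and the function name `online_fcm_update` and the step schedule $\eta(k) = 1/k$ are hypothetical choices.

```python
def sq_dist(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def online_fcm_update(x, centroids, eta, beta=2.0, eps=1e-12):
    """Process one observation: return its memberships and updated centroids."""
    power = 1.0 / (1.0 - beta)
    # eps (an assumption) avoids division by zero at zero distance
    weights = [(sq_dist(x, w) + eps) ** power for w in centroids]
    total = sum(weights)
    u = [v / total for v in weights]
    # Gradient-type step: each centroid moves towards x in proportion
    # to the learning step and to the membership raised to the power beta.
    new_centroids = [
        [wi + eta * (uq ** beta) * (xi - wi) for xi, wi in zip(x, w)]
        for uq, w in zip(u, centroids)
    ]
    return u, new_centroids


# Illustrative stream: the same four toy points repeated 25 times.
stream = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]] * 25
centroids = [[0.3, 0.3], [0.7, 0.7]]
for k, x in enumerate(stream, start=1):
    u, centroids = online_fcm_update(x, centroids, eta=1.0 / k)
```

Only the current observation and the current centroids are kept in memory, which is what makes this kind of update suitable for data streams.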
In the procedures of credibilistic clustering, there is also the concept of fuzzy membership, which is calculated using a neighborhood function (7) of the form given in [21], monotonically decreasing on the interval $[0, \infty]$. Such a function is essentially an empirical similarity measure [22], related to the distance by relation (8). Note also that it was shown earlier in [16] that the first relation of (3) for $\beta = 2$ can be rewritten in the form (9), (10), which is a generalization of the function (8) and satisfies all the conditions for (7). In batch form, the algorithm of credibilistic fuzzy clustering can be written in the accepted notation as (11) [18,19], and in the online mode, taking into account (9) and (10), as (12) [23]. From the point of view of computational implementation, algorithm (12) is no more complicated than procedure (4) and, in the general case, is its generalization to the credibilistic approach to fuzzy clustering.
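How credibilistic membership levels can be obtained for a single observation is sketched below. The sketch assumes a Cauchy-type neighborhood function $1/(1 + D^2)$, normalization by the largest value, and the standard credibility-theory definition of credibility as the average of possibility and necessity. These choices follow the general scheme of credibilistic clustering rather than reproducing (11) verbatim, and the function name `credibilistic_memberships` is hypothetical.

```python
def sq_dist(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def credibilistic_memberships(x, centroids):
    """Credibility-style memberships of x (requires at least two centroids)."""
    # Step 1: similarity from a Cauchy-type function of the distance
    # (assumed form; it decreases monotonically from 1 at zero distance).
    raw = [1.0 / (1.0 + sq_dist(x, w)) for w in centroids]
    # Step 2: normalize so that the closest cluster gets the value 1.
    top = max(raw)
    pos = [r / top for r in raw]
    # Step 3: credibility = 0.5 * (pos_q + 1 - max over the other clusters).
    cred = []
    for q, p in enumerate(pos):
        best_other = max(v for l, v in enumerate(pos) if l != q)
        cred.append(0.5 * (p + 1.0 - best_other))
    return cred


# Illustrative observation that coincides with the first centroid.
cr = credibilistic_memberships(
    [0.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
)
```

For this toy input the memberships are approximately 0.75, 0.25 and 0.1. They do not sum to one, which illustrates the absence of a rigid probabilistic constraint, while the largest membership is not less than 0.5.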

In the situation when the data array $X = \{x(1), x(2), \ldots, x(k), \ldots, x(N)\}$ contains gaps (missing observations), the approach considered above cannot be used and requires significant modification. Thus, in [14], a modification of the FCM procedure based on the partial distance strategy was proposed. Within the framework of this strategy, three subarrays of data are introduced into consideration (including $X_F$ and $X_G$ listed in the Nomenclature). Next, the partial distance $D_P$ is introduced in the form (13), and instead of (1) the goal function (14) is used. In recurrent online form, (15) can be rewritten as (16) [24,25]. Similarly, using the partial distance strategy, a batch procedure of credibilistic fuzzy clustering (17) and its online version (18) can be introduced. It is easy to see that algorithm (18) is a generalization of procedure (12), which corresponds to the case of processing data not distorted by gaps.
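The partial distance strategy can be illustrated with a short sketch. It assumes the form commonly used for FCM with incomplete data: squared differences are summed over the observed components only and rescaled by the ratio of the total number of components to the number of observed ones. Whether (13) uses exactly this scaling cannot be verified from the text here; encoding a gap as NaN and the function name `partial_sq_distance` are assumptions as well.

```python
import math


def partial_sq_distance(x, w):
    """Squared partial distance between observation x and centroid w.

    Missing components of x are encoded as float('nan') (an assumption).
    """
    present = [(xi, wi) for xi, wi in zip(x, w) if not math.isnan(xi)]
    if not present:
        raise ValueError("observation has no observed components")
    # Rescale so that observations with gaps stay comparable to complete ones.
    scale = len(x) / len(present)
    return scale * sum((xi - wi) ** 2 for xi, wi in present)


nan = float("nan")
w = [0.0, 0.0, 0.0]
d_full = partial_sq_distance([1.0, 2.0, 3.0], w)  # all components observed
d_gap = partial_sq_distance([1.0, nan, 3.0], w)   # second component missing
```

With complete data the scale factor equals one and the partial distance reduces to the ordinary squared Euclidean distance, so a procedure built on it covers the gap-free case as a special case.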

EXPERIMENTS
To test the developed methods and to compare them with other, better-known approaches, the research was conducted using well-known test data sets of the UCI repository: Wine, Gas, Glass and Iris. These data sets are described in Table 1.
Each data set is characterized by its number of attributes, number of observations, number of clusters and data source. To assess the quality of data clustering, we used the Silhouette index, the Calinski-Harabasz index and the Davies-Bouldin index. The results of clustering the Iris data set are shown in Table 3.

RESULTS
To estimate the quality of the proposed method, we compared the overall accuracy over 100 experiments for different data sets and two clustering algorithms: the fuzzy c-means method (FCM) and credibilistic fuzzy clustering (CFC).
The credibilistic fuzzy clustering algorithm works not only with complete data, but also with data that contain missing values. To conduct the experimental studies, we artificially introduced 10 missing values into the Iris data set. Figure 1 demonstrates credibilistic fuzzy clustering (CFC) of the Iris data set with 10 missing values.
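One way to introduce a fixed number of gaps into a data set is sketched below. The paper does not state how the positions of the missing values were chosen, so uniform random placement, NaN encoding, the fixed seed and the function name `inject_missing` are all assumptions. The 150 x 4 array is a placeholder with the same shape as Iris, not the actual data.

```python
import random


def inject_missing(data, n_missing, seed=0):
    """Return a copy of data with n_missing random entries replaced by NaN."""
    rows, cols = len(data), len(data[0])
    if n_missing > rows * cols:
        raise ValueError("more gaps requested than there are entries")
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    all_cells = [(r, c) for r in range(rows) for c in range(cols)]
    chosen = rng.sample(all_cells, n_missing)  # distinct positions
    corrupted = [list(row) for row in data]    # the original stays untouched
    for r, c in chosen:
        corrupted[r][c] = float("nan")
    return corrupted


# Placeholder array with the shape of Iris (150 observations, 4 attributes).
toy = [[float(r + c) for c in range(4)] for r in range(150)]
corrupted = inject_missing(toy, n_missing=10)
n_gaps = sum(1 for row in corrupted for v in row if v != v)  # NaN != NaN
```

Random placement can, in principle, remove every component of one observation; such a vector carries no information and would have to be excluded before clustering.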

DISCUSSION
The results of clustering the data sets are shown in Table 2. As the table shows, the proposed credibilistic fuzzy clustering algorithm gives good results.
A comparative analysis was performed against previously proposed methods for clustering data with missing values, namely adaptive probabilistic fuzzy clustering of data with missing values and adaptive possibilistic fuzzy clustering of data with missing values, as well as against the classical FCM and K-means algorithms.
The silhouette index shows how the average distance from an object to the other objects of its own cluster differs from the average distance to the objects of other clusters. Its value lies in the range [-1, 1]. Values close to -1 correspond to "bad" (disparate) clusterings. Values close to zero indicate that the clusters intersect and overlap. Values close to 1 correspond to "dense", clearly separated clusters. Thus, the larger the silhouette, the more distinct the clusters, which then form compact, densely grouped clouds of points. As can be seen from the silhouette index, the data recovery method works quite well. The higher the value of the Calinski-Harabasz index, the better the solution. For the Davies-Bouldin index, values close to zero indicate the best partition. As can be seen, for almost all data sets with missing values the partition is "good", so the method worked well.
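The behaviour of the silhouette index described above can be checked with a small sketch. It implements the standard definition for a crisp labelling, $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the objects of the own cluster and $b$ is the smallest mean distance to another cluster. For fuzzy results it assumes that memberships are first hardened by assigning each observation to its largest-membership cluster, which the paper does not state; the function name `silhouette` and the toy data are illustrative.

```python
import math


def silhouette(data, labels):
    """Mean silhouette coefficient of a crisp labelling (>= 2 clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, x in enumerate(data):
        # Mean distance from x to every cluster, excluding x itself.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(x, y) for j, y in enumerate(data)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]
        b = min(v for c, v in mean_dist.items() if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two compact groups far apart (illustrative data).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette(points, [0, 1, 0, 1])   # labels mix the two groups
```

Here `good` is close to 1 and `bad` is negative, in line with the interpretation of the index given above.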

CONCLUSIONS
The conducted experiments have confirmed the effectiveness of the proposed methods of credibilistic fuzzy clustering of distorted data and allow recommending them for practical use in problems of automatic clustering of distorted data. The proposed method is intended for use in hybrid systems of computational intelligence and, above all, in the problems of learning artificial neural networks and neuro-fuzzy systems, as well as in clustering and classification problems.
The scientific novelty of the obtained results is that a method of credibilistic fuzzy clustering of distorted data based on the partial distance strategy is proposed, which shows good results in comparative analysis with other methods that work with distorted data sets.
The practical significance of the obtained results is that the properties of the proposed methods of credibilistic fuzzy clustering of distorted data have been analyzed. The experimental results allow recommending the proposed methods for practical use in problems of automatic clustering of distorted data.
Prospects for further research are the study of the proposed methods of credibilistic fuzzy clustering of distorted data on a broad class of practical problems.