IMPLEMENTATION OF DBSCAN CLUSTERING ALGORITHM WITHIN THE FRAMEWORK OF THE OBJECTIVE CLUSTERING INDUCTIVE TECHNOLOGY BASED ON R AND KNIME TOOLS

Context. The problem of data clustering within the framework of the objective clustering inductive technology is considered. A hybrid model is implemented in practice through the combined use of R and KNIME tools. The object of the study is a hybrid data clustering model that combines the DBSCAN clustering algorithm with the objective clustering inductive technology.
Objective. The aim of the work is to create a hybrid model of objective clustering based on the DBSCAN clustering algorithm and to implement it in practice using both R and KNIME tools.
Method. Inductive methods of complex systems modelling are used to determine the optimal parameters of the DBSCAN clustering algorithm within the framework of the objective clustering inductive technology. The practical implementation of this technology involves: the use of two equal power subsets, which contain the same number of pairwise similar objects; calculation of the internal and external clustering quality criteria; and calculation of the complex balance criterion, whose maximum value corresponds to the best clustering in terms of the criteria used. The process has two main stages. First, the optimal values of the EPS parameter are determined at each step within the range of MinPts values, which yields a chart of the complex balance criterion versus the EPS value for each MinPts value. Then the intermediate results are analysed in order to determine the optimal solution, which corresponds both to the maximum value of the complex balance criterion and to the aims of the current clustering task.
Results. The developed hybrid model has been implemented in KNIME using plugins written in R. The efficiency of the model was tested on different data: low-dimensional datasets from the School of Computing of the University of Eastern Finland; Fisher's iris data; and gene expression profiles of patients examined for lung cancer.
Conclusions. The simulation results have shown the high efficiency of the proposed method. The studied objects were distributed into clusters correctly in all cases. The proposed method decreases the reproducibility error, since the optimal parameters of the clustering algorithm are chosen based both on the clustering results obtained on each equal power subset separately and on the difference between the clustering results obtained on the two subsets.

QCE is an external clustering quality criterion; QCB is a complex balance clustering quality criterion; OCIT is an objective clustering inductive technology.

NOMENCLATURE
n is the number of investigated objects; m is the number of features or attributes of the objects; k is the number of clusters; K is the set of clusters; R(K) is the clustering result; e is the clustering error when two equal power subsets are used; EPS is the radius of the epsilon-neighborhood of a point; MinPts is the minimal number of points inside the EPS-neighborhood; $e_0$ is the boundary admissible clustering error; {QC} is the set of internal and external clustering quality criteria; N is the number of studied objects; $N_s$ is the number of objects in cluster s; $X_i^s$ is the i-th object in cluster s; $C_s$ is the mass center of cluster s; $d(\cdot)$ is the metric used to estimate the proximity level of the studied objects; r is the number of internal and external clustering quality criteria; D is the set of points, each of which determines the location of a studied object in the m-dimensional feature space; q is a point inside the EPS-neighborhood; $N_{EPS}(p)$ is the number of points inside the EPS-neighborhood of the point p.
INTRODUCTION
The relevance of the problem is determined by current work on complex data clustering in different fields of scientific research. A large number of clustering algorithms exist nowadays. Each of them has its advantages and disadvantages and is oriented to a specific type of data. The results of data clustering depend on: the affinity metrics between objects, between clusters, and between objects and clusters; the type of clustering algorithm and its operating parameters; and the type of clustering quality criteria used to estimate the character of the object distribution within clusters. However, in spite of the great number of clustering algorithms and the different types of internal and external clustering quality criteria, this problem has no final solution at present. One of the unsolved tasks in this subject area is the reproducibility error: successful clustering results obtained on one dataset are not repeated when another similar dataset is used. This problem can be solved by careful determination of the operating parameters of the clustering algorithm in every case. However, this complicates data processing, since the researcher must in every case determine the optimal parameters of the algorithm in terms of the extrema of the clustering quality criteria used. To solve this problem, we propose to cluster the data on two equal power subsets concurrently and then to calculate both the internal and external clustering quality criteria at each step of the algorithm operation. The final decision concerning the grouping of the studied objects is made on the basis of the maximum value of the complex balance criterion, which contains the internal and external clustering quality criteria as components.
The aim of the work is the practical implementation of the DBSCAN clustering algorithm within the framework of the objective clustering inductive technology based on the combined use of R and KNIME tools.

PROBLEM STATEMENT
The initial dataset is presented as a matrix with n rows (the investigated objects) and m columns (the features of the objects). The clustering process involves a partition of the investigated objects into non-empty, pairwise disjoint clusters. The model of objective clustering based on the inductive methods of complex systems modelling involves sequential enumeration of admissible clusterings in order to select the best variants from them [1]. Within the framework of the objective clustering inductive technology, a strategy S of object grouping is understood as a purposeful process of sequential actions performed to group the objects according to the current task within the admissible error. The clustering error is determined by analysing the values of the complex balance criterion, which contains both the internal and external clustering quality criteria as components.
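A plausible formal reading of the data matrix and of the partition, written with the nomenclature symbols n, m and k, is given below. The symbols $A$, $x_{ij}$ and $K_s$ are assumptions introduced only for this illustration; the exact notation used by the authors may differ.

```latex
A = \{x_{ij}\}, \quad i = 1,\dots,n, \; j = 1,\dots,m

K = \{K_s\}, \; s = 1,\dots,k, \qquad
\bigcup_{s=1}^{k} K_s = A, \qquad
K_s \neq \varnothing, \qquad
K_i \cap K_j = \varnothing \;\; (i \neq j)
```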

REVIEW OF THE LITERATURE
A classification of several clustering algorithms by category is presented in [2]. Each of them has its advantages and disadvantages. The choice of the appropriate clustering algorithm is determined by the type of the investigated data and the goal of the current task. One of the essential disadvantages of the existing clustering algorithms is the reproducibility error. The main idea for solving this problem was proposed in [1]. The authors have shown that the reproducibility error can be decreased by using the inductive methods of complex systems modelling, which are a logical continuation of the group method of data handling [3,4]. The creation of a methodology of inductive systems analysis as a tool for analytical planning of engineering research is considered in [5]. The authors proposed a strategy of analytical project design based on inductive principles. The final decision within the framework of the proposed methodology was made using both the internal and external quality criteria. However, that research is not focused on the clustering of complex high-dimensional data.
The results of research on the development of the objective clustering inductive technology for high-dimensional complex data are presented in [6]. The authors have shown that the implementation of this technology on the basis of a clustering algorithm involves, at the first step, determination of the affinity function between objects, between clusters, and between objects and clusters. Then the studied data should be divided into two equal power subsets, which contain the same number of pairwise similar objects. The internal, external and complex balance clustering quality criteria should be formed at the next step. The optimal clustering is determined based on the extremum value of the criteria used during sequential enumeration of the admissible clusterings. In [7] the authors present the results of a criterial analysis of gene expression profiles within the framework of the objective clustering inductive technology. Implementations of this technology based on the k-means and agglomerative hierarchical clustering algorithms were presented in [8,9]. The authors compared different internal and external clustering quality criteria and evaluated their effectiveness for gene expression profiles. The simulation results have shown that each algorithm is more effective when implemented within the framework of the objective clustering inductive technology than when used in the standard way. However, these algorithms do not divide complex data correctly. This problem can be solved by using modern methods of complex data processing [10,11] within the framework of the objective clustering inductive technology and by implementing this technology on the basis of other clustering algorithms.
In this work we propose the hybrid model of the objective clustering inductive technology based on DBSCAN clustering algorithm. The practical implementation of the proposed model has been performed on the basis of the complex use of both R and KNIME tools.

MATERIALS AND METHODS
Three fundamental principles, borrowed from various scientific fields, form the basis of the methodology of inductive modelling of complex systems. In the case of the OCIT these principles can be presented as follows [1,12]:
1. The principle of sequential enumeration, i.e., sequential enumeration of admissible clusterings within a given range in order to select the best variants according to the clustering quality criteria used;
2. The principle of external addition, i.e., the necessity of using two equal power subsets, which contain the same number of pairwise similar objects;
3. The principle of inconclusive solutions, i.e., generation of several sets of intermediate results in order to select the best variant in terms of the goal of the current task.
Fig. 1 presents the structural block chart of the OCIT. The practical implementation of this technology involves the following stages:
Stage I. Problem statement. Data analysis and preprocessing.
1. Problem statement and formulation of the aim.
2. Data analysis and representation of the data as a matrix, where the number of rows is the number of studied objects and the number of columns is the number of features that characterize the objects.
3. Data preprocessing. This step involves: missing value processing (if necessary); filtering; normalization.
Stage II. Choice of the affinity metrics and formation of the equal power subsets.
4. Choice of the affinity metrics between objects, between clusters, and between objects and clusters.
5. Formation of the two equal power subsets A and B in accordance with the following algorithm:
Step 1. Calculation of the pairwise distances $d(X_i, X_j)$ between the objects.
Step 2. Selection of the pair of objects $X_s$ and $X_p$ the distance between which is minimal: $d(X_s, X_p) = \min_{i \ne j} d(X_i, X_j)$.
Step 3. Assignment of the object $X_s$ to subset A and the object $X_p$ to subset B.
Step 4. Repetition of steps 2 and 3 for the remaining objects. If the number of objects is odd, the last object is assigned to both subsets.
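The subset formation described in steps 1-4 can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R/KNIME implementation: the function name `split_equal_power` is hypothetical, Euclidean distance is assumed as the affinity metric, and ties between equal distances are broken by object index.

```python
import math
from itertools import combinations


def split_equal_power(objects):
    """Split objects into two subsets of pairwise similar objects.

    Returns two lists of object indices (subset A, subset B).
    Assumption: Euclidean distance is used as the affinity metric.
    """
    remaining = set(range(len(objects)))

    # Step 1: all pairwise distances, sorted from the closest pair upwards.
    pairs = sorted(
        (math.dist(objects[i], objects[j]), i, j)
        for i, j in combinations(range(len(objects)), 2)
    )

    subset_a, subset_b = [], []
    # Steps 2-4: walking the sorted list is equivalent to repeatedly
    # taking the closest pair among the objects not yet assigned.
    for _, i, j in pairs:
        if i in remaining and j in remaining:
            subset_a.append(i)
            subset_b.append(j)
            remaining -= {i, j}

    # Odd number of objects: the last one goes into both subsets.
    for i in remaining:
        subset_a.append(i)
        subset_b.append(i)
    return subset_a, subset_b


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 10)]
a, b = split_equal_power(points)
```

Sorting all pairs once costs O(n² log n) time and O(n²) memory, which is acceptable for a sketch but may need a more economical strategy for thousands of gene expression profiles.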
Stage III. Calculation of the internal, external and complex balance clustering quality criteria.
6. Calculation of the internal clustering quality criterion. This criterion evaluates the quality of the object grouping in a single clustering. A good clustering corresponds to both a high density of objects inside the clusters and a low density of the clusters themselves in the feature space. The internal clustering quality criterion should therefore be complex and take into account both the character of the object distribution within clusters and the character of the distribution of the mass centers of the clusters. Within the proposed technology, the first component of this criterion is calculated by formula (1) as the average distance from the objects to the mass centers of the clusters to which they belong. The second component is calculated by formula (2) as the average distance between the mass centers of the clusters in the current clustering. Various combinations of these components in different internal clustering quality criteria for numeric data were considered in [7]. The authors have shown that the Calinski-Harabasz criterion (CH) [13] and the Within-Between index (WB) [14] give better results for high-dimensional gene expression profiles. Based on the simulation results, a complex internal clustering quality criterion was proposed in [15]. This criterion is calculated by formula (3) as a multiplicative combination of the CH criterion and the WB index.
7. Calculation of the external clustering quality criterion. This criterion takes into account the difference between the clustering results obtained on the two equal power subsets. The minimal value of this criterion corresponds to a higher level of clustering objectivity. The value of this criterion is calculated by formula (4) as the normalised difference of the internal clustering quality criteria calculated on the two equal power subsets for the current clustering level.
8. Calculation of the complex balance criterion. This criterion is needed because the extrema of the internal and external clustering quality criteria may disagree. The Harrington desirability function was proposed to calculate the complex balance criterion [16]. The plot of this function is presented in Fig. 2. Determination of the general Harrington desirability index involves the following steps:
Step 1. Transformation of the scales of the internal and external clustering quality criteria into the reaction scale Y, whose values vary within the range from -2 to 5, by formula (5). The parameters a and b are determined empirically for each of the criteria used, taking into account its boundary values, according to equations (6).
Step 2. Calculation of the nondimensional parameter $Y_i$ for each of the clustering quality criteria $QC_i$ by formula (7).
Step 3. Calculation of the partial desirabilities for each of the criteria by formula (8).
The largest value of criterion (9) corresponds to the best clustering in terms of the criteria used.
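The quantities of Stage III can be illustrated with a short Python sketch. Several parts are assumptions rather than the authors' formulas: the external criterion is taken as |a - b| / (a + b), the reaction scale is a linear map Y = a + b·QC fitted to the worst and best values of each criterion, the partial desirability uses the standard Harrington form exp(-exp(-Y)), and the general index is the geometric mean of the partial desirabilities. All function names are hypothetical, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

Y_WORST, Y_BEST = -2.0, 5.0  # bounds of the reaction scale Y given in the text


def centroid(points):
    """Mass center of a list of equally long coordinate tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def within_cluster_distance(clusters):
    """Average distance from every object to the mass center of its cluster."""
    total = sum(
        math.dist(p, centroid(members)) for members in clusters for p in members
    )
    return total / sum(len(members) for members in clusters)


def between_center_distance(clusters):
    """Average pairwise distance between mass centers (needs >= 2 clusters)."""
    centers = [centroid(members) for members in clusters]
    dists = [math.dist(c1, c2) for c1, c2 in combinations(centers, 2)]
    return sum(dists) / len(dists)


def external_criterion(qci_a, qci_b):
    """ASSUMED form of the normalised difference; positive inputs expected."""
    return abs(qci_a - qci_b) / (qci_a + qci_b)


def desirability(qc, qc_worst, qc_best):
    """Partial Harrington desirability of one criterion value.

    ASSUMPTION: Y = a + b * qc with qc_worst -> -2 and qc_best -> 5.
    """
    b = (Y_BEST - Y_WORST) / (qc_best - qc_worst)
    a = Y_WORST - b * qc_worst
    return math.exp(-math.exp(-(a + b * qc)))


def balance_criterion(desirabilities):
    """General desirability index: geometric mean of partial desirabilities."""
    return math.prod(desirabilities) ** (1.0 / len(desirabilities))


clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
qcw = within_cluster_distance(clusters)
qcb = between_center_distance(clusters)
```

Passing `qc_worst` and `qc_best` explicitly lets the same function handle a criterion that is maximised and one that is minimised: for the external criterion, where smaller is better, `qc_best` is simply the smaller boundary value.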
Stage IV. Data clustering on the equal power subsets A and B concurrently.
9. Choice of the clustering algorithm depending on the type of data used and the goal of the research. Setup of its initial parameters and of the ranges and steps of their change.
10. Data clustering on the equal power subsets A and B within the range of the algorithm parameters. Calculation of the internal and external clustering quality criteria at each step of this procedure.
11. Calculation of the complex balance clustering quality criterion within the range of the algorithm parameters.
Stage V. Analysis of the obtained results. Fixation of the optimal clustering.
12. Analysis of the obtained results. Fixation of the best clustering, which corresponds to the maximum value of the complex balance criterion.
13. Comparative analysis of the intermediate solutions. Fixation of the optimal clustering, which corresponds to both the maximum value of the complex balance clustering quality criterion and the goal of the current task.
The DBSCAN clustering algorithm was proposed in 1996 as a solution to the problem of dividing data into clusters of arbitrary shape [17][18][19]. The following definitions form the basis of this algorithm [18]:
Definition 1. The Eps-neighborhood of a point p is defined as $N_{Eps}(p) = \{q \in D \mid d(p, q) \le Eps\}$.
Definition 2. A point q is directly density-reachable from a point p if the following conditions hold: q belongs to the Eps-neighborhood of p, and the Eps-neighborhood of p contains at least MinPts points.
Definition 3. A point q is density-reachable from a point p if there is a chain of points $p_1, \dots, p_n$ with $p_1 = p$ and $p_n = q$ such that each $p_{i+1}$ is directly density-reachable from $p_i$.
Definition 4. A point q is density-connected with a point p if there is a point k such that both q and p are density-reachable from k.
Definition 5. A cluster C is a non-empty subset of the set of points D for which the following conditions hold:
1. $\forall p, q$: if $p \in C$ and q is density-reachable from p, then $q \in C$;
2. $\forall p, q \in C$: q is density-connected with p.
Let $C_1, \dots, C_k$ be the set of allocated clusters. The noise is the set of points of the database D that do not belong to any cluster $C_i$: $noise = \{p \in D \mid \forall i: p \notin C_i\}$.
The result of the DBSCAN clustering algorithm depends on two parameters: EPS and MinPts. To determine the optimal EPS value for a given MinPts, a technique based on the sorted k-dist graph was proposed in [18]. However, this technique does not determine the EPS value exactly; it gives only the range of EPS values for a given MinPts value, which affects the quality of the algorithm operation. To solve this problem, we propose to use the DBSCAN clustering algorithm within the framework of the OCIT. The structural block chart of the algorithm implementing this process is presented in Fig. 3. The implementation of this algorithm involves the following steps:
Step 1. Formation of the initial data as a matrix, where the number of rows is the number of studied objects and the number of columns is the number of features that characterize the objects.
Step 2. Determination of the affinity functions depending on the type of the studied data. Division of the initial data into two equal power subsets.
Step 3. Formation of the internal, external and complex balance clustering quality criteria.
Step 4. Setup of the DBSCAN clustering algorithm. Determination of the range of the MinPts value. Creation of the sorted k-dist graphs within this range. Determination of both the range and the step of the EPS value.
Step 5. Setup of the initial value of the MinPts parameter of the algorithm (k = min(MinPts)).
Step 6. Setup of the initial value of the EPS parameter of the algorithm (e = min(EPS)).
Step 7. Data clustering on the two equal power subsets concurrently. Formation of the clusters.
Step 8. Calculation of both the internal and external clustering quality criteria by formulas (3) and (4).
Step 9. If the condition e ≤ max(EPS) is true, the EPS value is increased (e = e + de) and steps 7 and 8 of this procedure are repeated. Otherwise, the complex balance criterion is calculated by formulas (5)-(9).
Step 10. If the MinPts value does not exceed its maximum (k ≤ max(MinPts)), the MinPts value is increased (k = k + 1) and the algorithm returns to step 6. Otherwise, the charts of the complex balance criterion versus the EPS value are created for each MinPts value.
Step 11. Analysis of the obtained results. Fixation of the optimal clustering.
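The enumeration in steps 5-10 can be sketched in Python as a double loop over MinPts and EPS. The sketch is illustrative only and does not reproduce the R plugins used in the KNIME workflow: `dbscan` is a compact textbook version of the algorithm with Euclidean distance, and `criterion` is a hypothetical callback that stands in for the quality criteria of formulas (3)-(9). As a simplification, the callback is evaluated for every parameter pair, whereas the steps above compute the complex balance criterion once the EPS range has been traversed.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0..k-1 for clusters, NOISE for noise."""
    n = len(points)
    # Eps-neighborhood of every point (the point itself is included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def sweep(subset_a, subset_b, min_pts_values, eps_values, criterion):
    """Cluster both subsets for every (MinPts, EPS) pair and score the result.

    `criterion(labels_a, labels_b)` is a hypothetical callback returning
    the value to be maximised.
    """
    results = {}
    for min_pts in min_pts_values:
        for eps in eps_values:
            labels_a = dbscan(subset_a, eps, min_pts)
            labels_b = dbscan(subset_b, eps, min_pts)
            results[(min_pts, eps)] = criterion(labels_a, labels_b)
    return results


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

With the returned dictionary, the parameter pair with the highest score can be read off as `max(results, key=results.get)`; the final choice in step 11 still depends on the goal of the current task.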

EXPERIMENTS
The simulation of the proposed technology was performed on the KNIME analytics platform [20] using R software [21]. The structure of the model used is presented in Fig. 4. To estimate the effectiveness of the proposed technology, the datasets "Aggregation" [22], "Compound" [23], "Multishapes" [24] and "Jain" [25] of the School of Computing of the University of Eastern Finland were used. These data are presented in two-dimensional space and include clusters of different shapes. Fig. 5 shows the character of the distribution of the studied data.
The other datasets were Fisher's iris data [26] and gene expression profiles of patients examined for lung cancer [27]. The gene expression data were presented as a matrix, where the number of rows is the number of studied genes (2000) and the number of columns is the number of studied objects or experimental conditions (96). A gene expression profile in this case is a vector of gene expression values determined for the different experimental conditions. To estimate the proximity level between the studied vectors we used the Euclidean distance for the low-dimensional data and the correlation distance for the gene expression profiles. In accordance with the algorithm presented in Fig. 3, the studied data were normalized and divided into two equal power subsets using the algorithm presented above. Then the sorted k-dist graphs were created for MinPts values from 3 to 8. These k-dist graphs were used to determine the range of the EPS value. At the next step, the data were clustered on the two equal power subsets within the range of the EPS value for each MinPts value. As a result, we obtained the charts of the complex balance clustering quality criterion versus the EPS value for each MinPts value. The analysis of these charts allows us to determine the best clustering in terms of both the criteria used and the goal of the current task. Fig. 6 shows the sorted k-dist graphs for the "Aggregation" data. Similar graphs were obtained for the other data.
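The sorted k-dist graph used to bound the EPS range can be computed as in the following minimal Python sketch. The function name `sorted_k_dist` is hypothetical, Euclidean distance is assumed, and the values are sorted in descending order, as in the k-dist heuristic of [18]; k must be smaller than the number of points.

```python
import math


def sorted_k_dist(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted descending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances from p to all other points, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)


sample = [(0, 0), (0, 1), (0, 3)]
graph_k1 = sorted_k_dist(sample, 1)
graph_k2 = sorted_k_dist(sample, 2)
```

Plotting such a list for each k between 3 and 8 gives curves of the kind shown in Fig. 6; the region where a curve bends indicates a candidate range for EPS rather than a single exact value.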

RESULTS
The analysis of the obtained results allows us to determine both the ranges and the steps of the EPS value for each type of the investigated data. These parameters are presented in Table 1. Charts of the complex balance criterion versus the EPS value for different MinPts values for the "Aggregation" data are presented in Fig. 7. Similar charts were obtained for the other investigated data. The analysis of the obtained charts allows us to select a subset of intermediate solutions (new, narrower ranges and smaller steps of the EPS value), which correspond to the maximum values of the complex balance criterion.
Then a detailed analysis of the selected solutions is performed in order to determine the optimal clustering in terms of the goal of the current task. The optimal parameters of the DBSCAN clustering algorithm, determined within the framework of the proposed technology for the investigated data, are presented in Table 2. Fig. 8 presents the results of the two-dimensional data clustering. Fig. 9 and Fig. 10 present the corresponding results for the "Iris" data and the gene expression profiles.
Figure 6. Sorted k-dist graph for "Aggregation" data

DISCUSSION
The analysis of the obtained results allows us to conclude that the objects were distributed into clusters correctly in all cases. For the "Aggregation" data (Fig. 8a) the result is seven clusters. Several objects were identified as noise, since the density of their distribution in the feature space is lower than the density of the objects within the obtained clusters. It should be noted that in this case the connected clusters were divided correctly. The result of clustering the "Compound" data is presented in Fig. 8b. As can be seen, in this case the objects are also distributed into clusters correctly. The result is five clusters, which corresponds to the character of the object distribution in the feature space. Many objects are identified as noise. This is natural, since the density of these objects in the feature space is significantly lower than the density of the other objects. Similar results are observed for the "Multishapes" (Fig. 8c) and "Jain" (Fig. 8d) data. The "Multishapes" data contain clusters of different shapes and sizes. As can be seen, the studied objects are distributed into clusters correctly. Five clusters were allocated in this case. The objects of the "Jain" data were distributed into three clusters. It should be noted that even a small change of the DBSCAN algorithm parameters decreases the clustering quality in all cases: intersecting or undivided clusters appear.
The analysis of the result of clustering the "Iris" data (Fig. 9) allows us to conclude that the objects of the "Setosa" class were allocated to the first cluster. This cluster has no intersection with the other clusters. However, four objects of the "Setosa" class were identified as noise. A detailed analysis of the parallel coordinates plot for the objects of the "Setosa" class has shown that this class contains several objects whose profiles differ from the profiles of the other objects of this class. Thus, the obtained result is adequate. The analysis of the parallel coordinates plot for the objects of the "Virginica" and "Versicolor" classes has shown that these classes intersect to some extent a priori. Fifty percent of the objects of the "Versicolor" class and forty-four percent of the objects of the "Virginica" class were distributed into the second and the third clusters respectively. Ten percent of the objects of the "Versicolor" class and twenty percent of the objects of the "Virginica" class were identified as noise. The analysis of the parallel coordinates plot has shown that these classes contain objects whose profiles differ from the profiles of the other objects of these classes. Moreover, the results of the analysis have also shown that the second and the third clusters have a thirty-eight percent intersection in this case. However, this is correct in view of the type of the studied data.
For the gene expression profiles of patients examined for lung cancer, the data were distributed into three clusters. The first cluster contained 81.6% of the gene expression profiles, which determine the main processes in the investigated object. The second cluster contained only 0.9% of the investigated gene expression profiles. 17.5% of the gene expression profiles were identified as noise. The first cluster is of the greatest interest for further processing, since it contains the genes that define the main functions in the studied object.
In summary, the implementation of the proposed technology allows us to determine the optimal parameters of the DBSCAN clustering algorithm in terms of the clustering quality. The analysis of the obtained results has shown that the investigated data were distributed into clusters correctly for the different types of data used.

CONCLUSIONS
The relevant problem of increasing the quality of complex data clustering is solved by using the DBSCAN clustering algorithm within the framework of the objective clustering inductive technology.
The scientific novelty of the proposed hybrid model is the following:
- the data clustering is performed on the two equal power subsets concurrently within the range of the algorithm parameters;
- the optimal parameters of the clustering algorithm are determined based on the maximum values of the complex balance criterion, which contains both the internal and external clustering quality criteria as components;
- the final decision on the optimal clustering is made based on a comparative analysis of the best intermediate solutions, taking into account the goal of the current task.
The implementation of the proposed information technology allows us to increase the quality of data clustering due to the parallel processing of the data and the use of both the internal and external clustering quality criteria.
The practical significance of the obtained results lies in the practical implementation of the proposed hybrid model based on the combined use of the R and KNIME tools. The hybrid model was tested on different types of data. The analysis of the obtained results has shown the high effectiveness of the proposed technology, since the investigated data were distributed into clusters correctly in all cases.