OUTLIER DETECTION TECHNIQUE FOR HETEROGENEOUS DATA USING TRIMMED-MEAN ROBUST ESTIMATORS

Context. Unfortunately, the assumptions most commonly used in parametric statistics, such as normality, linearity, and independence, are not always fulfilled in practice. The main reason for this is the appearance in data samples of observations that differ from the bulk of the data, as a result of which the sample becomes heterogeneous. Applying generally accepted estimation procedures, for example the sample mean, under such conditions increases the bias and decreases the efficiency of the estimates obtained. This, in turn, raises the problem of finding possible ways to process data sets that include outliers, especially in small samples. The object of the study is the process of detecting and excluding anomalous objects from heterogeneous data sets. Objective. The goal of the work is to develop a procedure for anomaly detection in heterogeneous data sets and to justify the use of a number of trimmed-mean robust estimators as a statistical measure of the location parameter of distorted parametric distribution models. Method. The problems of analyzing (processing) heterogeneous data containing outliers, i.e. sharply distinguished, suspicious observations, are considered. The possibilities of using robust estimation methods for processing heterogeneous data are analyzed. A procedure is proposed for identifying and removing outliers caused by measurement errors, hidden equipment defects, experimental conditions, etc. The proposed approach is based on symmetric and asymmetric trimming of the ranked set obtained from the initial sample of measurement data, drawing on the methods of robust statistics. For a well-founded choice of the trimming coefficient, it is proposed to use adaptive robust procedures. Observations that fall into the zones of the smallest and largest order statistics are considered outliers. Results.
The proposed approach, in contrast to traditional criteria for identifying outlying observations such as the Smirnov (Grubbs) test, the Dixon test, etc., makes it possible to split the analyzed data set into a homogeneous component and a set of outlying observations, assuming that their share in the total set of analyzed data is unknown. Conclusions. The article proposes the use of robust statistics methods for forming the supposed zones containing homogeneous and outlying observations in the ranked set built from the initial sample of the analyzed data. It is proposed to use a complex of adaptive robust procedures to establish the expected trimming levels that form the zones of outlying observations in the regions of the smallest and largest order statistics of the ranked data set. The final trimming level of the ranked data set is refined on the basis of existing criteria that allow the boundary observations (minimum and maximum) to be checked for outliers.


NOMENCLATURE
X is an initial data set; X* is the ordered sample constructed on the basis of X; n is the data sample size; x_(i) is an order statistic from the data set X*; α is a trimming proportion; α_L is the trimming proportion of the smallest order statistics of X* (asymmetric case); α_U is the trimming proportion of the largest order statistics of X* (asymmetric case); ε is the proportion of outliers; L(α) is the average of the [αn] smallest order statistics of X*; U(α) is the average of the [αn] largest order statistics of X*; M(α) is the average of the [αn] middle order statistics of X*; Es_sym is a group of robust estimators based on symmetric trimming; Es_asym is a group of robust estimators based on asymmetric trimming; T_j(α) is a symmetric trimmed mean obtained in accordance with the selected adaptive robust estimator; T_j(α_L, α_U) is an asymmetric trimmed mean obtained in accordance with the selected adaptive robust estimator; is the standard error of the symmetric trimmed mean; is the standard error of the asymmetric trimmed mean; Al is a set of possible trimming levels; u_(i) is a statistic of the Smirnov criterion; u_α is a critical value of the Smirnov criterion; G(x; x̄, σ²) is a normal distribution with mean x̄ and variance σ²; [y] is the integer part of y.

INTRODUCTION
When processing real data collected by means of registration (observation) methods, measurements (experiments, tests), or the participation of third parties (interviews, focus groups, expert assessment methods), analysts often face a situation in which data sets (samples) contain observations that, to one degree or another, differ (stand out) from the rest in terms of the analyzed attribute.
In statistical analysis, such observations are called "abnormal", "outliers", "sharply distinguished", "suspicious", etc. They can be caused by gross errors, measurement errors, failures of measuring equipment, human factors (operator errors), as well as short-term abrupt changes in measurement (experimental) conditions, for example, vibration. The share of outliers is usually about 5% to 10% of observations in the total data set [1][2][3][4], which disturbs its homogeneity. The appearance of abnormal data in samples gives rise to so-called "heavy tails", multimodality, pronounced asymmetry, excess kurtosis, etc. This, in turn, prevents such data samples from being represented by rigorous parametric models described by well-known probability distributions (e.g. the Normal, Poisson, or Student distributions) and characterized by two main parameters: the location parameter, which may be the population or sample mean, median, or mode; and the scale parameter, which can be represented by the variance, standard deviation, range (peak-to-peak), etc.
The use of generally accepted estimation procedures under such conditions, which rest on an explicit or implicit assumption of normality, increases the bias and reduces the efficiency of the obtained estimates.
In the context of the analysis of expert information, the use of survey data processing methods, which are based on the procedure of their averaging, will be justified only if there is a sufficiently high consistency (proximity) of expert assessments.
The object of study is the process of detecting and excluding anomalous objects from the heterogeneous data sets.
The subject of study is the procedures, methods, and techniques for finding unusual observations (outliers) in analyzed data sets.
The purpose of the work is to develop a procedure for anomaly detection in heterogeneous data sets, and the rationale for using a number of trimmed-mean robust estimators as a statistical measure of the location parameter of distorted parametric distribution models.

PROBLEM STATEMENT
Let X = (x_1, x_2, …, x_n) be a set of values measured for some parameter, where n is the size of the sample X.
The task is to identify the region X′ ⊂ X that contains homogeneous data and the region containing outlying observations. If the level of "clogging" (the proportion of outliers ε) is more than 0.5, further analysis is impractical; if the level of "clogging" equals 0.5, only the sample median can be recommended as a robust estimate of the average of such a sample.

REVIEW OF THE LITERATURE
To process data sets containing heterogeneous observations, the following approaches are used [1, 2, 4, 5, 6]: identification and exclusion of outliers; application of robust methods for the statistical analysis of data with outliers. For the first task, there currently exists a whole class of parametric and nonparametric methods for identifying anomalous observations [5][6][7][8][9]. Parametric methods are based on complete a priori information about the observations; their application presupposes a priori knowledge of the theoretical distribution of the investigated values or its determination from empirical data. Nonparametric methods do not use detailed a priori information about the observations and can be applied when the distribution of the indicator under study is unknown and no analytical description of it is needed. An important condition for using nonparametric methods is that the distribution functions of the analyzed measurements must be continuous. The effectiveness of these procedures largely depends on the size of the sample under study and the power of the selected criteria.
One aim of robust statistics is to develop estimation procedures that are resistant to the appearance of outliers in data sets and yield unbiased (or slightly biased) and efficient estimates. Currently, the following classes of robust estimators exist [1-4, 10]: robust estimators based on the maximum-likelihood argument (M-estimators); robust estimators based on rank statistics (R-estimators); robust estimators based on linear combinations of order statistics (L-estimators). Robust L-estimators are the most widespread due to the simplicity of their computational implementation [10][11][12][13]. These estimators include trimmed, censored, and Winsorized means, the sample median, etc. The main problem with estimators of this type is the choice of the trimming proportion α, which can be adequately solved by using adaptive robust procedures for statistical data estimation [13][14][15][16].
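As an illustration, the symmetric trimmed mean (a typical L-estimator) can be sketched in a few lines of Python; the data values here are invented for the example:

```python
import numpy as np

def trimmed_mean(x, alpha):
    """Symmetric alpha-trimmed mean: drop the [alpha*n] smallest and
    [alpha*n] largest order statistics, then average the remainder."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    g = int(alpha * n)          # [alpha*n] observations trimmed from each tail
    if 2 * g >= n:
        raise ValueError("trimming proportion removes the whole sample")
    return xs[g:n - g].mean()

# A sample with one gross outlier: the trimmed mean stays near the bulk
# of the data, while the ordinary mean is dragged towards 100.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 100.0]
raw_mean = np.mean(data)                 # 19.01, inflated by the outlier
robust_mean = trimmed_mean(data, 0.1)    # 10.05, close to the bulk
```

The same estimator is available as scipy.stats.trim_mean; the manual version above only makes the trimming of the order statistics explicit.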

MATERIALS AND METHODS
Most of the existing criteria for testing outlying observations are based on the assumption that the distribution of measurements corresponds to the normal distribution [5][6][7][8][9]. For finding and filtering out sharply distinguished observations in small samples, the most widespread and theoretically justified are Grubbs-type statistics such as the Smirnov (Grubbs) test, the Tietjen-Moore test, and the Dixon test [6, 8, 9]. These criteria allow checking either one outlier (the smallest or the largest) or two (the two smallest or the two largest in the sample). At the same time, the problem remains of searching for outliers when their share in the total set of measurement data is unknown.
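For reference, a minimal sketch of the two-sided Grubbs (Smirnov) test for a single outlier under the normality assumption; the critical value follows the standard t-distribution formula from textbooks, which may differ in detail from the exact variant used in the paper:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs (Smirnov) test for a single outlier.

    Returns (is_outlier, G, G_crit) for the observation farthest from
    the sample mean.  Assumes the data are approximately normal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Test statistic: largest absolute deviation in units of the sample SD.
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the Student t distribution with n-2 df.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G > G_crit, G, G_crit

# The value 15.0 stands far from the bulk and is flagged as an outlier.
is_out, G, G_crit = grubbs_test([10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 15.0])
```

As the text notes, a single application of such a test checks only one suspicious observation; repeated application is needed when the number of outliers is unknown.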
In this regard, the problem of finding the homogeneous component of a set of measurement data is urgent. It is assumed that the measurement results have an approximately normal distribution. Modern research has shown that checking the analyzed data samples for compliance with the Gaussian distribution is a rather difficult task, especially for samples of limited volume (n ≤ 50). Currently, there is a fairly extensive class of goodness-of-fit tests [17][18][19] applicable to small data samples, for example the nonparametric Shapiro-Wilk test [18] and the Sarkadi test [19]. At the same time, it has been proved that in the small-sample case it is not always possible to distinguish the normal distribution from other types of distributions.
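A normality check of the kind mentioned above can be sketched with the Shapiro-Wilk test from scipy; the helper name and the contaminated example data are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def looks_gaussian(sample, alpha=0.05):
    """Shapiro-Wilk goodness-of-fit check, suitable for small samples
    (n <= 50).  H0: the sample was drawn from a normal distribution."""
    W, p = stats.shapiro(np.asarray(sample, dtype=float))
    return p > alpha

# A small sample contaminated by one gross outlier fails the check.
contaminated = np.r_[np.linspace(-1.0, 1.0, 39), 100.0]
```

This only rejects or fails to reject normality; as the text stresses, a non-rejection on a small sample is weak evidence that the data are actually Gaussian.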
Under these conditions, the paper proposes to use a complex of adaptive robust estimators to identify the homogeneous component of the analyzed data set. In this case, however, the problem arises of choosing among robust estimators that may recommend different trimming levels for the ordered sample built from the analyzed sample of measurement data. The trimming proportion has a direct impact on the size of the area containing homogeneous data, i.e. the area that contains no outliers. To refine (expand or narrow) the area of homogeneous data, a Grubbs-type test is used.
Let us consider the main stages of the proposed procedure for finding and eliminating outliers in the studied data set.
Stage 1. Formation of the trimming levels of the ordered sample X*.
1.1. Construction, on the basis of the measurement data set X, of the ordered sample X*: x_(1) ≤ x_(2) ≤ … ≤ x_(n). Denote by x_(i) the i-th order statistic of the data set X*.
Stage 2. Checking the boundary observations of the ordered sample.
2.1. Formulation of the null and alternative hypotheses:
2.1.1. The Es_sym group has been selected.
H_0: the boundary observation (x_(g) or x_(n-g+1)) belongs to the same general population as the remaining q = n - 2g - 1 central values of the ordered sample.
H_1: the boundary observation (x_(g) or x_(n-g+1)) is an outlier.
2.1.2. The Es_asym group has been selected.
H_0: the boundary observation (x_(g1) or x_(n-g2+1)) belongs to the same general population as the remaining q = n - g_1 - g_2 - 1 central values of the ordered sample.
H_1: the boundary observation (x_(g1) or x_(n-g2+1)) is an outlier.
The elements x_(g1) and x_(n-g2+1) are checked alternately.
Stage 3. Outlier exclusion (Fig. 2-3).
3.1. IF H_0 is accepted:
3.1.1. for x_(n-g+1) (or x_(n-g2+1), respectively), THEN the next senior member of the ordered sample is tested until an element x_(s) is found for which H_0 is rejected. Then the group of senior members of the ordered sample x_(s) ≤ … ≤ x_(n) is considered outliers.
3.1.2. for x_(g) (or x_(g1), respectively), THEN the preceding junior member of the ordered sample is tested until an element x_(s) is found for which H_0 is rejected. Then the group of junior members of the ordered sample x_(1) ≤ … ≤ x_(s) is considered outliers.
3.2. IF H_0 is rejected, the next trimming levels from the set Al (for x_(g) and x_(n-g+1), or x_(g1) and x_(n-g2+1), respectively) are consistently selected.
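The iterative testing of the upper-tail boundary observations in Stages 2-3 can be sketched as follows; the criterion implementation (a Grubbs-type check of the candidate against the current homogeneous core), the function names, and the example data are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np
from scipy import stats

def smirnov_upper(core, candidate, alpha=0.05):
    """Grubbs-type check: is `candidate` an outlier relative to the
    homogeneous core sample?  A simplified stand-in for the Smirnov
    criterion used in the paper."""
    core = np.asarray(core, dtype=float)
    n = len(core) + 1
    G = abs(candidate - core.mean()) / core.std(ddof=1)
    t = stats.t.ppf(1 - alpha / n, n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G > G_crit

def upper_outlier_zone(x, g, alpha=0.05):
    """Stage 2-3 sketch for the upper tail: starting from the boundary
    order statistic x_(n-g+1), walk towards the maximum until H_0 is
    rejected; everything from that point upward is the outlier zone."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    for s_idx in range(n - g, n):          # x_(n-g+1), ..., x_(n)
        if smirnov_upper(xs[:s_idx], xs[s_idx], alpha):
            return xs[s_idx:]              # flagged outlier zone
    return np.array([])                    # no upper outliers found

# Example: 18 homogeneous values plus two gross upper outliers.
data = np.r_[np.linspace(0.0, 1.0, 18), 10.0, 11.0]
zone = upper_outlier_zone(data, g=3)
```

The lower-tail branch (item 3.1.2) is symmetric: the loop walks from x_(g) towards the minimum instead.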
Repeat the procedure of items 2.1-3.2.

To verify the proposed approach, a model data set (volume n = 500) has been constructed.
The constructed model is a two-component symmetric mixture of normal distributions of the form:

F(x) = (1 - ε)·G(x; x̄, σ0²) + ε·G(x; x̄, σ1²). (8)

Using the Tukey contamination model (8), a mixture was generated with the following parameters:
- the main Gaussian distribution G(x; x̄, σ0²) with parameters x̄ = 0, σ0 = 0.7;
- the contaminating Gaussian distribution G(x; x̄, σ1²) with parameters x̄ = 0, σ1 = 1.2;
- the proportion of outliers ε = 25%.
The test for the symmetry of the distribution was performed on the basis of the HeQ1 robust estimator (6). Based on the results of this verification, the group of symmetric trimming estimators Est = Es_sym was selected [10, 13-16].
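Generating a sample from the Tukey gross-error model with these parameters can be sketched as follows (the random seed is an arbitrary choice for reproducibility):

```python
import numpy as np

# Two-component Tukey contamination model used for the experiment:
# F(x) = (1 - eps) * G(x; 0, 0.7^2) + eps * G(x; 0, 1.2^2), eps = 0.25.
rng = np.random.default_rng(42)   # arbitrary seed
n, eps = 500, 0.25
from_contaminant = rng.random(n) < eps           # which component each draw uses
sigma = np.where(from_contaminant, 1.2, 0.7)     # per-observation scale
sample = rng.normal(loc=0.0, scale=sigma)        # shape inferred from `sigma`
```

Both components share the same mean, so the mixture is symmetric; only the scale differs, which is what produces the heavy-tailed, "clogged" sample analyzed in the Results.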

RESULTS
The results of the analysis of the Est-group estimators are presented in Table 1.
Based on the results of checking the boundary values x_(g) and x_(n-g+1) at α = 12.5% (the NH1 estimator) using the Smirnov test at the 0.05 significance level, H_0 was accepted: the boundary observations belong to the same general population as the other central values of the ordered sample.
Thus, two regions [x_(1); x_(62)] and [x_(439); x_(500)] were formed for testing the boundary values for anomalies by the Smirnov criterion in accordance with the scheme shown in Fig. 3.

DISCUSSION
A procedure for searching for outliers in an analyzed data set has been proposed in this paper. To extract the homogeneous component of the set of measurement data, it was proposed to use robust estimators based on linear combinations of order statistics, which employ both symmetric and asymmetric trimming. In the symmetric case, the [αn] smallest and [αn] largest observations are considered outliers that violate the homogeneity of the analyzed data set. In the asymmetric case, the trimming proportion α is additionally divided into the proportions α_L and α_U, which correspond to the trimming levels of the [α_L n] smallest and [α_U n] largest observations. The main problem with estimators of this type is the choice of the trimming coefficient α, which can be adequately solved by using adaptive robust procedures. With the adaptive approach, a specific type of estimator is selected on the basis of auxiliary measures of tail length, skewness, and kurtosis. To choose the initial (reference) trimming level of the ordered sample, the adaptive estimator that minimizes the standard error of the trimmed mean was chosen.
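The last step can be illustrated with a common textbook estimate of the standard error of the trimmed mean via the Winsorized variance; the candidate grid, the SE estimator, and the example data below are assumptions for the sketch, not the paper's exact adaptive rule:

```python
import numpy as np

def trimmed_mean_se(x, alpha):
    """Estimated standard error of the symmetric alpha-trimmed mean,
    computed from the Winsorized sample variance:
    SE = s_W / ((1 - 2g/n) * sqrt(n)), where g = [alpha * n]."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    g = int(alpha * n)
    w = xs.copy()
    if g > 0:
        w[:g] = xs[g]              # Winsorize the lower tail
        w[-g:] = xs[n - g - 1]     # Winsorize the upper tail
    s_w = w.std(ddof=1)            # Winsorized standard deviation
    return s_w / ((1 - 2 * g / n) * np.sqrt(n))

def adaptive_alpha(x, grid=(0.05, 0.10, 0.15, 0.20, 0.25)):
    """Pick the trimming proportion from a candidate grid that minimizes
    the estimated standard error of the trimmed mean."""
    return min(grid, key=lambda a: trimmed_mean_se(x, a))

# Example: a light-tailed core plus two gross outliers (invented data).
x = np.r_[np.linspace(-1.0, 1.0, 40), -50.0, 50.0]
alpha_star = adaptive_alpha(x)
```

Once the outliers are trimmed away, further trimming of a light-tailed core only inflates the standard error, which is why minimizing this estimate yields a sensible reference trimming level.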

CONCLUSIONS
The problem of outliers detection in heterogeneous data sets has been studied.
The scientific novelty of the obtained results is that methods for detecting and excluding observations distorted by measurement errors (anomalous observations) due to hidden equipment defects, equipment operating conditions, and other factors have received further development. The mathematical apparatus of nonparametric statistics was used to process the observation results and search for outliers. The proposed approach is based on trimming the ordered samples obtained from the initial sample of measurement data. Observations that fall into the regions of the [α_L n] smallest and [α_U n] largest order statistics are considered outliers. To form the possible trimming levels of the ordered sample, data processing algorithms using adaptive robust statistical estimation procedures were applied, which made it possible to formalize the procedure for selecting the trimming level.
The practical significance of the obtained results lies in the fact that the proposed approach, alongside the traditional tests for outlier detection, makes it possible to single out a set of abnormal measurement results when their share in the total set of measurement data is unknown. This, in turn, extends the capabilities of existing outlier-search algorithms and is ultimately aimed at increasing the reliability of the statistical processing of observation results (primary measurements).