NONLINEAR REGRESSION MODELS FOR ESTIMATING THE DURATION OF SOFTWARE DEVELOPMENT IN JAVA FOR PC BASED ON THE 2021 ISBSG DATA

Context. The problem of estimating the duration of software development in Java for personal computers (PC) is important because, first, failed duration estimation is often the main contributor to failed software projects; second, Java is a popular language; and, third, the personal computer is a widespread multi-purpose computer. The object of the study is the process of estimating the duration of software development in Java for PC. The subject of the study is the nonlinear regression models for estimating the duration of software development in Java for PC. Objective. The goal of the work is to build nonlinear regression models for estimating the duration of software development in Java for PC based on normalizing transformations and outlier removal, in order to increase the confidence of the estimation in comparison to the ISBSG model for the PC platform. Method. The models and the confidence and prediction intervals of nonlinear regressions for estimating the duration of software development in Java for PC are constructed with the help of appropriate techniques based on normalizing transformations for non-Gaussian data. We also apply outlier removal during model construction. In general, this reduces the mean magnitude of relative error and the widths of the confidence and prediction intervals in comparison to nonlinear models constructed without outlier removal. Results. The model based on the decimal logarithm transformation has been compared with the nonlinear regression models based on the Johnson (for the $S_B$ family) and Box-Cox transformations, both univariate and bivariate. Conclusions. The nonlinear regression model for estimating the duration of software development in Java for PC is constructed based on the decimal logarithm transformation. This model, in comparison with the other nonlinear regression models, has smaller widths of the confidence and prediction intervals for effort values greater than 900 person-hours. The prospects for further research may include the application of bivariate normalizing transformations and data sets to construct nonlinear regression models for estimating the duration of software development in other languages for PC and for other platforms, for example, mainframe.


ABBREVIATIONS
COCOMO is a constructive cost model; ISBSG is the International Software Benchmarking Standards Group; KLOC is kilo lines of code (one thousand lines of code); MMRE is a mean magnitude of relative error; MRE is a magnitude of relative error; PC is a personal computer; PRED is a percentage of prediction; SMD is a squared Mahalanobis distance.

$P$ is a non-Gaussian random vector; $R^2$ is a multiple coefficient of determination; $T$ is a Gaussian random vector; $t_{\alpha/2,N-2}$ is a quantile of Student's t-distribution with $N-2$ degrees of freedom and $\alpha/2$ significance level; $X_1$ is the effort of software development; $Y$ is the duration of software development; $Z_1$ is a Gaussian variable that is obtained by transforming variable $X_1$; $Z_Y$ is a Gaussian variable that is obtained by transforming variable $Y$; $\bar{Z}_Y$ is a sample mean of the $Z_Y$ values; $\hat{Z}_Y$ is a prediction result by the linear regression equation for normalized data.

INTRODUCTION
Estimation of duration, effort, and cost is a very important and integral part of the software development life cycle [1-3]. It is important to estimate as accurately as possible because failed estimation (including duration estimation) is often the main contributor to failed software projects.
Today, the estimation of duration in software development is mostly based on heuristic approaches such as expert judgment and planning poker. In the absence of experts, it is very difficult to estimate software development duration. That is why there is a need for algorithmic methods and mathematical models that can produce accurate estimates.
For many years the best-known models have been regression equations such as COCOMO and ISBSG. These models are similar in structure (both depend on effort and are constructed based on the decimal logarithm). There is only one ISBSG model for estimating the duration of software development for the PC platform, and there are no models that additionally take into account the programming language. In this paper, we demonstrate the need to take the programming language into account for the ISBSG model. We calibrated the ISBSG model using the ISBSG data set (D&E Corporate Release May 2021 R1) collected from software development projects in Java for PC. We used the ISBSG data set because for many years the ISBSG repository has been applied as a foundation of the software project estimation process [4]. Also, using the above data set, we constructed other models based on normalizing transformations such as the Box-Cox and Johnson transformations.
The object of study is the process of estimating the duration of software development in Java for PC.
The subject of study is the regression models to estimate the duration of software development in Java for PC.
The purpose of the work is to increase the confidence in estimating the duration of software development in Java for PC.

PROBLEM STATEMENT
Suppose we are given the original sample as a bivariate non-Gaussian data set: actual duration (in months) $Y$ and effort (in person-hours) $X_1$ of software development in Java for PC. Suppose that there is a mutually inverse normalizing transformation of the non-Gaussian random vector $P = \{Y, X_1\}^T$ to the Gaussian random vector $T = \{Z_Y, Z_1\}^T$

$$T = \psi(P) \qquad (1)$$

and the inverse transformation for (1)

$$P = \psi^{-1}(T). \qquad (2)$$

It is required to build the nonlinear regression model in the form $Y = Y(X_1, \varepsilon)$ based on transformations (1) and (2).

REVIEW OF THE LITERATURE
Although the first models for estimating the duration of software development were built in the 1970s and 1980s [1,5], research in this area is still ongoing [4,6-10].
Most often these models estimate the duration of software development depending on the development effort. Building such models requires corresponding data sets. At first, these were data sets of government organizations (NASA, etc.). For at least 25 years many researchers have used data from the different ISBSG repository releases [6-10].
The COCOMO models were built using project size as the data clustering criterion [1]. Software development projects were split by size into 3 types: organic (2-50 KLOC), semi-detached (50-300 KLOC), and embedded (larger than 300 KLOC). A separate model was then built for each of these types.
The ISBSG models are similar in structure to the COCOMO models. The only difference is that they were built for such platforms as mainframe, mid-range, and personal computers based on the 1996 ISBSG repository data.
In all models from [1,2,6] the decimal logarithm transformation was used to normalize empirical data. But as is clear from [6], this transformation is not always acceptable for empirical data normalization. In [6] a linear regression was performed on the Log10-transformed values of duration and effort for the 39 PC software development projects. The resulting value of $R^2$ is very low and means that there is no correlation between the dependent and independent variables.
In the nonlinear regression model for estimating the duration of software development for PC [7], the Johnson univariate transformation was used to normalize the empirical values of duration and effort. This transformation makes it possible to build valid models in some cases, but, as will be shown in this research, it gives only average model quality with the 2021 ISBSG repository data for software developed in Java for PC. Therefore, it is also required to apply a bivariate transformation and remove outliers from the empirical data to build a high-quality model according to [11].
A normalizing transformation is often a good way to construct nonlinear regression models [11-17]. According to [14], transformations are made for essentially four purposes, two of which are: first, to obtain approximate normality for the distribution of the error term (residuals); second, to transform the response and/or the predictor in such a way that the linear relationship between the new (normalized) variables is stronger than the linear relationship between the initial dependent and independent variables.
According to [11], there may be data sets for which the results of building nonlinear regression models depend, first, on which normalizing transformation is used, univariate or multivariate, and, second, on whether there are any outliers in the data set. That is why [11] considered a technique to build nonlinear regression models based on multivariate normalizing transformations and prediction intervals. In this technique, in addition to the technique for detecting outliers in multivariate non-Gaussian data [18], the prediction intervals of nonlinear regressions are used to detect outliers in the process of constructing the nonlinear regression models. We apply the above technique [11] to build the nonlinear regression models with one predictor (effort) for estimating the duration of software development in Java for PC.

MATERIALS AND METHODS
According to [11], the technique to build nonlinear regression models based on the normalizing transformations and prediction intervals consists of four steps. In the first step, multivariate non-Gaussian data are normalized using a multivariate normalizing transformation (1).
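In the second step, the nonlinear regression model is constructed based on the multivariate normalizing transformation (1). Before that, we determine whether any data point of the multivariate non-Gaussian data set is a multidimensional outlier. To do this, we apply the statistical technique based on the normalizing transformations and the squared Mahalanobis distance (SMD) as in [18,19]. If there is a two-dimensional outlier in the bivariate non-Gaussian data set, we discard it and return to step 1; otherwise, we build the linear regression model for the normalized data based on the transformation (1) in the form

$$Z_Y = \hat{Z}_Y + \varepsilon = \hat{b}_0 + \hat{b}_1 Z_1 + \varepsilon, \qquad (3)$$

where $\varepsilon$ is a Gaussian random variable that defines residuals, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$. After that, the nonlinear regression model is built based on the linear regression model (3) and the transformations (1) and (2) as

$$Y = \psi_Y^{-1}\left(\hat{Z}_Y + \varepsilon\right), \qquad (4)$$

where $\psi_Y^{-1}$ is the component of the inverse transformation (2) for $Y$.

In the third step, the prediction interval of nonlinear regression is defined [11] as

$$\psi_Y^{-1}\left(\hat{Z}_Y \pm t_{\alpha/2,N-2}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + \frac{(Z_1 - \bar{Z}_1)^2}{\sum_{i=1}^{N} (Z_{1_i} - \bar{Z}_1)^2} \right\}^{1/2}\right), \qquad (5)$$

where $S_{Z_Y}^2 = \frac{1}{N-2}\sum_{i=1}^{N} (Z_{Y_i} - \hat{Z}_{Y_i})^2$ and $\bar{Z}_1$ is the sample mean of the $Z_1$ values.

In the fourth step, we check whether there are data that are out of the bounds of the prediction interval. If we detect such outliers, we discard them and repeat all the steps, starting with the first, for the new data without the discarded outliers; otherwise, the nonlinear regression model construction is completed.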
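The loop formed by steps 2-4 can be illustrated with a minimal Python sketch. It is not the authors' Scilab program: it assumes the decimal logarithm as the normalizing transformation, omits the SMD check of step 2, and uses hypothetical names (`fit_line`, `prediction_bounds`, `remove_outliers`). The Student's t quantile is supplied by the caller as a function of the degrees of freedom, because the Python standard library does not provide it.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # (effort X1, duration Y)


def fit_line(z1: Sequence[float], zy: Sequence[float]) -> Tuple[float, float]:
    """Least-squares estimates (b0, b1) for zy = b0 + b1 * z1."""
    n = len(z1)
    m1, my = sum(z1) / n, sum(zy) / n
    s11 = sum((a - m1) ** 2 for a in z1)
    s1y = sum((a - m1) * (b - my) for a, b in zip(z1, zy))
    b1 = s1y / s11
    return my - b1 * m1, b1


def prediction_bounds(z1, zy, b0, b1, t_q) -> List[Tuple[float, float]]:
    """Prediction-interval bounds in normalized space, one pair per data point."""
    n = len(z1)
    m1 = sum(z1) / n
    s11 = sum((a - m1) ** 2 for a in z1)
    # Residual standard deviation with N - 2 degrees of freedom.
    s = math.sqrt(sum((y - b0 - b1 * a) ** 2 for a, y in zip(z1, zy)) / (n - 2))
    bounds = []
    for a in z1:
        half = t_q * s * math.sqrt(1.0 + 1.0 / n + (a - m1) ** 2 / s11)
        bounds.append((b0 + b1 * a - half, b0 + b1 * a + half))
    return bounds


def remove_outliers(data: Sequence[Point], t_quantile: Callable[[int], float]):
    """Repeat: normalize, fit, build the interval, discard points outside it."""
    points = list(data)
    while True:
        if len(points) < 3:
            raise ValueError("too few points left to fit the regression")
        z1 = [math.log10(x) for x, _ in points]
        zy = [math.log10(y) for _, y in points]
        b0, b1 = fit_line(z1, zy)
        bounds = prediction_bounds(z1, zy, b0, b1, t_quantile(len(points) - 2))
        # log10 is increasing, so testing Z_Y against the normalized bounds is
        # equivalent to testing Y against the back-transformed bounds.
        kept = [p for p, z, (lo, hi) in zip(points, zy, bounds) if lo <= z <= hi]
        if len(kept) == len(points):
            return points, (b0, b1)
        points = kept
```

If SciPy is available, `lambda df: scipy.stats.t.ppf(0.975, df)` is one possible `t_quantile` for a significance level of 0.05.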
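To normalize the data according to (1), we applied the decimal logarithm transformation with components

$$Z_1 = \log_{10} X_1, \qquad (6)$$

$$Z_Y = \log_{10} Y. \qquad (7)$$

Also, to normalize the data, we used the univariate and bivariate Box-Cox transformations [16] with components

$$Z_1 = \begin{cases} \left(X_1^{\lambda_1} - 1\right)/\lambda_1, & \lambda_1 \neq 0; \\ \ln X_1, & \lambda_1 = 0, \end{cases} \qquad (8)$$

and $Z_Y$, which is defined analogously to (8) with the only difference that $Z_1$, $X_1$, and $\lambda_1$ are replaced by $Z_Y$, $Y$, and $\lambda_Y$, respectively. Here $Z_1$ and $Z_Y$ are Gaussian variables; $\lambda_1$ and $\lambda_Y$ are parameters of the bivariate Box-Cox transformation.

Furthermore, to normalize the data, we used the univariate and bivariate Johnson transformations for the $S_B$ family [11] with components

$$Z_1 = \gamma_1 + \eta_1 \ln \frac{X_1 - \varphi_1}{\varphi_1 + \lambda_1 - X_1}, \quad \varphi_1 < X_1 < \varphi_1 + \lambda_1, \qquad (9)$$

and $Z_Y$, which is defined analogously to (9) with the only difference that $Z_1$, $X_1$, $\gamma_1$, $\eta_1$, $\varphi_1$, and $\lambda_1$ are replaced by $Z_Y$, $Y$, $\gamma_Y$, $\eta_Y$, $\varphi_Y$, and $\lambda_Y$, respectively. Here $Z_1$ and $Z_Y$ are Gaussian variables with zero mathematical expectation and unit variance; $\gamma_1$, $\eta_1$, $\varphi_1$, and $\lambda_1$ are parameters of the Johnson transformation for the $S_B$ family, $\eta_1 > 0$, $\lambda_1 > 0$.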
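The three normalizing transformations and their inverses can be written down directly. The following Python sketch is illustrative only; the function names and all parameter values used with it are hypothetical and are not the estimates obtained in this paper.

```python
import math


def log10_forward(x: float) -> float:
    """Decimal logarithm transformation, as in (6) and (7)."""
    return math.log10(x)


def log10_inverse(z: float) -> float:
    return 10.0 ** z


def box_cox_forward(x: float, lam: float) -> float:
    """Box-Cox transformation, as in (8); requires x > 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def box_cox_inverse(z: float, lam: float) -> float:
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


def johnson_sb_forward(x: float, gamma: float, eta: float,
                       phi: float, lam: float) -> float:
    """Johnson S_B transformation, as in (9); requires phi < x < phi + lam."""
    if not phi < x < phi + lam:
        raise ValueError("x must lie strictly between phi and phi + lam")
    return gamma + eta * math.log((x - phi) / (phi + lam - x))


def johnson_sb_inverse(z: float, gamma: float, eta: float,
                       phi: float, lam: float) -> float:
    # Solving (9) for X gives a logistic function of (z - gamma) / eta.
    return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))
```

Each forward/inverse pair is a concrete instance of the mutually inverse transformations (1) and (2) for a single component.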
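The nonlinear regression model based on the linear regression model (3) for the normalized data and the decimal logarithm transformation (6) and (7) has the form

$$Y = 10^{\,\varepsilon + \hat{b}_0 + \hat{b}_1 \log_{10} X_1}. \qquad (10)$$

The nonlinear regression model based on the bivariate Box-Cox transformation has the form [20]

$$Y = \left[\lambda_Y \left(\hat{Z}_Y + \varepsilon\right) + 1\right]^{1/\lambda_Y}, \quad \lambda_Y \neq 0. \qquad (11)$$

According to [20], the nonlinear regression model based on the Johnson bivariate transformation for the $S_B$ family has the form

$$Y = \varphi_Y + \lambda_Y \left[1 + e^{-\left(\hat{Z}_Y + \varepsilon - \gamma_Y\right)/\eta_Y}\right]^{-1}. \qquad (12)$$

In (10)-(12), as in (3), $\varepsilon$ is a Gaussian random variable which defines residuals, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$; in (11) and (12), $\hat{Z}_Y = \hat{b}_0 + \hat{b}_1 Z_1$ is the prediction by the linear regression equation (3), with $Z_1$ defined by (8) and (9), respectively.

The confidence interval of nonlinear regression is defined analogously to (5), with the only difference that the leading 1 is absent from the sum under the square root.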
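The difference between the confidence and prediction intervals can be seen in a short Python sketch for the decimal logarithm model (10). The function name and every numerical value in the usage note are hypothetical; the summary statistics (`s`, `n`, `z1_mean`, `s11`) are assumed to have been computed beforehand from the normalized data, and the t quantile is passed in by the caller.

```python
import math
from typing import Tuple


def interval_log10_model(x1: float, b0: float, b1: float, s: float, n: int,
                         z1_mean: float, s11: float, t_q: float,
                         prediction: bool = True) -> Tuple[float, float, float]:
    """Return (lower, point estimate, upper) of duration at effort x1.

    s       residual standard deviation of the linear regression (3)
    z1_mean sample mean of the normalized effort values
    s11     sum of squared deviations of the normalized effort from z1_mean
    t_q     Student's t quantile for N - 2 degrees of freedom
    """
    z1 = math.log10(x1)
    centre = b0 + b1 * z1
    # The prediction interval keeps the leading 1 under the square root;
    # the confidence interval drops it.
    leading = 1.0 if prediction else 0.0
    half = t_q * s * math.sqrt(leading + 1.0 / n + (z1 - z1_mean) ** 2 / s11)
    # Back-transform the bounds from normalized space to months.
    return 10.0 ** (centre - half), 10.0 ** centre, 10.0 ** (centre + half)
```

For example, `interval_log10_model(900.0, -0.3, 0.35, 0.1, 34, 3.0, 5.0, 2.037)` returns the prediction interval for made-up coefficients, and the same call with `prediction=False` returns the narrower confidence interval around the same point estimate.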

EXPERIMENTS
Before building a nonlinear regression model based on the multivariate normalizing transformation, we constructed a nonlinear regression equation (13) to estimate the duration $Y$ (in months) of software development for the PC platform depending on the effort $X_1$ (in person-hours), based on the decimal logarithm transformation of the data of 243 software projects with Data Quality Rating A from the 2021 ISBSG database (see Fig. 1).
Fig. 1 also contains the prediction interval bounds (dashed lines) of the nonlinear regression of the duration on the effort, which were constructed using the decimal logarithm (Log10) by (5) for a significance level of 0.05.
The values of $R^2$, MMRE, and PRED(0.25) for equation (13) are 0.2971, 0.4840, and 0.3457, respectively. These values fall short of the acceptable ones and indicate unsatisfactory accuracy of duration prediction by equation (13). That is why we apply the appropriate technique [11] to build a nonlinear regression model for estimating the duration of software development in Java for PC.
To construct a nonlinear regression model for estimating the duration of software development in Java for PC, we use the above technique for the data of 39 software projects with Data Quality Rating A from the ISBSG database (D&E Corporate Release May 2021 R1). These data are shown in Fig. 2 as dots. We checked the bivariate data from Fig. 2 for multivariate outliers. But before that, we tested the normality of the multivariate data from Fig. 2, because well-known statistical methods (for example, multivariate outlier detection based on the squared Mahalanobis distance (SMD)) detect outliers in multivariate data under the assumption that the data are described by a multivariate Gaussian distribution [16,18,19]. We applied the multivariate normality test proposed by Mardia [21]. Since the bivariate distribution of the data is not Gaussian, we detected outliers in the data from Fig. 2 based on the multivariate normalizing transformations and the SMD for normalized data. To normalize the data from Fig. 2, we applied three univariate and two bivariate transformations (see Table 1).
The parameter estimates of the univariate and bivariate Box-Cox transformations for the data from Fig. 2 are calculated by the maximum likelihood method according to [16]. The parameter estimates of the univariate and bivariate Johnson transformations for the $S_B$ family for the data from Fig. 2 are calculated by the maximum likelihood method according to [20]. The SMD values from Table 1 indicate that there is one multivariate outlier in the bivariate non-Gaussian data for four transformations (all univariate transformations and the bivariate Box-Cox transformation), since the SMD values for row 3 for the decimal logarithm and row 2 for two univariate transformations (Box-Cox and Johnson) and the bivariate Box-Cox transformation are greater than the quantile of the Chi-Square distribution, which equals 10.60 for the 0.005 significance level and 2 degrees of freedom. In Table 1, the SMD values that are greater than the above quantile are highlighted in bold.
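The SMD check can be sketched in Python as follows. The sketch assumes that the SMD is computed from the sample mean and the sample covariance matrix (divisor $N - 1$) of the normalized data; the exact estimator used in [18] may differ. For 2 degrees of freedom the Chi-Square quantile has the closed form $-2\ln\alpha$, which reproduces the value 10.60 quoted above for $\alpha = 0.005$. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def squared_mahalanobis(z1: Sequence[float], zy: Sequence[float]) -> List[float]:
    """SMD of every bivariate point from the sample mean of normalized data."""
    n = len(z1)
    m1, my = sum(z1) / n, sum(zy) / n
    # Sample covariance matrix entries (divisor N - 1, an assumption here).
    s11 = sum((a - m1) ** 2 for a in z1) / (n - 1)
    syy = sum((b - my) ** 2 for b in zy) / (n - 1)
    s1y = sum((a - m1) * (b - my) for a, b in zip(z1, zy)) / (n - 1)
    det = s11 * syy - s1y ** 2
    smd = []
    for a, b in zip(z1, zy):
        d1, dy = a - m1, b - my
        # d^T S^{-1} d written out for the 2 x 2 case.
        smd.append((syy * d1 * d1 - 2.0 * s1y * d1 * dy + s11 * dy * dy) / det)
    return smd


def chi2_quantile_2df(alpha: float) -> float:
    """Upper-alpha quantile of the Chi-Square distribution with 2 df."""
    return -2.0 * math.log(alpha)


def multivariate_outliers(z1, zy, alpha: float = 0.005) -> List[int]:
    """Indices of points whose SMD exceeds the Chi-Square quantile."""
    limit = chi2_quantile_2df(alpha)
    return [i for i, d in enumerate(squared_mahalanobis(z1, zy)) if d > limit]
```

With this covariance estimator the SMD of a single point cannot exceed $(N-1)^2/N$, so very small samples can never produce a value above the quantile.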
For example, a scatter plot of normalized effort $Z_1$ vs. normalized duration $Z_Y$ (using the bivariate Box-Cox transformation) for the data from Fig. 2 is shown in Fig. 3. Here the above outlier (Project 10868) is marked as an "outlier". Only the SMD values from Table 1 for the bivariate Johnson transformation for the $S_B$ family indicate that there are no multivariate outliers in the bivariate non-Gaussian data from Fig. 2, since all SMD values in this case are less than the above quantile value.
The reason for such different results in outlier detection is that only the data normalized using the bivariate Johnson transformation for the $S_B$ family pass the multivariate normality test proposed by Mardia [21]. Note that Mardia's test is based on measures of the multivariate skewness $\beta_1$ and kurtosis $\beta_2$ [21].
According to Mardia's test, the bivariate distribution of the data (from Fig. 2) normalized using the bivariate Johnson transformation for the $S_B$ family is approximately Gaussian, since the test statistic for multivariate skewness $N\beta_1/6$ of these data, which equals 2.08, is less than the quantile of the Chi-Square distribution, which is 14.86 for 4 degrees of freedom and the 0.005 significance level. Also, the test statistic for multivariate kurtosis $\beta_2$, which equals 10.71, is less than the value of the Gaussian distribution quantile, which is 11.30 for mean 8, variance 1.641, and the 0.005 significance level. Therefore, we decide that there are no multivariate outliers in the bivariate non-Gaussian data from Fig. 2 (39 data points), and we go to step 2 of the first iteration.
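Mardia's statistics for bivariate data can be computed with a few lines of Python. This is a sketch under stated assumptions: the maximum-likelihood covariance matrix (divisor $N$) is used, the skewness statistic $N\beta_1/6$ is compared with the Chi-Square distribution with 4 degrees of freedom through its closed-form survival function $e^{-x/2}(1 + x/2)$, and the kurtosis is compared one-sidedly with the quantile of the Gaussian distribution with mean 8 and variance $64/N$, as in the text. The function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def mardia_statistics(z1: Sequence[float], zy: Sequence[float]) -> Tuple[float, float]:
    """Return Mardia's multivariate skewness and kurtosis for bivariate data."""
    n = len(z1)
    m1, my = sum(z1) / n, sum(zy) / n
    dev = [(a - m1, b - my) for a, b in zip(z1, zy)]
    # Maximum-likelihood covariance matrix (divisor N).
    s11 = sum(u * u for u, _ in dev) / n
    syy = sum(v * v for _, v in dev) / n
    s1y = sum(u * v for u, v in dev) / n
    det = s11 * syy - s1y ** 2

    def g(p, q):
        # p^T S^{-1} q for the 2 x 2 covariance matrix.
        return (syy * p[0] * q[0] - s1y * (p[0] * q[1] + p[1] * q[0])
                + s11 * p[1] * q[1]) / det

    skewness = sum(g(p, q) ** 3 for p in dev for q in dev) / n ** 2
    kurtosis = sum(g(p, p) ** 2 for p in dev) / n
    return skewness, kurtosis


def kurtosis_limit(n: int, alpha: float = 0.005) -> float:
    """Gaussian quantile with mean 8 and variance 64 / n (two variables)."""
    return NormalDist(8.0, math.sqrt(64.0 / n)).inv_cdf(1.0 - alpha)


def passes_mardia(z1, zy, alpha: float = 0.005) -> bool:
    n = len(z1)
    skewness, kurtosis = mardia_statistics(z1, zy)
    stat = n * skewness / 6.0
    # Survival function of the Chi-Square distribution with 4 df.
    p_skewness = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return p_skewness > alpha and kurtosis < kurtosis_limit(n, alpha)
```

For $N = 39$, `kurtosis_limit(39)` gives approximately 11.30, in agreement with the quantile quoted above.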
We constructed the nonlinear regression model (10) for the 39 data points from Fig. 2 and calculated its prediction interval by (5). There are two outliers (data for software projects 10868 and 11641) since their $Y$ values are out of the prediction interval computed by (5) for a significance level of 0.05. We discarded the data of software projects 10868 and 11641. The first iteration is completed. The remaining 37 data points are shown in Fig. 4.
In the second iteration, there are no multivariate outliers in the bivariate non-Gaussian data from Fig. 4 (37 data points), so we go to step 2 of the second iteration. Next, we calculated the nonlinear regression prediction interval by (5) for the data normalized by the Log10 transformation of the 37 data points from Fig. 4. There are two outliers (data for software projects 12636 and 31895) since their $Y$ values are out of the prediction interval computed by (5) for a significance level of 0.05. We discarded the data of software projects 12636 and 31895. The second iteration is completed.
In the third iteration, we used data from the remaining 35 projects (see Fig. 5). There are no multivariate outliers in the bivariate non-Gaussian data from Fig. 5 (35 data points), so we go to step 2 of the third iteration.
Next, we used the 35 data points from Fig. 5 to construct the model in form (10) and calculated its prediction interval by (5). There is one outlier (data for software project 14487) since its $Y$ value is out of the prediction interval computed by (5) for a significance level of 0.05. We discarded the data of software project 14487. The third iteration is completed.
In the fourth iteration, we used data from the remaining 34 projects (see Fig. 6). There are no multivariate outliers in the bivariate non-Gaussian data from Fig. 6 (34 data points), so we go to step 2 of the fourth iteration.
We used the 34 data points from Fig. 6 to construct the model in form (10) and to estimate its parameters. Also, we calculated the confidence intervals of the nonlinear regression $\hat{Y}$ constructed by the decimal logarithm transformation of the 34 data points (see Fig. 6).
A computer program implementing the constructed models (10), (11), and (12) was developed to conduct the experiments. The program was written in the sci-language for the Scilab system. Scilab (https://www.scilab.org/) is free and open-source software, an alternative to commercial system modeling and simulation packages such as MATLAB and MATRIXx [23].

RESULTS
To compare with the prediction results for model (10), the prediction results $\hat{Y}$ (solid line) of the nonlinear regression models (11) and (12), together with their confidence (dotted lines) and prediction (dashed lines) intervals of the duration (in months) depending on the effort (in person-hours), are obtained for both univariate and bivariate transformations (see Figs. 7-10).

We evaluate the models by the standard metrics $R^2$, MMRE, and PRED(0.25). These metrics are applied in software engineering too [24,25]. The acceptable values of MMRE and PRED(0.25) are not more than 0.25 and not less than 0.75, respectively. The values of $R^2$, MMRE, and PRED(0.25) are shown in Table 2 for models (10)-(12) for both univariate and bivariate transformations. The values of these metrics are acceptable and approximately the same for all models. These values indicate good prediction accuracy of the models (10)-(12) for estimating the duration of software development in Java for PC.

The interval widths of nonlinear regression for the model (12) based on the Johnson bivariate transformation are smaller than for the model (12) with parameter estimates for the Johnson univariate transformation for 19 of the 34 data points. Also, the last result indicates the advantage of using the bivariate transformation in comparison to the univariate one.
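The accuracy metrics used above can be computed as in the following Python sketch. It assumes that all three metrics are evaluated on the original duration scale, that $R^2$ is calculated as one minus the ratio of the residual sum of squares to the total sum of squares, and that PRED(0.25) is the share of projects with MRE not greater than 0.25. The function names and the sample numbers in the usage note are hypothetical.

```python
from typing import Sequence


def mre(actual: float, predicted: float) -> float:
    """Magnitude of relative error for one project."""
    return abs(actual - predicted) / actual


def mmre(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean magnitude of relative error."""
    return sum(mre(a, p) for a, p in zip(actual, predicted)) / len(actual)


def pred(actual: Sequence[float], predicted: Sequence[float],
         level: float = 0.25) -> float:
    """Share of projects whose MRE does not exceed the given level."""
    hits = sum(1 for a, p in zip(actual, predicted) if mre(a, p) <= level)
    return hits / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SSE / SST."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


def is_acceptable(actual: Sequence[float], predicted: Sequence[float]) -> bool:
    """Acceptance thresholds stated in the text: MMRE <= 0.25, PRED(0.25) >= 0.75."""
    return mmre(actual, predicted) <= 0.25 and pred(actual, predicted) >= 0.75
```

For made-up durations `[10, 20, 40, 40]` and predictions `[12, 18, 40, 60]`, the individual MRE values are 0.2, 0.1, 0, and 0.5, so MMRE is 0.2 and PRED(0.25) is 0.75.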

DISCUSSION
We apply bivariate normalizing transformations to build the nonlinear regression model for estimating the duration of software development in Java for PC by the appropriate technique [11] since the error distribution of the linear regression model is not Gaussian, as the chi-squared test result indicates. Also, there are no outliers in the data. Moreover, the bivariate distribution of the data is not Gaussian, as the Mardia multivariate normality test based on measures of the multivariate skewness and kurtosis indicates. For this reason, we use the statistical technique [18] to detect multivariate outliers in the bivariate non-Gaussian data based on the bivariate normalizing transformations and the SMD for normalized data. Note that, without applying normalization, we obtain different bivariate outliers for the data from Table 1 than with the above technique [18].
Also note that, in our case, the poor normalization of the bivariate non-Gaussian data using the Box-Cox and Johnson univariate transformations leads to an increase in the widths of the confidence and prediction intervals of nonlinear regression for a larger number of data rows compared to the Box-Cox and Johnson bivariate transformations. This indicates the advantage of using the bivariate transformation in comparison to the univariate one.
The nonlinear regression model (10), in comparison with the other nonlinear regression models (11) and (12), has smaller widths of the confidence and prediction intervals for effort values greater than 900 person-hours. These results and the values of the prediction accuracy metrics from Table 2 indicate a preference for the simpler model (10) for estimating the duration of software development in Java for PC.

CONCLUSIONS
The important problem of increasing the confidence in estimating the duration of software development in Java for PC is solved.
The scientific novelty of the obtained results is that nonlinear regression models for estimating the duration of software development in Java for PC are constructed for the first time based on the Box-Cox and Johnson bivariate transformations. These models, in comparison with other nonlinear regression models, have smaller widths of the confidence and prediction intervals for effort values smaller than 900 person-hours.
The practical significance of the obtained results is that software implementing the constructed model has been developed in the sci-language for Scilab. The experimental results allow us to recommend the constructed model for use in practice.
Prospects for further research may include the application of bivariate normalizing transformations and data sets to construct the nonlinear regression models for estimating the duration of software development in other languages for PC and other platforms, for example, mainframe.