THE METHODS FOR QUANTITATIVE SOLVING THE CLASS IMBALANCE PROBLEM
DOI:
https://doi.org/10.15588/1607-3274-2018-1-10
Keywords:
sample, example, quality metric, cluster, classifier, majority class, minority class.
Abstract
Context. The problem of restoring class balance in imbalanced samples is addressed in order to increase the efficiency of diagnostic and recognition models.
Objective. The purpose of the work is to modify an existing method of restoring class balance and to compare its performance indicators
with those of several modern methods.
Method. The proposed data preprocessing method combines undersampling with cluster analysis. The
method restores the class balance and reduces the sample while preserving important topological properties of the sample, high
accuracy, and an acceptable running time.
Results. Software implementing the proposed method has been developed and used in computational experiments to study
the method's properties and to compare it with other methods of restoring class balance.
Conclusions. The experiments confirmed the efficiency of the proposed method and of the software implementing it. The method
reduces the majority class to the size of the minority class, thereby reducing the training sample (a sample is considered imbalanced if the minority class makes up less than 10% of the original sample), while showing the best model accuracy indicators and comparable sampling speed. It can be recommended for practical use in handling imbalanced data for diagnostic and recognition models.
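The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: checking the 10% imbalance criterion and then shrinking the majority class to the minority-class size by clustering. It is not the authors' implementation. The function names (is_imbalanced, kmeans, cluster_undersample), the choice of plain Lloyd's k-means, the rule of keeping the real sample nearest to each centroid, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def is_imbalanced(y, threshold=0.10):
    """True if the smallest class holds less than `threshold` of the sample."""
    _, counts = np.unique(y, return_counts=True)
    return counts.min() / counts.sum() < threshold


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


def cluster_undersample(X, y, majority_label, seed=0):
    """Reduce the majority class to the minority size, one real sample per cluster.

    Assumes the majority class is larger than all other samples combined.
    """
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    k = len(min_idx)  # target size: as many clusters as minority samples
    X_maj = X[maj_idx]
    centroids = kmeans(X_maj, k, seed=seed)
    dists = np.linalg.norm(X_maj[:, None, :] - centroids[None, :, :], axis=2)
    chosen, taken = [], set()
    for j in range(k):
        # Keep the closest majority sample that has not been picked yet.
        for i in np.argsort(dists[:, j]):
            if int(i) not in taken:
                taken.add(int(i))
                chosen.append(int(i))
                break
    keep = np.sort(np.concatenate([maj_idx[chosen], min_idx]))
    return X[keep], y[keep]


# Synthetic demo data (assumption): 190 majority and 10 minority points in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (190, 2)), rng.normal(4, 1, (10, 2))])
y = np.array([0] * 190 + [1] * 10)

print(is_imbalanced(y))            # minority share is 10/200 = 0.05 < 0.10
X_bal, y_bal = cluster_undersample(X, y, majority_label=0)
print(np.bincount(y_bal))          # class sizes after undersampling
```

In this sketch every retained majority point is an original sample rather than a synthetic centroid, which is one simple way to keep the reduced sample close to the spatial layout of the original majority class. Other selection rules are equally plausible readings of the abstract.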
License
Copyright (c) 2018 D. A. Kavrin, S. A. Subbotin
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.