THE METHODS FOR QUANTITATIVE SOLVING THE CLASS IMBALANCE PROBLEM

D. A. Kavrin, S. A. Subbotin

Abstract


Context. The problem of restoring class balance in imbalanced samples is addressed in order to improve the efficiency of diagnostic and
recognition models.
Objective. The purpose of the work is to modify an existing method of restoring class balance and to compare its performance indicators
with those of several modern methods.
Method. The proposed data preprocessing method combines undersampling with cluster analysis. It restores class balance and reduces
the sample while preserving the sample's important topological properties, high accuracy, and an acceptable running time.
Results. Software implementing the proposed method has been developed and used in computational experiments to study the method's
properties and to compare it with other methods of restoring class balance.
Conclusions. The experiments confirmed the efficiency of the proposed method and of the software that implements it. The method
reduces the majority class to the size of the minority class, thereby reducing the training sample, while demonstrating the best model accuracy indicators among the compared methods and a comparable sampling speed. (A sample is considered imbalanced if the minority class makes up less than 10% of the original sample.) The method can be recommended for practical use in handling imbalanced data for diagnostic and recognition models.
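
The idea of combining undersampling with cluster analysis can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm or the rule for selecting examples, so the sketch assumes k-means (Lloyd iterations) with one real majority example kept per cluster. The function names `is_imbalanced`, `kmeans`, and `cluster_undersample` are hypothetical, and the 10% threshold is the criterion stated in the Conclusions.

```python
import math
import random


def is_imbalanced(labels, threshold=0.10):
    """True if the rarest class holds less than `threshold` of the sample."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return min(counts.values()) / len(labels) < threshold


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations (an assumed clustering step); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a cluster became empty
                centroids[j] = [sum(c) / len(g) for c in zip(*g)]
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Shrink `majority` to len(minority) examples, one per cluster."""
    k = len(minority)
    if len(majority) <= k:
        return list(majority)
    centroids = kmeans(majority, k, seed=seed)
    kept, used = [], set()
    for c in centroids:
        # Keep the closest real example that has not been selected yet,
        # so the result contains exactly k distinct majority examples.
        i = min((i for i in range(len(majority)) if i not in used),
                key=lambda i: math.dist(majority[i], c))
        used.add(i)
        kept.append(majority[i])
    return kept


if __name__ == "__main__":
    rng = random.Random(1)
    majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(190)]
    minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]
    labels = [0] * len(majority) + [1] * len(minority)
    reduced = cluster_undersample(majority, minority)
    print(is_imbalanced(labels), len(reduced))  # prints: True 10
```

Keeping real examples nearest to the centroids, rather than the centroids themselves, is one possible design choice: the reduced sample then contains only observed data while still covering each region of the majority class.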

Keywords


sample; example; quality metric; cluster; classifier; majority class; minority class.

DOI: https://doi.org/10.15588/1607-3274-2018-1-10



Copyright (c) 2018 D. A. Kavrin, S. A. Subbotin

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Address of the journal editorial office:
Editorial office of the journal «Radio Electronics, Computer Science, Control»,
Zaporizhzhya National Technical University, 
Zhukovskiy street, 64, Zaporizhzhya, 69063, Ukraine. 
Telephone: +38-061-769-82-96 – the Editing and Publishing Department.
E-mail: rvv@zntu.edu.ua

The reference to the journal is obligatory in the cases of complete or partial use of its materials.