RESEARCH OF METHODS FOR IMPROVING THE QUALITY OF CLASSIFICATION ON IMBALANCED DATA
DOI:
https://doi.org/10.26906/SUNZ.2023.2.087
Keywords:
classification, imbalanced data, data balancing, Undersampling, Oversampling, SMOTEENN, SVMSMOTE, BorderlineSMOTE, ADASYN, SMOTE, KMeansSMOTE
Abstract
The subject of the study is methods for balancing raw data. The purpose of the article is to improve the quality of intrusion detection in computer networks by applying class balancing methods. Task: to investigate class balancing methods and to develop a method for classifying imbalanced data in order to increase the level of network security. Methods used: artificial intelligence and machine learning. Results: class balancing methods based on undersampling, oversampling, and their combinations were studied. The following methods were selected for further research: SMOTEENN, SVMSMOTE, BorderlineSMOTE, ADASYN, SMOTE, and KMeansSMOTE. The UNSW-NB15 dataset, which contains records of network operation both under normal conditions and during intrusions, was used as the source data. A decision tree based on the CART (Classification And Regression Tree) algorithm was used as the base classifier. The results show that the SMOTEENN method improves the quality of intrusion detection in network operation. Conclusions: the scientific novelty of the results lies in the combined use of data balancing methods and a decision-tree-based classification method to detect intrusions into computer networks, which made it possible to reduce the number of Type II errors.
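To make the balancing step concrete, the following is a minimal standard-library sketch of the idea behind SMOTEENN: SMOTE oversampling of the minority class followed by Wilson's edited-nearest-neighbours cleaning. It is an illustration under stated assumptions, not the authors' implementation: the function names (`smote`, `edited_nearest_neighbours`, `smote_enn`), the neighbour counts, and the two-dimensional toy data are all hypothetical, and the toy data merely stands in for the UNSW-NB15 records. In practice a library implementation such as `SMOTEENN` from `imblearn.combine` would normally be used instead.

```python
import math
import random
from collections import Counter


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label agrees with the majority label of
    their k nearest neighbours (Wilson's editing rule)."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        rest = [(xj, yj) for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        nearest = sorted(rest, key=lambda p: math.dist(xi, p[0]))[:k]
        majority = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        if majority == yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y


def smote_enn(X, y, minority_label=1, k_smote=5, k_enn=3, seed=0):
    """Oversample the minority class up to the majority size with SMOTE,
    then clean the combined set with edited nearest neighbours."""
    rng = random.Random(seed)
    minority = [x for x, lbl in zip(X, y) if lbl == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k_smote, rng)
    X_over = list(X) + new_points
    y_over = list(y) + [minority_label] * len(new_points)
    return edited_nearest_neighbours(X_over, y_over, k_enn)


# Toy stand-in data: 60 "normal" records (label 0), 8 "intrusion" records (label 1).
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
attack = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(8)]
X = normal + attack
y = [0] * 60 + [1] * 8

X_bal, y_bal = smote_enn(X, y)
print("before:", Counter(y), "after:", Counter(y_bal))
```

The balanced set `X_bal`, `y_bal` would then be used to train the classifier, while evaluation (including the count of Type II errors, that is, intrusions classified as normal traffic) should be done on data that was not resampled.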