PROCESSING INCOMPLETE DATA IN CLUSTER TASKS
DOI:
https://doi.org/10.26906/SUNZ.2019.5.045
Keywords:
clustering, incomplete data, data processing and analysis, Data Mining, FCM, data recovery methods, R programming language
Abstract
The subject of this research is methods for preparing and processing input data that contain missing values, for subsequent analysis and clustering. The goal is to review existing methods for handling gaps in data in clustering problems and to assess how appropriate they are in real situations. The tasks are: to analyze the advantages and disadvantages of each data recovery method; to determine how appropriate each method is for clustering tasks, compare the methods with each other, and identify the most suitable ones; and to evaluate performance by comparing the clustering of the recovered data with the clustering of the reference data. Methods used: the FCM method for the clustering itself; deletion of all rows that contain missing values; filling in missing values with sample statistics; and filling in missing values with the structure of relationships in the data taken into account. Results: the effectiveness of these methods in preparing data for clustering depends on the number of missing values in the original set. If few rows contain missing values, any of the considered methods can be used to obtain the required results. If many rows contain them, for example 30%, the methods that replace missing values are the most acceptable; however, such replacement can distort the data and, ultimately, the results. Conclusions. The scientific novelty lies in investigating the problem of clustering incomplete data and examining methods that can solve it, and in conducting experiments, comparing the results of each method, and drawing conclusions about the advisability and side effects of using each of them. The practical significance of the paper lies in determining whether these data processing methods can be used for clustering in real tasks, where the data are usually not ideal and most likely contain empty values.
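The workflow outlined in the abstract can be illustrated with a minimal sketch. It is written in Python with only the standard library, although the keywords indicate that the paper itself works in R, and it is not the authors' implementation. The function names, the use of `None` to mark a missing value, the column mean as the sample statistic, and the Rand index as the agreement measure are all assumptions made for illustration. Filling that takes the structure of relationships into account is not shown.

```python
import math
import random


def drop_incomplete_rows(rows):
    """Listwise deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if None not in row]


def fill_with_column_means(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def fcm(points, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Minimal fuzzy c-means on complete data; returns (centers, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in points:
        raw = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(raw)
        u.append([r / total for r in raw])
    centers = []
    for _ in range(n_iter):
        # Centers: means of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [row[k] ** m for row in u]
            w_sum = sum(w)
            centers.append([sum(wi * p[d] for wi, p in zip(w, points)) / w_sum
                            for d in range(dim)])
        # Memberships: proportional to distance ** (-2 / (m - 1)).
        new_u = []
        for p in points:
            inv = [max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0))
                   for c in centers]
            total = sum(inv)
            new_u.append([v / total for v in inv])
        change = max(abs(a - b)
                     for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centers, u


def hard_labels(memberships):
    """Assign each point to the cluster with its largest membership."""
    return [row.index(max(row)) for row in memberships]


def rand_index(labels_a, labels_b):
    """Share of point pairs on which two partitions agree (label names ignored)."""
    n = len(labels_a)
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical toy data: two groups, with gaps introduced by hand.
    reference = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
    damaged = [row[:] for row in reference]
    damaged[1][0] = None
    damaged[6][1] = None
    _, u_ref = fcm(reference, 2)
    _, u_rec = fcm(fill_with_column_means(damaged), 2)
    print(rand_index(hard_labels(u_ref), hard_labels(u_rec)))
```

Under these assumptions, a Rand index close to 1 means that clustering the recovered data reproduces the reference partition. Applying the same comparison to listwise deletion would require restricting the reference labels to the rows that survive deletion, because deletion changes the number of rows.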