Md. Mahin; Md. Jahidul Islam; Ayesha Khatun; Sumaiya Kabir; Biplab Chandra Debnath; Misbah Ul Hoque

Published: Apr 28, 2022

Keywords:

Sub-category, Imbalance data, Minority data, KNN, Local accuracy, Classification

Abstract

Sub-concepts within the minority class are the prime reason for the degraded performance of machine learning algorithms in classification. Minority-class examples can be separated into four categories based on their neighborhood: safe, borderline, rare, and outlier. The main aim of this research is to improve the categorization of the minority class by treating the distance measure as a parameter that is selected dynamically within the previous methodology. The research categorizes the imbalanced minority data using the distance measure that, after tuning, provides the best G-mean performance for the given dataset. To evaluate performance on the different sub-categories, n-times repeated stratified k-fold cross-validation is employed, which accounts for the low number of samples within each sub-category and reduces bias and variance. The improved methodology has been applied to five data sets from the UCI digital library. It is observed that classifiers recognize safe data easily, while performance degrades progressively for borderline samples, degrades further for rare samples, and is mostly poor for outlier samples.
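The neighborhood-based categorization described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not give the neighborhood size or the category thresholds, so the sketch assumes the commonly used k = 5 scheme of Napierala and Stefanowski (4 to 5 same-class neighbors: safe; 2 to 3: borderline; 1: rare; 0: outlier). The function names `categorize_minority` and `g_mean`, the two candidate distance functions, and the toy data are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def categorize_minority(X, y, minority_label, k=5, distance=euclidean):
    """Label each minority sample by the share of its k nearest
    neighbours that belong to the minority class (assumed thresholds)."""
    categories = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Distances to every other sample, nearest first; keep the k closest.
        neighbours = sorted(
            (distance(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        same = sum(1 for _, j in neighbours if y[j] == minority_label)
        ratio = same / k
        if ratio > 0.7:
            categories[i] = "safe"
        elif ratio > 0.3:
            categories[i] = "borderline"
        elif ratio > 0.1:
            categories[i] = "rare"
        else:
            categories[i] = "outlier"
    return categories

def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy data: a tight minority cluster, a majority cluster, and one
# minority point placed inside the majority region.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0),              # minority
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (12, 12),    # majority
     (10.2, 10.3)]                                                      # minority
y = [1] * 6 + [0] * 6 + [1]

cats_euclidean = categorize_minority(X, y, minority_label=1, distance=euclidean)
cats_manhattan = categorize_minority(X, y, minority_label=1, distance=manhattan)
score = g_mean([1, 1, 0, 0], [1, 0, 0, 0], positive=1)
```

Passing the distance as an argument is what makes it tunable: in a setup like the one the abstract describes, each candidate measure would be tried in turn and the one giving the best cross-validated G-mean kept for the dataset. How the authors perform that selection is not specified in the abstract.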

Issue:

Vol. 4 No. 1 (2017): Volume 04, Issue 01, December-2017

Area:

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
