Imbalanced Data set in machine Learning : A Comparative Study

Authors: Paarth Gupta, Pratyush Kumar, Manoj Kumar

A system should be termed intelligent only when it is capable of self-learning. Machine learning, one of the most prominent fields of computer science, enables a system to learn on its own without explicit programming. A major challenge faced by machine learning practitioners in real-life scenarios is uneven data distribution, which leads to imbalanced data sets. A proper distribution of elements across classes therefore plays a major role in achieving the self-learning goal. In an unevenly distributed data set, the elements can be broadly categorized into the majority (negative) class and the minority (positive) class. A data set whose classes contain elements in nearly equal proportions is called balanced. A data set is imbalanced when it has a minority class, i.e. a class that is rarer than the other classes, which form the majority class. Dealing with the minority class is harder because the classification rules learned for it tend to be fewer and weaker than those for the majority classes. Recent research in machine learning and data mining has provided deeper insight into the nature of imbalanced learning and has also revealed new challenges. Thus, this area remains popular in the research community. In this paper we focus on these challenges and the best-fit solutions available for them. Our aim is to find the best-fit solution by using different machine learning techniques or algorithms. These algorithms may vary in their approach to the problem: sampling, clustering, graphical techniques, statistical techniques, or classifiers. This paper discusses the complications of imbalanced data sets and their solutions, along with lines of future research for each of them.
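To make the majority/minority distinction and the sampling approach mentioned in the abstract concrete, the following is a minimal Python sketch, not the authors' method. It measures how imbalanced a label set is and then applies plain random oversampling, one of the simplest sampling techniques. The function names `imbalance_ratio` and `random_oversample` and the toy 95/5 data set are assumptions introduced purely for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the largest class count divided by the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.

    Hypothetical helper for illustration only; it copies existing
    minority samples and does not synthesize new ones.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy data (assumed): 95 negative (majority) and 5 positive (minority) examples.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

print(imbalance_ratio(y))  # 19.0
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

After oversampling, both classes contain 95 examples. Because the added minority examples are exact copies, this approach can encourage overfitting to the few original minority samples, which is one reason more elaborate sampling techniques are studied.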

Authors and Affiliations

Paarth Gupta
DoCSE, SMVDU, Katra, Jammu and Kashmir, India
Pratyush Kumar
DoCSE, SMVDU, Katra, Jammu and Kashmir, India
Manoj Kumar
Assistant Professor, DoCSE, Katra, Jammu and Kashmir, India

Keywords: Machine learning, Imbalanced data, Imbalanced clustering, Sampling, Classifiers, Self-Learning

Publication Details

Published in : Volume 2 | Issue 7 | September 2017
Date of Publication : 2017-09-30
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 272-279
Manuscript Number : CSEIT174432
Publisher : Technoscience Academy

ISSN : 2456-3307

Cite This Article :

Paarth Gupta, Pratyush Kumar, Manoj Kumar, "Imbalanced Data set in machine Learning : A Comparative Study", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN : 2456-3307, Volume 2, Issue 7, pp.272-279, September-2017.
Journal URL : http://ijsrcseit.com/CSEIT174432
