Imbalanced Data Set in Machine Learning: A Comparative Study

Authors

  • Paarth Gupta, DoCSE, SMVDU, Katra, Jammu and Kashmir, India
  • Pratyush Kumar, DoCSE, SMVDU, Katra, Jammu and Kashmir, India
  • Manoj Kumar, Assistant Professor, DoCSE, Katra, Jammu and Kashmir, India

Keywords:

Machine learning, Imbalanced data, Imbalanced clustering, Sampling, Classifiers, Self-Learning

Abstract

A system should be termed intelligent only when it is capable of self-learning. Machine learning, one of the most prominent fields of computer science, lets a system learn without explicit programming. A major challenge that machine learning practitioners face in real-life scenarios is uneven data distribution, which leads to an imbalanced data set. Proper distribution of elements across classes therefore plays a major role in achieving the self-learning goal. In an unevenly distributed data set, elements fall broadly into a majority (negative) class and a minority (positive) class. A data set whose classes occur in nearly equal proportions is called balanced; a data set is imbalanced when it has a minority class, i.e. a class that is rarer than the other (majority) classes. Handling the minority class is harder because the classification rules learned for it tend to be fewer and weaker than those for the majority class. Recent research in machine learning and data mining has provided deeper insight into the nature of imbalanced learning and has exposed new challenges, so the area remains popular in the research community. In this paper we focus on these challenges and the best-fit solutions available for them. Our aim is to find the best-fit solution by using different machine learning techniques or algorithms, which vary in how they approach the problem: sampling, clustering, graphical techniques, statistical techniques, or classifiers. This paper discusses the complications of imbalanced data sets and their solutions, along with lines of future research for each.
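To make the balanced/imbalanced distinction and the sampling approach concrete, the following is a minimal sketch in Python. It assumes a toy binary data set and uses plain random oversampling of the minority class, which is only one of many sampling strategies and is not necessarily the one evaluated in the paper. The function names `imbalance_ratio` and `random_oversample` are hypothetical and introduced here for illustration only.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until every
    class has as many samples as the largest class (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap to the largest class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy data (an assumption): 8 negative (majority) and 2 positive (minority) samples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
```

Here `imbalance_ratio(y)` is 4.0 before resampling and 1.0 afterwards. Because oversampling only duplicates existing minority samples, it adds no new information and can encourage overfitting; synthetic methods such as SMOTE were proposed partly to address that.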

Published

2017-09-30

Section

Research Articles

How to Cite

[1]
Paarth Gupta, Pratyush Kumar, Manoj Kumar, "Imbalanced Data Set in Machine Learning: A Comparative Study", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 2, Issue 7, pp. 272-279, September-2017.