Review of Big Data Pre-processing

Authors

  • V. Maria Antoniate Martin  Research Scholar, Department of Computer Science, Research and Development Centre, Bharathiar University, Coimbatore, Tamil Nadu, India
  • Dr. K. David  Assistant Professor, Department of Computer Science, The Rajah’s College, Pudukkottai, Tamil Nadu, India
  • N. Bala Sankar  Student, Department of Information Technology, St. Joseph’s College, Trichy, Tamil Nadu, India

Keywords:

Big Data, Pre-processing, Data Quality

Abstract

Recent years have seen massive growth in the scale of data, a key factor in the Big Data scenario. Big Data can be defined as data of high volume, velocity, and variety that requires new high-performance processing. Handling big data is a challenging and time-consuming task that requires a large computational infrastructure to ensure successful data processing and analysis. This paper reviews data pre-processing methods for data mining in big data. It introduces the definition, characteristics, and categorization of data pre-processing approaches in big data. It also examines the connection between big data and data pre-processing across all families of methods and big data technologies, including a review of the state of the art. In addition, research challenges are discussed, with a focus on developments in different big data frameworks, such as Hadoop, Spark, and Flink, and on encouraging substantial research effort in some families of data pre-processing methods and in applications to new big data learning paradigms.

Published

2018-04-30

Section

Research Articles

How to Cite

[1]
V. Maria Antoniate Martin, Dr. K. David, N. Bala Sankar, "Review of Big Data Pre-processing," International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 3, Issue 3, pp. 1499-1503, March-April 2018.