A Survey on Different Techniques for Handling Missing Values in Dataset

Sukanya Gupta; Dr. Manoj Kumar Gupta

doi:10.32628/CSEIT411849

Authors

Sukanya Gupta, Department of computer and science, Shri Mata Vaishno Devi University, Katra, J&K, India
Dr. Manoj Kumar Gupta, Department of computer and science, Shri Mata Vaishno Devi University, Katra, J&K, India

Keywords:

Data Preprocessing, Imputation, Mean, Mode, Categorical Data, Numerical Data

Abstract

An abundance of information is collected and stored every day, and that data can be mined for interesting patterns. Collected data, however, is usually incomplete, and extracting information from it as-is may give misleading results. The data therefore needs to be preprocessed to remove such abnormalities before it is used. When only a small percentage of values is missing, the affected instances can be ignored, but when the proportion is large, ignoring them will not give the desired results. A large number of missing values in a dataset is a major obstacle for researchers because it can cause many problems in quantitative research. Preprocessing the data before applying any data mining technique helps avoid such fallacies and thereby improves the quality of the data. Many techniques for handling missing values have been proposed since 1980. The simplest is to ignore the records that contain missing values; others rely on imputation, which replaces the missing entries with estimates computed from the available data. This can increase the quality of the data and improve prediction results. This paper reviews the different techniques available for handling missing data, such as k-nearest neighbor (KNN) imputation, multiple imputation, case deletion, and the most common (MC) method.
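
The two simplest strategies named in the abstract, case deletion and replacing a missing entry with a computed estimate, can be sketched as follows. This is a minimal illustration only, assuming toy records in which `None` marks a missing value; the sample data and the helper names `case_deletion` and `impute` are hypothetical and do not come from the paper.

```python
from statistics import mean, mode

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 25, "city": "Katra"},
    {"age": None, "city": "Jammu"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Katra"},
]


def case_deletion(rows):
    """Drop every record that contains at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, column, estimator):
    """Fill missing entries in `column` with estimator(observed values)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = estimator(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in rows
    ]


# Case deletion keeps only the fully observed records.
deleted = case_deletion(records)

# Mean imputation for the numerical column, then most common value
# (mode) imputation for the categorical column.
imputed = impute(impute(records, "age", mean), "city", mode)
```

Case deletion shrinks the toy dataset from four records to two, which illustrates why it is only reasonable when few values are missing; imputation keeps all four records at the cost of introducing estimated values.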

