The Impact of Data Preprocessing on Machine Learning Model Performance: A Comprehensive Examination

Everleen Nekesa Wanyonyi; Newton Wafula Masinde

doi:10.32628/CSEIT25112854

Authors

Everleen Nekesa Wanyonyi, Department of Computer Science, Murang’a University of Technology, Murang’a, Kenya
Newton Wafula Masinde, Department of Computer Science, Jaramogi Oginga Odinga University of Science and Technology, Bondo, Kenya

Keywords:

Data preprocessing, Machine Learning, feature engineering, data imbalance, data cleaning, transformation

Abstract

Machine Learning (ML) models are widely applied across many fields to improve prediction. In cybersecurity, for instance, they examine large volumes of data, identify trends in that data, and draw insights from previous events to improve the detection of, and response to, cyber threats. Random Forest, Logistic Regression, K-Nearest Neighbor and LSTM are among the popular ML models widely used for anomaly detection. The accuracy of these models is therefore a cornerstone of an organization’s information systems security, since wrong predictions result in false positives and false negatives, which significantly reduce employees’ output and may frustrate workers as they interact with the information systems. Among the many factors that affect ML model performance, data preprocessing has been repeatedly emphasized. Using various publicly available datasets, this paper examines the impact of data preprocessing techniques on the performance of selected ML model architectures. Training time, accuracy, precision, recall and F1 score are used to evaluate the models’ performance.
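
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how accuracy, precision, recall and F1 score could be computed for a binary classifier, and how training time could be measured. It is not the authors' code: the function names `classification_metrics` and `timed`, the choice of `1` as the positive (anomalous) class, and the toy labels are all assumptions made for this example.

```python
import time


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task.

    `positive` marks the class treated as the anomaly (an assumption here).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds), e.g. around a training call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical toy labels: 1 = anomalous event, 0 = normal event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics, elapsed = timed(classification_metrics, y_true, y_pred)
print(metrics, f"computed in {elapsed:.6f} s")
```

Running the same computation on a model trained with and without a given preprocessing step would give one simple way to compare the two configurations on these measures.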
