The Impact of Data Preprocessing on Machine Learning Model Performance: A Comprehensive Examination
DOI:
https://doi.org/10.32628/CSEIT25112854
Keywords:
Data preprocessing, Machine Learning, feature engineering, data imbalance, data cleaning, transformation
Abstract
Machine Learning (ML) models are widely applied across many fields to improve prediction. In cybersecurity, for instance, they examine large volumes of data, identify trends, and draw insights from previous events to improve the detection of, and response to, cyber threats. Random Forest, Logistic Regression, K-Nearest Neighbor and LSTM are among the popular ML models widely used for anomaly detection. The accuracy of these models is therefore a cornerstone of an organization’s information systems security: wrong predictions produce false positives and false negatives, which significantly reduce employees’ output and may frustrate workers as they interact with the information systems. Among the many factors that affect ML model performance, data preprocessing has been repeatedly emphasized. Using various publicly available datasets, this paper examines the impact of data preprocessing techniques on the performance of selected ML model architectures. Training time, Accuracy, Precision, Recall and F1 score are used to evaluate the models’ performance.
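As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes Accuracy, Precision, Recall and F1 score for a binary classifier from lists of true and predicted labels. It uses the standard textbook definitions; the function name `classification_metrics` and the toy labels are hypothetical and do not come from the paper, whose own evaluation code is not shown here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration only; not the paper's code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed data): 1 = anomaly, 0 = normal.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics come out to 0.75. Training time, the remaining measure in the abstract, could be taken separately by timing the model's fitting step.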
License
Copyright (c) 2025 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.