Bias and Its Consequences: A Study of Machine Learning Performance
DOI: https://doi.org/10.32628/CSEIT241051088
Keywords: Machine Learning, Fairness in AI, Class imbalance and Bias Impact Analysis
Abstract
This paper examines how bias in training data affects the results of machine learning models. It uses the Adult Income dataset from OpenML for income classification. Bias is induced by underrepresenting people who earn <= $50K in the training data, which makes it possible to observe how different models behave when trained on a skewed class distribution. Two key metrics, accuracy and specificity (true negative rate), were analyzed for the unbiased and biased training scenarios. The results show that the Naive Bayes and Random Forest models were resistant to the induced bias, whereas others, including SVM and Logistic Regression, suffered major performance drops. The study sheds light on the robustness of different classifiers when exposed to biased data and points to the need for bias mitigation strategies in real-world applications. It also critically examines how bias in training data can significantly affect predictive performance, fairness, and model selection in income classification tasks.
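The setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names `underrepresent` and `accuracy_and_specificity`, the `keep_fraction` parameter, the toy labels, and the choice of "<=50K" as the negative class for specificity are all assumptions made here for illustration.

```python
import random


def underrepresent(rows, labels, target, keep_fraction, seed=0):
    """Keep only `keep_fraction` of the rows whose label equals `target`.

    Hypothetical helper: one simple way to induce the kind of skew the
    abstract describes; rows of every other class are kept unchanged.
    """
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    n_keep = int(len(target_idx) * keep_fraction)
    kept = set(rng.sample(target_idx, n_keep))
    keep = [i for i, y in enumerate(labels) if y != target or i in kept]
    return [rows[i] for i in keep], [labels[i] for i in keep]


def accuracy_and_specificity(y_true, y_pred, negative):
    """Return (accuracy, specificity), with specificity = TN / (TN + FP)."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == negative and p == negative)
    fp = sum(1 for t, p in pairs if t == negative and p != negative)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, specificity


# Toy training labels (assumed, not taken from the Adult Income dataset).
labels = ["<=50K"] * 8 + [">50K"] * 2
rows = list(range(10))

# Biased training set: keep a quarter of the "<=50K" rows.
biased_rows, biased_labels = underrepresent(rows, labels, "<=50K", 0.25)

# Toy predictions scored with "<=50K" assumed to be the negative class.
y_true = ["<=50K", "<=50K", "<=50K", ">50K"]
y_pred = ["<=50K", ">50K", ">50K", ">50K"]
acc, spec = accuracy_and_specificity(y_true, y_pred, "<=50K")
```

In this toy case the predictions misclassify two of the three "<=50K" examples, so specificity falls to 1/3 while accuracy is 0.5. Reporting both metrics separates the damage done to the underrepresented class from the overall score.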
Copyright (c) 2024 International Journal of Scientific Research in Computer Science, Engineering and Information Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.