Efficient Email Spam Classification with N-gram Features and Ensemble Learning

Authors

  • Prachi Bhatnagar, Research Scholar, Department of Computer Engineering, Sigma Institute of Engineering, Gujarat, India
  • Dr. Sheshang Degadwala, Professor & Head of Department, Department of Computer Engineering, Sigma University, Gujarat, India

DOI:

https://doi.org/10.32628/CSEIT2410220

Keywords:

N-gram features, TF-IDF weighting, SMOTE oversampling, Decision Trees, Random Forests, Ensemble Extra Trees

Abstract

In this paper, we present an approach to email spam classification that combines N-gram features, TF-IDF weighting, SMOTE oversampling, and tree-based classifiers, namely Decision Trees, Random Forests, and Ensemble Extra Trees. We preprocess the dataset to extract N-gram features, apply TF-IDF weighting to emphasize informative terms, and address class imbalance with SMOTE. We then train and evaluate the classification models and find that the Ensemble Extra Trees algorithm outperforms the others in accuracy, precision, recall, and F1-score. Experiments on benchmark datasets support the approach, showing improved spam detection accuracy and highlighting the potential of ensemble learning for email spam classification. This research contributes a robust and efficient method for identifying and categorizing spam emails.
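
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, the `smote_oversample` helper, the unigram-plus-bigram range, and all hyperparameters are assumptions made for the example. The helper is a simplified stand-in for SMOTE written with NumPy; in practice a library implementation such as `SMOTE` from imbalanced-learn would likely be used instead.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus (1 = spam, 0 = ham); not the paper's dataset.
train_texts = [
    "meeting moved to friday afternoon", "please review the attached report",
    "lunch tomorrow with the team", "your invoice for last month is attached",
    "can you send the project notes", "see you at the conference next week",
    "win a free prize now", "claim your free cash prize today",
    "cheap loans click now to win",
]
train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
test_texts = ["free prize click now", "notes from the team meeting"]
test_labels = np.array([1, 0])


def smote_oversample(X, y, minority_label, k=3, seed=0):
    """Simplified binary SMOTE: create synthetic minority rows by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances among minority samples; exclude self-matches.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_needed, minority_label)])


# N-gram extraction with TF-IDF weighting (unigrams and bigrams assumed here).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

# Oversample the training split only, so the test data stays untouched.
X_bal, y_bal = smote_oversample(X_train, train_labels, minority_label=1)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, pred, average="binary", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, results[name])
```

The vectorizer is fitted on the training texts alone and the oversampling touches only the training split, which keeps the held-out messages free of synthetic samples. Scores on a corpus this small are not meaningful; the sketch only shows how the stages connect.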

Published

28-03-2024

Issue

Vol. 10, No. 2 (2024)

Section

Research Articles

How to Cite

[1]
P. Bhatnagar and S. Degadwala, “Efficient Email Spam Classification with N-gram Features and Ensemble Learning,” Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol., vol. 10, no. 2, pp. 278–284, Mar. 2024, doi: 10.32628/CSEIT2410220.
