SMS Spam Filteration Using Text Features and Supervised Machine Learning Algorithms
DOI:
https://doi.org/10.32628/CSEIT2410452
Keywords:
Spam, SMS, Preprocessing, TF-IDF, BOW, Supervised ML
Abstract
Technological advancements have had an immense effect on every aspect of life, including travel, office work, music, healthcare, and communication. In the past, people communicated over telephone lines; wireless technology, which offers far more functionality than wired telephony, now prevails. SMS is widely used by spammers and advertising firms to reach the general public and distribute promotional material. This explains why over 60% of spam SMS are sent and received every day. Although these spam messages irritate users and occasionally defraud unsuspecting ones, spammers and advertising businesses profit handsomely from them. This paper proposes a method for classifying ham and spam SMS using supervised machine learning. Features are extracted from the data using bag-of-words and Term Frequency-Inverse Document Frequency (TF-IDF). The class imbalance in the SMS dataset was addressed by applying both oversampling and undersampling techniques. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression are applied to the spam and ham SMS data, and their performance is assessed using F1 score, accuracy, precision, and recall. According to the experimental results, random forest classifies spam and ham SMS most accurately, 99% of the time.
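The feature extraction and rebalancing steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`tokenize`, `tfidf`, `random_oversample`) and the three toy messages are hypothetical, the weighting tf * log(N / df) is only one common TF-IDF formulation, and plain random oversampling is used as a stand-in because the abstract does not say which oversampling technique was applied.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for the
    # paper's preprocessing, which the abstract does not detail.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag-of-words counts for this document
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data, not taken from the paper's dataset.
messages = ["win a free prize now", "are we meeting today", "see you at lunch today"]
labels = ["spam", "ham", "ham"]

vectors = tfidf(messages)
balanced_x, balanced_y = random_oversample(vectors, labels)
```

The balanced vectors would then be passed to a classifier such as a random forest. Terms that occur in fewer messages (for example "free") receive a higher weight than terms shared across messages (for example "today").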
License
Copyright (c) 2024 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.