SMS Spam Filteration Using Text Features and Supervised Machine Learning Algorithms
DOI:
https://doi.org/10.32628/CSEIT2410452
Keywords:
Spam, SMS, Preprocessing, TF-IDF, BOW, Supervised ML
Abstract
Technological advances have reshaped nearly every aspect of life, including travel, office work, music, healthcare, and communication. People once communicated over wired telephone lines; wireless technology, which offers far more functionality, now prevails. Spammers and advertising firms make heavy use of SMS to reach the general public and distribute promotional material. This explains why spam reportedly makes up over 60% of the SMS messages sent and received every day. These messages irritate users and occasionally defraud unsuspecting recipients, while spammers and advertisers profit handsomely from them. This paper proposes a method for classifying SMS messages as ham or spam using supervised machine learning. Features are extracted from the text with bag-of-words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). The class imbalance in the SMS dataset is addressed by applying both oversampling and undersampling. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression are trained on the spam and ham SMS data, and their performance is assessed by accuracy, precision, recall, and F1 score. According to the experimental results, random forest performs best, classifying spam and ham SMS correctly 99% of the time.
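The feature extraction, resampling, and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the tokenizer is a plain whitespace split, the TF-IDF weighting (term count divided by document length, times ln(N/df)) is one common variant chosen as an assumption, and only random oversampling is shown. The classifiers themselves are omitted.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real preprocessing does more.
    return text.lower().split()


def bag_of_words(docs):
    # One Counter of raw term counts per document.
    return [Counter(tokenize(d)) for d in docs]


def tf_idf(docs):
    # tf = count / document length, idf = ln(N / df); other variants exist.
    bows = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = Counter(term for bow in bows for term in bow)
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bow.items()
        })
    return vectors


def random_oversample(samples, labels, seed=0):
    # Duplicate randomly chosen samples of each smaller class until every
    # class is as large as the largest one.
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def precision_recall_f1(y_true, y_pred, positive="spam"):
    # Metrics for the positive class; 0.0 is returned when a ratio is undefined.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
docs = ["free prize now", "see you at lunch", "lunch at noon"]
labels = ["spam", "ham", "ham"]
vectors = tf_idf(docs)
balanced_docs, balanced_labels = random_oversample(docs, labels)
```

Resampling of this kind is normally applied to the training split only, so that duplicated messages do not leak into the data used for evaluation.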
License
Copyright (c) 2024 International Journal of Scientific Research in Computer Science, Engineering and Information Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.