SMS Spam Filteration Using Text Features and Supervised Machine Learning Algorithms

Rashmi Pandey; Pushpendra Prajapati; Vibhanshu Kumar Singh; Mayank Tyagi; Chetan Anand Amb

doi:10.32628/CSEIT2410452

Authors

Rashmi Pandey, Department of MCA, Institute of Technology and Management, Gwalior, Madhya Pradesh, India
Pushpendra Prajapati, Department of MCA, Institute of Technology and Management, Gwalior, Madhya Pradesh, India
Vibhanshu Kumar Singh, Department of MCA, Institute of Technology and Management, Gwalior, Madhya Pradesh, India
Mayank Tyagi, Department of MCA, Institute of Technology and Management, Gwalior, Madhya Pradesh, India
Chetan Anand Amb, Department of MCA, Institute of Technology and Management, Gwalior, Madhya Pradesh, India

Keywords:

Spam, SMS, Preprocessing, TF-IDF, BOW, Supervised ML

Abstract

Over time, technological advancements have had an immense effect on every aspect of life, including travel, office work, music, healthcare, and communication. In the past, people communicated over telephone lines. Wireless technology, which offers far more functionality than wired telephony, now prevails. SMS is used mostly by spammers and advertising firms to reach the general public and distribute promotional material. This explains why spam makes up over 60% of the SMS sent and received every day. Although these spam messages irritate users and occasionally defraud unsuspecting ones, spammers and advertisers benefit handsomely from them. This paper proposes a method for classifying SMS as ham or spam using supervised machine learning. Features are extracted from the text with bag-of-words and Term Frequency-Inverse Document Frequency (TF-IDF). The class imbalance in the SMS dataset is addressed by applying both oversampling and undersampling. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression are trained on the spam and ham SMS data and evaluated using accuracy, precision, recall, and F1 score. According to the experimental results, random forest classifies spam and ham SMS more accurately than the other models, 99% of the time.
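
The steps named in the abstract (bag-of-words counts, TF-IDF weighting, rebalancing, and precision/recall/F1 scoring) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the toy messages, the function names, the whitespace tokenizer, the ln(N/df) form of IDF, and random duplication as the oversampling strategy are all assumptions, and no classifier is trained here.

```python
import math
import random
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (assumed stand-in for real preprocessing)."""
    return text.lower().split()

def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tf_idf(count_vectors):
    """Weight raw counts by idf = ln(N / df), one common TF-IDF variant (assumed)."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [[vec[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for vec in count_vectors]

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy corpus: one spam message against three ham messages.
messages = [
    "win a free prize now",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you are free",
]
labels = ["spam", "ham", "ham", "ham"]

vocab, counts = bag_of_words(messages)
weights = tf_idf(counts)
balanced_rows, balanced_labels = random_oversample(weights, labels)
```

In this sketch, terms that occur in fewer messages receive larger TF-IDF weights, and the balanced rows would be what a classifier such as random forest is trained on. Resampling should be applied to the training split only, so that duplicated messages do not leak into the evaluation data.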
