Comprehensive Review of Multiclass Text Classification using the 20 Newsgroup Dataset

Authors

  • Michael Babatunde Adewoye, Faculty of Technology, Department of Computer Science, University of Sunderland, United Kingdom
  • Dr. Safina Ara, Faculty of Technology, Department of Computer Science, University of Sunderland, United Kingdom

DOI:

https://doi.org/10.32628/CSEIT241061166

Abstract

This dissertation presents a comparative study of machine learning and deep learning algorithms for multiclass text classification on the 20 Newsgroups dataset. The CRISP-DM methodology was followed, and each stage from data collection to result evaluation was documented step by step. Exploratory data analysis (EDA) was carried out before and after cleaning and preprocessing, and the tokenized text was visualised as a word cloud. The dataset was preprocessed by tokenization, stopword removal and lowercasing, then vectorized using TF-IDF. The models implemented were Naïve Bayes, KNN, XGBoost, Logistic Regression, Random Forest, Decision Tree, SVM, CNN, ANN, RNN and LSTM. Notably, SVM achieved the highest accuracy and outperformed every deep learning model. The confusion matrices showed in more detail where each model struggled. Further work could investigate why the deep learning models did not outperform the classical machine learning models. The comparative analysis can be validated by running the attached code.
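The preprocessing steps named in the abstract (tokenization, lowercasing, stopword removal, TF-IDF vectorization) can be illustrated with a minimal sketch. This is not the authors' code: the stopword list, the function names `preprocess` and `tfidf`, the toy documents, and the unsmoothed weighting tf × log(N / df) are all assumptions chosen for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalised variant by default, so their values will differ.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}


def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy stand-in documents (assumed), not taken from the 20 Newsgroups data.
docs = [
    "The goalie stopped the puck in the hockey game",
    "The shuttle launch is on the orbit schedule",
    "Hockey fans cheered the game",
]
vectors = tfidf(docs)
```

In this sketch a term that occurs in only one document, such as "puck", receives a higher weight than one shared across documents, such as "hockey", which is the property that makes TF-IDF features useful to the classifiers compared in the study.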

Published

30-11-2024

Section

Research Articles