A Two Steps Approach for Afan Oromo Nonfiction Text Categorization

Naol Bakala Defersha; Getachow Mamo

doi:10.32628/CSEIT1183115

Authors

Naol Bakala Defersha Department of Computer Science, Wollega University school of Graduate study, Post Graduate Coordinator, College of Engineering and Technology, Nekemte, Ethiopia, India
Getachow Mamo Assistant Professor, Department of Computer Science, Wollega University school of Graduate study, Post Graduate Coordinator, College of Engineering and Technology, Nekemte, Ethiopia, India

Keywords:

Afan Oromo, nonfiction text, text clustering, text classification, natural language processing

Abstract

This study presents an approach to Afan Oromo text categorization that combines clustering and classification. As the volume of electronic text in natural languages such as Afan Oromo grows, it becomes difficult to filter, manage, store, and process the desired information. One solution is a tool that categorizes text documents according to their content. The aim of this study was to design and implement an Afan Oromo nonfiction text categorization model and to examine the application of machine learning techniques to automatic Afan Oromo nonfiction text categorization. Data were collected from the Oromia Culture and Tourism Bureau, the Oromo Cultural Center, online electronic documents, and other available nonfiction books. Python was used to tokenize the text, remove stop words, and stem Afan Oromo words, whereas R was used for document indexing, normalization, cosine similarity computation, and preparing documents for machine learning. Weka with Java was used to split the document data set into a training set and a test set, and the Weka tool was used for clustering and classification. Using the k-means algorithm, document clustering was performed four times to obtain classes of documents. Of these clustering runs, one produced good clusters, yielding 8 main categories. The J48, NaïveBayes, BayesNet, and SMO classifier algorithms were then used to train text classification models on these 8 main classes. Among these algorithms, J48 showed the highest performance, 94.3755%, and was therefore used to construct the classification model. This work shows that machine learning techniques can be applied to Afan Oromo nonfiction text categorization. Further research on Afan Oromo nonfiction text categorization is recommended to improve on these findings.
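
The preprocessing and similarity steps named in the abstract (tokenization, stop-word removal, term weighting, normalization, cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' code: the stop-word list, the sample sentences, the TF-IDF weighting formula, and all function names are assumptions made for illustration, and stemming is omitted because the study's stemmer is not described here. The study itself reports using R for the indexing and similarity steps.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NOT the list used in the study.
STOP_WORDS = {"fi", "kan", "akka", "keessa"}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    An apostrophe inside a word is kept as part of the token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def preprocess(text):
    """Tokenize and drop stop words (stemming is omitted in this sketch)."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (a dict) per document.

    The smoothed IDF used here is an assumption, not the study's formula.
    """
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already L2-normalized sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


# Invented sample sentences, for illustration only.
docs = [
    "Aadaa fi seenaa Oromoo",
    "Seenaa Oromoo kan durii",
    "Barnoota herregaa fi saayinsii",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this sketch the first two sample documents share terms and so score a higher cosine similarity than the first and third, which share none once stop words are removed. Vectors of this kind are what a clustering algorithm such as k-means would then group.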

