A Comparison of Text Classification Techniques Applied to Indonesian Text Dataset

Umniy Salamah

doi:10.32628/CSEIT195629

Authors

Umniy Salamah, Faculty of Computer Science, Universitas Mercu Buana, Jakarta Barat, Indonesia

DOI:

https://doi.org/10.32628/CSEIT195629

Keywords:

Indonesian Text, Logistic Regression, Text Classification, XGBoost

Abstract

Statements containing opinions and complaints about an organization's services or programs can be processed with machine learning, and the results can help the organization improve its quality. This research classifies reports from social media as complaint or non-complaint using two machine learning algorithms: Logistic Regression (LR) and eXtreme Gradient Boosting (XGBoost). The Logistic Regression model uses CountVectorizer and TfidfVectorizer feature extraction. The XGBoost algorithm exposes multiple parameters, so its performance can be improved by tuning them, namely eta (learning rate), gamma, max_depth, min_child_weight, subsample, colsample_bytree, and alpha. The best XGBoost parameter values found are 'reg_alpha': 0.01, 'colsample_bytree': 0.9, 'learning_rate': 0.5, 'min_child_weight': 1, 'subsample': 0.8, 'max_depth': 3, 'gamma': 0.0, for which the computational time is 13870.012468 and the best accuracy achieved is 0.927943760984. The performance evaluation results for Logistic Regression using TfidfVectorizer and CountVectorizer feature extraction are 0.9181 and 0.9356, respectively.
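
To make the two feature extraction schemes named in the abstract concrete, the following is a minimal pure-Python sketch of raw term counts versus TF-IDF weights. It is not the author's code: the three sentences are toy examples invented here, the helper names (`tokenize`, `count_vector`, `tfidf_vectors`) are hypothetical, and the smoothed IDF with L2 normalization is assumed to mirror scikit-learn's documented `TfidfVectorizer` defaults rather than taken from the paper.

```python
import math
from collections import Counter

# Toy sentences invented for illustration; they are NOT from the paper's dataset.
docs = [
    "layanan lambat sekali",       # complaint-like
    "layanan bagus terima kasih",  # non-complaint-like
    "aplikasi lambat dan error",   # complaint-like
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


# Shared vocabulary: every distinct token in the corpus, in sorted order.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})


def count_vector(doc):
    """Raw term counts per vocabulary entry (the idea behind count features)."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tfidf_vectors(corpus):
    """TF-IDF rows using a smoothed IDF, ln((1 + n) / (1 + df)) + 1,
    followed by L2 normalization of each row (assumed defaults)."""
    n = len(corpus)
    rows = [count_vector(doc) for doc in corpus]
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    result = []
    for row in rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


count_matrix = [count_vector(doc) for doc in docs]
tfidf_matrix = tfidf_vectors(docs)

for doc, counts, weights in zip(docs, count_matrix, tfidf_matrix):
    print(doc)
    print("  counts:", counts)
    print("  tf-idf:", [round(w, 3) for w in weights])
```

In this sketch, a word that appears in only one sentence ("sekali") receives a larger TF-IDF weight than a word shared by two sentences ("lambat"), whereas the count representation treats both the same. Either matrix could then be passed to a classifier such as logistic regression.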

