A Comparative Analysis of Clustering Methods on the 20 Newsgroups Dataset for Analytics
DOI:
https://doi.org/10.32628/CSEIT25112788
Keywords:
Large Language Model (LLM), Clustering, Embedding, Analytics, Natural Language Processing (NLP), Dynamic Classification, Unsupervised K-Means Clustering
Abstract
This paper presents a comparative analysis of two approaches for clustering textual data from the 20 Newsgroups dataset. The first approach uses a Large Language Model (LLM) to assign each text to one of a set of predefined categories via zero-shot classification. The second approach applies the traditional K-Means clustering algorithm to text embeddings. We evaluate both methods by comparing their predicted clusters against the true labels. For K-Means, we also explore a semi-supervised variant in which the centroids are initialized from the true labels.
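As a rough illustration of the semi-supervised variant mentioned in the abstract, the sketch below seeds each K-Means centroid with the mean embedding of one true class and then runs ordinary Lloyd iterations. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the helper names (`label_seeded_centroids`, `kmeans`, `accuracy`), the toy 2-D vectors standing in for text embeddings, and the use of plain accuracy as the comparison against true labels are all hypothetical choices made for the example.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_seeded_centroids(embeddings, labels):
    """Return (sorted class names, one starting centroid per class).

    Each starting centroid is the mean embedding of the items that
    carry that true label (assumed seeding rule for this sketch).
    """
    by_label = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        by_label[lab].append(vec)
    classes = sorted(by_label)
    return classes, [mean_vector(by_label[c]) for c in classes]


def kmeans(embeddings, centroids, max_iter=100):
    """Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every embedding.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(vec, centroids[k]))
            for vec in embeddings
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(embeddings, assignment) if a == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean_vector(members)
    return assignment, centroids


def accuracy(predicted, truth):
    """Fraction of items whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Toy 2-D "embeddings" (hypothetical values, not taken from the paper).
embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
true_labels = ["sci.space", "sci.space", "sci.space",
               "rec.autos", "rec.autos", "rec.autos"]

classes, seeds = label_seeded_centroids(embeddings, true_labels)
assignment, _ = kmeans(embeddings, seeds)
predicted_labels = [classes[k] for k in assignment]
print(accuracy(predicted_labels, true_labels))  # prints 1.0
```

Because every cluster starts from a labelled class mean, cluster index `k` corresponds directly to `classes[k]`, so predictions can be compared with the true labels without a separate cluster-to-label matching step. With random initialization that correspondence would not hold and a matching step or a permutation-invariant metric would be needed.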
License
Copyright (c) 2025 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.