A Comparative Analysis of Clustering Methods on the 20 Newsgroups Dataset for Analytics

Authors

  • Yanas Rajindran, Lead Engineer, AT&T, Dallas, United States
  • Hanza Parayil Salim, Staff Engineer, Neiman Marcus, Dallas, United States

DOI:

https://doi.org/10.32628/CSEIT25112788

Keywords:

Large Language Model (LLM), Clustering, Embedding, Analytics, Natural Language Processing (NLP), Dynamic Classification, Unsupervised K-Means Clustering

Abstract

This paper presents a comparative analysis of two approaches to clustering textual data from the 20 Newsgroups dataset. The first approach uses a Large Language Model (LLM) to assign each text to one of a set of predefined categories via zero-shot classification. The second approach applies the traditional K-Means clustering algorithm to text embeddings. We evaluate both methods by comparing their predicted clusters against the true labels. For K-Means, we also explore a semi-supervised variant in which the centroids are initialized based on the true labels.
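The semi-supervised K-Means variant mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for real text embeddings, the two newsgroup names are used only as example labels, and the helper names (`label_seeded_centroids`, `kmeans`) are hypothetical. It assumes each initial centroid is the mean embedding of one labeled class, which is one plausible reading of "centroid initialization based on true labels".

```python
# Toy sketch: K-Means with centroids seeded from true labels (assumed scheme).

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def label_seeded_centroids(embeddings, labels):
    """Return (classes, centroids): one initial centroid per class,
    computed as the mean embedding of the examples with that label."""
    classes = sorted(set(labels))
    centroids = [
        mean([e for e, l in zip(embeddings, labels) if l == c])
        for c in classes
    ]
    return classes, centroids

def kmeans(embeddings, init_centroids, max_iter=100):
    """Standard Lloyd iterations starting from the given centroids.
    Returns the cluster index assigned to each embedding."""
    centroids = [list(c) for c in init_centroids]  # do not mutate the input
    assignments = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(e, centroids[k]))
            for e in embeddings
        ]
        if new_assignments == assignments:
            break  # converged: assignments no longer change
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [e for e, a in zip(embeddings, assignments) if a == k]
            if members:
                centroids[k] = mean(members)
    return assignments

# Hypothetical 2-D "embeddings" for six documents from two newsgroups.
embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["sci.space"] * 3 + ["rec.autos"] * 3

classes, init = label_seeded_centroids(embeddings, labels)
assignments = kmeans(embeddings, init)

# Because cluster k was seeded from classes[k], cluster indices map
# directly to class names and can be compared against the true labels.
predicted = [classes[a] for a in assignments]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(predicted, accuracy)
```

Seeding from labels fixes the correspondence between cluster indices and categories, so predictions can be scored directly against the true labels. A fully unsupervised run would first need a cluster-to-label mapping. With a library such as scikit-learn, the same idea would amount to passing the class-mean array as the `init` argument of `KMeans`.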

Published

04-04-2025

Section

Research Articles