A Comparative Analysis of Clustering Methods on the 20 Newsgroups Dataset for Analytics

Yanas Rajindran; Hanza Parayil Salim

doi:10.32628/CSEIT25112788

Authors

Yanas Rajindran, Lead Engineer, AT&T, Dallas, United States
Hanza Parayil Salim, Staff Engineer, Neiman Marcus, Dallas, United States

DOI:

https://doi.org/10.32628/CSEIT25112788

Keywords:

Large Language Model (LLM), Clustering, Embedding, Analytics, Natural Language Processing (NLP), Dynamic Classification, Unsupervised K-Means Clustering

Abstract

This paper presents a comparative analysis of two approaches to clustering textual data from the 20 Newsgroups dataset. The first uses a Large Language Model (LLM) to assign each text to one of a set of predefined categories through zero-shot classification. The second applies the traditional K-Means clustering algorithm to text embeddings. We evaluate both methods by comparing their predicted clusters against the true labels. For K-Means, we also explore a semi-supervised variant in which centroids are initialized from the true labels.
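
The semi-supervised K-Means variant mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy vectors stand in for text embeddings, the class names "sci", "rec" and "talk" are hypothetical, and the helper names (label_seeded_centroids, kmeans, accuracy) are introduced here for illustration only. Each initial centroid is the mean of the vectors sharing a true label, so cluster k maps directly to a class and predictions can be scored against the true labels without a separate cluster-to-label matching step.

```python
import math
import random


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def label_seeded_centroids(X, y):
    """Return the sorted class list and one initial centroid per class
    (the mean of the vectors that carry that true label)."""
    classes = sorted(set(y))
    centroids = [mean([x for x, lab in zip(X, y) if lab == c]) for c in classes]
    return classes, centroids


def kmeans(X, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids."""
    centroids = [list(c) for c in centroids]
    assign = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assign = [
            min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))
            for x in X
        ]
        if new_assign == assign:  # assignments stable, so converged
            break
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = mean(members)
    return assign, centroids


def accuracy(pred, truth):
    """Fraction of predictions that equal the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Toy stand-in for embeddings: three well-separated Gaussian blobs (assumption).
random.seed(0)
blob_centers = {"sci": (0.0, 0.0), "rec": (10.0, 0.0), "talk": (0.0, 10.0)}
X, y = [], []
for name, (cx, cy) in blob_centers.items():
    for _ in range(50):
        X.append([random.gauss(cx, 1.0), random.gauss(cy, 1.0)])
        y.append(name)

classes, init = label_seeded_centroids(X, y)
assign, final_centroids = kmeans(X, init)
pred = [classes[k] for k in assign]  # cluster k was seeded from classes[k]
acc = accuracy(pred, y)
print(f"accuracy vs. true labels: {acc:.3f}")
```

With a purely unsupervised initialization the cluster indices are arbitrary, so a comparison against true labels would first require mapping clusters to classes; seeding from the labels sidesteps that step at the cost of using label information during clustering.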
