Efficient and Scalable Distributed Clustering for Distributed Data Mining: A Hybrid Approach

Hitesh Ninama

doi:10.32628/CSEIT1831275

Authors

Hitesh Ninama, School of Computer Science & Information Technology, DAVV, Indore, M.P., India

Keywords:

Distributed Clustering, Data Mining, K-Means, DBSCAN, Apache Spark, Scalability, Efficiency, Hybrid Clustering

Abstract

The exponential growth of data generated by modern applications calls for clustering algorithms that are both efficient and scalable in distributed settings. Traditional clustering methods are often unable to process large-scale datasets efficiently, and this gap motivates the present work. This paper proposes a hybrid approach that integrates K-Means and DBSCAN with an in-memory computing framework such as Apache Spark to achieve efficient and scalable distributed clustering. The methodology combines the strengths of partitioning (K-Means) and density-based (DBSCAN) methods to produce robust clusters in distributed environments. Experimental results on the Iris dataset, evaluated with several clustering metrics, indicate that the proposed approach outperforms traditional methods.
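The abstract does not spell out how K-Means and DBSCAN are combined, so the following is only a minimal sketch of one plausible scheme, not the paper's actual algorithm. It assumes that each data partition is summarised by local K-Means centroids, that DBSCAN then groups those centroids into global clusters, and that every point inherits the label of its nearest centroid. Plain Python lists stand in for Spark partitions; in Spark, the first step would correspond roughly to a per-partition operation such as `RDD.mapPartitions`. The function names (`kmeans`, `dbscan`, `hybrid_cluster`), the parameters (`local_k`, `eps`, `min_pts`), and the deterministic farthest-point seeding are all illustrative assumptions.

```python
import math


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as its group mean; keep the old one if empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


def hybrid_cluster(partitions, local_k, eps, min_pts):
    """Hypothetical hybrid: local k-means summaries merged by DBSCAN."""
    # Step 1: each partition (standing in for a worker) yields local centroids.
    centroids = [c for part in partitions for c in kmeans(part, local_k)]
    # Step 2: DBSCAN over the centroids merges nearby local clusters.
    centroid_labels = dbscan(centroids, eps, min_pts)

    # Step 3: every point takes the global label of its nearest centroid.
    def label(p):
        i = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        return centroid_labels[i]

    return [[label(p) for p in part] for part in partitions]
```

In this sketch only the centroids, not the raw points, would need to leave a worker, which is one way a hybrid of this kind could limit communication cost. Setting `min_pts=2` means a centroid must be matched by at least one other nearby centroid before it counts as part of a global cluster; isolated centroids are treated as noise.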

