Efficient Handling of High-Dimensional Data in Distributed Association Rule Mining

Authors

  • Hitesh Ninama, School of Computer Science & Information Technology, DAVV, Indore, M.P., India

Keywords

High-dimensional data, Distributed Association Rule Mining, Dimensionality reduction, Principal Component Analysis, FP-tree, Parallel processing, MapReduce, Apache Spark

Abstract

High-dimensional data poses significant challenges for Distributed Association Rule Mining (DARM), chiefly higher computational complexity and longer execution time. This paper proposes an integrated methodology that combines Principal Component Analysis (PCA) for dimensionality reduction, FP-tree construction, and parallel processing with frameworks such as MapReduce and Apache Spark. Experiments on synthetic datasets show that the proposed approach significantly reduces execution time and yields a simpler rule set while retaining meaningful patterns. These findings indicate that the methodology improves the scalability and efficiency of DARM.
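The pipeline summarized above can be sketched roughly as follows. This is a minimal, single-machine illustration under several assumptions that are not taken from the paper: items are ranked by the absolute value of their loading on the first principal component (approximated by power iteration), the FP-tree is replaced by plain counting of 1- and 2-itemsets, and MapReduce or Spark is imitated by a per-partition map step followed by a summing reduce step. All function names and parameters (`first_component`, `select_items`, `local_counts`, `frequent_itemsets`, `n_keep`, `n_partitions`, `min_support`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def first_component(rows, iterations=100):
    """Approximate the leading principal axis by power iteration
    on the scatter matrix of the mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    scatter = [[sum(c[i] * c[j] for c in centered) for j in range(d)]
               for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(scatter[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # constant data: no direction of variance
            break
        v = [x / norm for x in w]
    return v


def select_items(rows, n_keep):
    """Assumed reduction step: keep the n_keep items (columns) with the
    largest absolute loading on the first principal component."""
    loadings = first_component(rows)
    ranked = sorted(range(len(loadings)),
                    key=lambda j: abs(loadings[j]), reverse=True)
    return sorted(ranked[:n_keep])


def local_counts(partition, kept):
    """Map step: count single items and item pairs inside one partition."""
    counts = Counter()
    for row in partition:
        items = [j for j in kept if row[j]]
        counts.update((j,) for j in items)
        counts.update(combinations(items, 2))
    return counts


def frequent_itemsets(rows, kept, n_partitions, min_support):
    """Reduce step: sum the local counts and apply the global support threshold."""
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    total = Counter()
    for part in partitions:
        total.update(local_counts(part, kept))
    threshold = min_support * len(rows)
    return {itemset: c for itemset, c in total.items() if c >= threshold}


if __name__ == "__main__":
    # Tiny synthetic transaction matrix: rows are transactions, columns are items.
    data = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 1, 0, 0],
    ]
    kept = select_items(data, n_keep=3)
    print(kept)
    print(frequent_itemsets(data, kept, n_partitions=2, min_support=0.5))
```

Summing local counts before thresholding keeps the result independent of how the transactions are partitioned, which is why the map and reduce steps are kept separate here. A real deployment would swap the pair counting for an FP-tree per partition and run the map step on cluster workers.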

Published

2018-04-30

Section

Research Articles

How to Cite

[1]
Hitesh Ninama, "Efficient Handling of High-Dimensional Data in Distributed Association Rule Mining," International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 3, Issue 3, pp. 2178-2186, March-April 2018.