Hybrid Intelligent Similarity Measure for Effective Text Document Clustering Using Neural Network Algorithm

Authors

  • R. Preethi  R.M.K Engineering College, Kavaraipettai, Chennai, Tamil Nadu, India; Sathyabama University, Jeppiaar Nagar, Rajiv Gandhi Road, Chennai, Tamil Nadu, India
  • K. Selvi

Keywords:

World Wide Web, WordSim353, Clustering, Neural Network, Cyber terrorism investigation, Verb-Argument Structures, DIG, DIGBC, LSI, PCA, SVD, RFP, Text Mining, Document Clustering Similarity Measure, Artificial Intelligence

Abstract

Extensive use of the World Wide Web for information search through popular search engines has led many researchers to focus on text mining issues. Natural Language Processing requires effective methods to capture the actual requirements of the user during machine learning. Applying a genetic algorithm and a similarity measure to text mining during document clustering yields significant results on the WordSim353 data sets. Experiments show that applying an Echo State Neural Network and a Radial Basis Function to the training data set gives better clustering of text documents based on the stored weights, which helps avoid retrieving irrelevant documents.

Published

2017-06-30

Section

Research Articles

How to Cite

[1]
R. Preethi, K. Selvi, "Hybrid Intelligent Similarity Measure for Effective Text Document Clustering Using Neural Network Algorithm", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 2, Issue 3, pp. 833-839, May-June 2017.