Candidate Feature Extraction and Categorization for Unstructured Text Document

Authors

  • Prof. Prajakta P Shelke  Department of Computer Science and Engineering, Government College of Engineering, Amravati, India
  • Aditya A Pardeshi  Department of Computer Science and Engineering, Government College of Engineering, Amravati, India

DOI:

https://doi.org/10.32628/CSEIT20639

Keywords:

Key Phrase Mining, Candidate Feature Mining, Feature Selection, Feature Classification, Natural Language Processing.

Abstract

The words within a phrase carry crucial information that helps the feature extraction process. Established techniques have significant limitations in feature extraction and ignore the grammatical structure of phrases, so the features they extract are poor. To overcome this problem, a system is proposed that generates a parse tree for each input sentence and then cuts it down into sub-trees. The branches of the tree are extracted using part-of-speech (POS) labelling to obtain candidate phrases. Filtering is recommended to avoid redundant phrases. Finally, machine learning is used for feature categorization. The results illustrate the effectiveness of the approach.
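
The candidate-phrase steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a hand-written Penn-Treebank-style bracketed parse stands in for the output of a real parser, and the helper names (`parse_bracketed`, `candidate_phrases`, `filter_redundant`), the choice of lowest-level NP sub-trees, and the noun/adjective tag filter are all assumptions made for the example. The machine-learning categorization step is omitted.

```python
# Hypothetical sketch: parse tree -> sub-trees -> POS-based candidates -> filtering.
KEEP_TAGS = ("NN", "JJ")  # assumed filter: keep nouns and adjectives only


def parse_bracketed(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def subtrees(tree):
    """Yield the tree and every sub-tree below it, in pre-order."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def tagged_leaves(tree):
    """Return (word, POS tag) pairs for the words under a tree."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return [(word, label) for word in children]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
    return pairs


def candidate_phrases(tree):
    """Collect candidates from NP sub-trees that contain no nested NP."""
    phrases = []
    for st in subtrees(tree):
        if st[0] != "NP":
            continue
        if any(d[0] == "NP" for d in subtrees(st) if d is not st):
            continue
        words = [w for w, tag in tagged_leaves(st) if tag.startswith(KEEP_TAGS)]
        if words:
            phrases.append(" ".join(words).lower())
    return phrases


def filter_redundant(phrases):
    """Drop repeated phrases while preserving first-seen order."""
    seen = set()
    kept = []
    for phrase in phrases:
        if phrase not in seen:
            seen.add(phrase)
            kept.append(phrase)
    return kept


EXAMPLE = (
    "(S (NP (DT The) (JJ proposed) (NN system)) "
    "(VP (VBZ generates) (NP (NP (DT a) (NN parse) (NN tree)) "
    "(PP (IN for) (NP (DT the) (NN input) (NN sentence))))))"
)

candidates = filter_redundant(candidate_phrases(parse_bracketed(EXAMPLE)))
print(candidates)
```

In a full pipeline the bracketed string would come from a constituency parser, and the filtered candidates would be passed to a classifier for categorization.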

Published

2020-06-30

Section

Research Articles

How to Cite

Prof. Prajakta P Shelke, Aditya A Pardeshi, "Candidate Feature Extraction and Categorization for Unstructured Text Document," International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 6, Issue 3, pp. 81-87, May-June 2020. Available at doi: https://doi.org/10.32628/CSEIT20639