Web Content Extraction Using Hybrid Approach

Authors

  • Dhumal Tanuja  Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
  • Kumbhar Shital  Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
  • Malave Sumedha  Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
  • Salunkhe Shrutika  Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India

Keywords:

Website, Automatic Extraction, Handcrafted Rules, Hybrid Approach, Machine Learning.

Abstract

The World Wide Web is a rich source of voluminous and heterogeneous information that continues to expand in size and complexity. Many Web pages are unstructured or semi-structured and contain noisy information such as advertisements, links, headers, and footers. This noise makes extraction of Web content tedious. Extracting the main content from a Web page is a preprocessing step for Web information systems. Many techniques proposed for Web content extraction are based on either automatic extraction or handcrafted rule generation. A hybrid approach is proposed to extract the main content from Web pages. An HTML Web page is converted to a DOM tree, features are extracted from the tree, and rules are generated from the extracted features. Decision tree classification and Naive Bayes classification are the machine learning methods used for rule generation.
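
The pipeline described in the abstract (HTML page, DOM tree, per-block features, rules) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the `Node` and `DomBuilder` helpers, the choice of features (`text_length`, `link_density`), the set of block tags, and the thresholds in `is_main_content`. The rule shown is handcrafted; in the hybrid approach, a decision tree or Naive Bayes classifier trained on labeled blocks would presumably take its place.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch; the paper does not list its tags.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments directly inside this element
        if parent is not None:
            parent.children.append(self)


class DomBuilder(HTMLParser):
    """Builds a Node tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:  # void elements never have children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def all_text(node):
    """Text of a node and all of its descendants, joined by spaces."""
    parts = list(node.text) + [all_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def link_text(node):
    """Text that sits inside <a> elements at or below this node."""
    if node.tag == "a":
        return all_text(node)
    return " ".join(t for t in (link_text(c) for c in node.children) if t)


def features(node):
    """Illustrative feature vector for one block (assumed, not the paper's)."""
    n_chars = len(all_text(node))
    n_link_chars = len(link_text(node))
    return {
        "tag": node.tag,
        "text_length": n_chars,
        "link_density": n_link_chars / n_chars if n_chars else 0.0,
    }


def is_main_content(f, min_length=80, max_link_density=0.3):
    """Handcrafted rule with assumed thresholds: long text, few links."""
    return f["text_length"] >= min_length and f["link_density"] <= max_link_density


def has_block_descendant(node):
    return any(c.tag in BLOCK_TAGS or has_block_descendant(c) for c in node.children)


def extract_main_content(html):
    """Return the text of innermost blocks that the rule accepts."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    kept, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(reversed(node.children))  # keep document order
        if node.tag in BLOCK_TAGS and not has_block_descendant(node):
            if is_main_content(features(node)):
                kept.append(all_text(node))
    return kept
```

Calling `extract_main_content(page_html)` on a page whose navigation bar and footer consist mostly of short link text would drop those blocks and keep long, link-free paragraphs. To move from this handcrafted rule toward the learned variant, the dictionaries returned by `features` could serve as training rows for a classifier, with block labels supplied by hand.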

Published

2017-04-30

Section

Research Articles

How to Cite

[1]
Dhumal Tanuja, Kumbhar Shital, Malave Sumedha, Salunkhe Shrutika, "Web Content Extraction Using Hybrid Approach", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 2, Issue 2, pp. 1155-1159, March-April 2017.