Web Content Extraction Using Hybrid Approach
Keywords:
Website, Automatic Extraction, Handcrafted Rules, Hybrid Approach, Machine Learning.
Abstract
The World Wide Web is a rich source of voluminous and heterogeneous information that continues to expand in size and complexity. Many Web pages are unstructured or semi-structured, so they contain noisy information such as advertisements, links, headers, and footers. This noise makes extracting Web content tedious. Extracting the main content from a Web page is a preprocessing step for Web information systems. Many techniques proposed for Web content extraction are based on either automatic extraction or handcrafted rule generation. This paper proposes a hybrid approach to extracting the main content from Web pages. An HTML Web page is converted to a DOM tree, features are extracted from the tree, and rules are generated from the extracted features. Decision tree classification and Naive Bayes classification are the machine learning methods used for rule generation.
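The pipeline described in the abstract (HTML page to DOM tree, DOM tree to per-block features, features to rules) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the choice of block tags, the two features (text length and link density), and the hand-set thresholds in `is_main_content`, which merely stand in for a rule that a decision tree or Naive Bayes classifier would learn from labeled pages.

```python
from html.parser import HTMLParser

# Tags treated as candidate content blocks (an assumption for this sketch).
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks found directly inside this element


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def iter_nodes(node):
    """Yield every descendant of node in document order."""
    for child in node.children:
        yield child
        yield from iter_nodes(child)


def all_text(node):
    """Collect the text chunks of node and all of its descendants."""
    chunks = list(node.text)
    for child in node.children:
        chunks.extend(all_text(child))
    return chunks


def link_text(node, inside_link=False):
    """Collect only the text chunks that sit inside an <a> element."""
    inside = inside_link or node.tag == "a"
    chunks = list(node.text) if inside else []
    for child in node.children:
        chunks.extend(link_text(child, inside))
    return chunks


def features(node):
    """Compute the two illustrative features for one block."""
    n_chars = len(" ".join(all_text(node)))
    n_link_chars = len(" ".join(link_text(node)))
    density = n_link_chars / n_chars if n_chars else 0.0
    return {"tag": node.tag, "text_length": n_chars, "link_density": density}


def is_main_content(f):
    # Hand-set thresholds (hypothetical); a learned classifier would replace this.
    return f["text_length"] >= 80 and f["link_density"] < 0.3


def has_block_descendant(node):
    return any(c.tag in BLOCK_TAGS or has_block_descendant(c) for c in node.children)


def extract_main_content(html):
    """Return the text of innermost blocks that the rule accepts."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    kept = []
    for node in iter_nodes(builder.root):
        if node.tag in BLOCK_TAGS and not has_block_descendant(node):
            if is_main_content(features(node)):
                kept.append(" ".join(all_text(node)))
    return kept
```

Calling `extract_main_content(page_html)` on a page returns the text of the blocks that pass the rule. Link-heavy navigation bars and short footers tend to be rejected, while long paragraphs with few links tend to be kept. In a hybrid setup, the dictionaries returned by `features` would serve as training rows for the classifiers rather than being fed to a fixed threshold.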
License
Copyright (c) IJSRCSEIT

This work is licensed under a Creative Commons Attribution 4.0 International License.