Web Content Extraction Using Hybrid Approach

Dhumal Tanuja; Kumbhar Shital; Malave Sumedha; Salunkhe Shrutika

doi:10.32628/CSEIT1722178

Authors

Dhumal Tanuja, Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
Kumbhar Shital, Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
Malave Sumedha, Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India
Salunkhe Shrutika, Computer Engineering, Trinity Academy of Engineering, Pune, Maharashtra, India

Keywords:

Website, Automatic Extraction, Handcrafted Rules, Hybrid Approach, Machine Learning.

Abstract

The World Wide Web is a rich source of voluminous and heterogeneous information which continues to expand in size and complexity. Many Web pages are unstructured or semi-structured, so they contain noisy information such as advertisements, links, headers and footers. This noise makes extraction of Web content tedious. Extracting the main content from a Web page is a preprocessing step for Web information systems. Many techniques proposed for Web content extraction are based on either automatic extraction or handcrafted rule generation. A hybrid approach is proposed to extract main content from Web pages. An HTML Web page is converted to a DOM tree, features are extracted from it, and rules are generated from the extracted features. Decision tree classification and Naive Bayes classification are the machine learning methods used for rule generation.
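
The pipeline described in the abstract (HTML page, DOM-like structure, per-block features, rules) can be illustrated with a minimal sketch. It is not the authors' implementation. The feature set (text length and link density), the thresholds, and the names `BlockFeatureParser`, `extract_features` and `is_main_content` are all assumptions chosen for illustration, and a fixed handcrafted rule stands in for the rules that a decision tree or Naive Bayes classifier would learn from labelled pages.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockFeatureParser(HTMLParser):
    """Walk the HTML and record text per innermost open block element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open block records
        self.blocks = []      # closed block records, in closing order
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text": "", "link_chars": 0})

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        block = self.stack[-1]  # attribute text to the innermost block
        block["text"] += (" " if block["text"] else "") + text
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_features(html):
    """Return one feature dict per non-empty block."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    features = []
    for block in parser.blocks:
        length = len(block["text"])
        if length == 0:
            continue
        features.append({
            "tag": block["tag"],
            "text": block["text"],
            "text_len": length,
            "link_density": block["link_chars"] / length,
        })
    return features


def is_main_content(feature, min_len=40, max_link_density=0.3):
    """Hypothetical handcrafted rule: long text with few link characters."""
    return (feature["text_len"] >= min_len
            and feature["link_density"] <= max_link_density)


SAMPLE = """
<html><body>
<div><a href="/">Home</a> <a href="/about">About</a> <a href="/ads">Sponsored offers</a></div>
<p>Main content extraction removes navigation bars, advertisements and footers so that only the article text remains.</p>
<div>Copyright 2017</div>
</body></html>
"""

features = extract_features(SAMPLE)
main_blocks = [f["text"] for f in features if is_main_content(f)]
print(main_blocks)
```

In a hybrid setup along the lines of the abstract, feature dictionaries like these would be labelled and passed to a classifier, and the learned decision rules would take the place of the fixed thresholds in `is_main_content`.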

