A Survey on the Web Scraping : In the Search of Data

Authors

  • Prof. K. N. Aaglave  Department of Computer Engineering, S. B. Patil College of Engineering, Savitribai Phule Pune University
  • Shivanjali Santosh Jadhav  Department of Computer Engineering, S. B. Patil College of Engineering, Savitribai Phule Pune University
  • Amaan Firoj Khatib  Department of Computer Engineering, S. B. Patil College of Engineering, Savitribai Phule Pune University
  • Rohini Laxman Khurangale  Department of Computer Engineering, S. B. Patil College of Engineering, Savitribai Phule Pune University

Keywords:

Machine Learning, Artificial Intelligence, Web Scraping, Data Processing, Data Extraction, Data Analysis, Ethical Concerns, Web Scraping Tools, Web Scraping Framework, User Interface.

Abstract

Web scraping can use machine learning and AI technologies, and AI-based methods and tools are able to adapt themselves to the task of scraping data. Several challenges remain. Web data are often not in structured formats, which makes it difficult to extract relevant data from web pages. Web content needs to be classified in order to remove unwanted data. It is also difficult to find a suitable web scraping tool, so there is a need for a more flexible and extensible web scraping framework. Web scraping is a powerful tool for extracting data from the internet and has a wide range of applications across various industries. When done responsibly and legally, web scraping can provide valuable insights and data for businesses, researchers, and developers. This survey highlights Scrapy as a powerful web scraping tool that offers speed, extensibility, and efficient data extraction capabilities. The goal is to develop a more efficient and accurate way to extract and classify web content using AI and machine learning techniques.

Published

2023-10-30

Section

Research Articles

How to Cite

[1]
Prof. K. N. Aaglave, Shivanjali Santosh Jadhav, Amaan Firoj Khatib, Rohini Laxman Khurangale, "A Survey on the Web Scraping : In the Search of Data", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 9, Issue 10, pp. 191-195, September-October 2023.