Analysing and Predicting on Diseases using Data Pipeline in Hadoop

Authors

  • Arpna Joshi  Information Technology Department MAIT, Maharaja Agrasen Institute of Technology, New Delhi, India
  • Chirag Singla  Information Technology Department MAIT, Maharaja Agrasen Institute of Technology, New Delhi, India
  • Mr. Pankaj  Assistant Professor, Maharaja Agrasen Institute of Technology, New Delhi, India

DOI:

https://doi.org/10.32628/CSEIT1952362

Keywords:

Big data, Hadoop, Apache Kafka, Apache Spark, Cassandra.

Abstract

A data pipeline is a set of actions performed from the time data becomes available for ingestion until value is obtained from that data. These actions are Extraction (getting the fields of value from the dataset), Transformation, and Loading (putting the valuable data into a form that is useful for downstream use). In this big data project, we simulate a simple batch data pipeline. Our dataset of interest comes from https://www.githubarchive.org/, which records the health data of the US for the past 125 years. The objective of this Spark project is to create a small but real-world pipeline that downloads the data as it becomes available, applies the required transformations, and loads the results into forms of storage suited to further use. In this project, Apache Kafka is used for data ingestion, Apache Spark for data processing, and Cassandra for storing the processed results.
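The three stages named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the column names (`year`, `state`, `disease`, `cases`), the sample rows, and the function names are all hypothetical, and an in-memory dictionary stands in for the storage layer. In the project described above, ingestion, processing, and storage would be handled by Apache Kafka, Apache Spark, and Cassandra respectively.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input: column names and values are made up for illustration.
RAW_CSV = """year,state,disease,cases
2001,NY,measles,12
2001,NY,measles,3
2001,CA,mumps,7
2002,CA,mumps,
2002,TX,measles,5
"""


def extract(raw_text):
    """Extraction: parse the raw text and keep only the fields of interest."""
    reader = csv.DictReader(io.StringIO(raw_text))
    for row in reader:
        yield {"year": row["year"], "disease": row["disease"], "cases": row["cases"]}


def transform(records):
    """Transformation: skip rows with a missing count, then total cases per (year, disease)."""
    totals = defaultdict(int)
    for rec in records:
        if not rec["cases"]:
            continue  # missing count, nothing to add
        totals[(int(rec["year"]), rec["disease"])] += int(rec["cases"])
    return dict(totals)


def load(totals, store):
    """Loading: write the aggregated rows into a store (a dict standing in for a database table)."""
    for key, cases in totals.items():
        store[key] = cases
    return store


def run_pipeline(raw_text):
    """Run one batch through extract, transform, and load."""
    store = {}
    return load(transform(extract(raw_text)), store)


if __name__ == "__main__":
    for key, value in sorted(run_pipeline(RAW_CSV).items()):
        print(key, value)
```

Keeping each stage as a separate function mirrors the separation of concerns in the pipeline: any one stage could be swapped for a distributed component without changing the others.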

Published

2019-04-30

Section

Research Articles

How to Cite

[1]
Arpna Joshi, Chirag Singla, Mr. Pankaj, "Analysing and Predicting on Diseases using Data Pipeline in Hadoop", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 5, Issue 2, pp. 1288-1292, March-April 2019. Available at doi: https://doi.org/10.32628/CSEIT1952362