Next-Generation Data Pipeline Designs for Modern Analytics : A Comprehensive Review

Authors

  • Anupkumar Ghogare, Savitribai Phule Pune University, India

DOI:

https://doi.org/10.32628/CSEIT24106196

Keywords:

Hybrid Data Pipeline Architecture, Real-time Stream Processing, Data Quality Observability, Delta Lake Technology, Serverless Data Processing

Abstract

This review examines the evolution of data pipeline architectures in modern analytics, focusing on the integration of real-time and batch processing to meet contemporary data processing demands. It investigates how frameworks such as Apache Spark and Databricks, together with technologies such as Delta Lake, are reshaping traditional data processing to accommodate growing data volume and complexity. Through a detailed analysis of hybrid pipeline architectures, data quality mechanisms, and observability practices, the article demonstrates the role of next-generation pipeline designs in enabling organizations to build scalable, reliable, and maintainable data infrastructures. It explores the implementation of ACID-compliant data lake technologies, automated monitoring systems, and quality assurance methods that together ensure data integrity and processing efficiency. Key findings highlight the significance of emerging technologies, including edge computing and serverless architectures, in shaping future data pipeline designs. The article also identifies architectural patterns, best practices, and future trends that organizations can use to optimize their data processing capabilities and maintain a competitive advantage in an increasingly data-driven business landscape.
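To make the idea of a hybrid pipeline concrete, the following is a minimal, framework-free sketch and is not taken from the article. It assumes a toy `Event` record and hypothetical helpers (`is_valid`, `apply_event`, `run_batch`, `run_stream`) to show one quality gate and one piece of business logic shared by a batch path and a streaming path. A production design would typically use an engine such as Apache Spark with a transactional table format instead of an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical input record, used only for illustration."""
    user_id: str
    amount: float


def is_valid(event: Event) -> bool:
    """Quality gate: reject records with an empty key or a negative amount."""
    return bool(event.user_id) and event.amount >= 0


def apply_event(totals: Dict[str, float], event: Event) -> bool:
    """Shared logic for both paths: add a valid event to the running totals.

    Returns True if the event was applied, False if it was rejected.
    """
    if not is_valid(event):
        return False
    totals[event.user_id] = totals.get(event.user_id, 0.0) + event.amount
    return True


def run_batch(events: Iterable[Event]) -> Dict[str, float]:
    """Batch path: process a bounded set of historical events in one pass."""
    totals: Dict[str, float] = {}
    for event in events:
        apply_event(totals, event)
    return totals


def run_stream(
    events: Iterable[Event], totals: Dict[str, float]
) -> Iterator[Tuple[str, float]]:
    """Streaming path: update the same state per event and emit each new total."""
    for event in events:
        if apply_event(totals, event):
            yield event.user_id, totals[event.user_id]


history = [Event("a", 10.0), Event("", 5.0), Event("b", -1.0), Event("a", 2.5)]
state = run_batch(history)  # the two invalid records are dropped by the gate
live_updates = list(run_stream([Event("a", 1.0), Event("b", 3.0)], state))
```

Because both paths call the same `apply_event` function, a change to the validation rule or the aggregation applies to historical and live data alike, which is the property a hybrid design is meant to preserve.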

Published

13-11-2024

Section

Research Articles

How to Cite

[1]
Anupkumar Ghogare, “Next-Generation Data Pipeline Designs for Modern Analytics : A Comprehensive Review”, Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol, vol. 10, no. 6, pp. 548–554, Nov. 2024, doi: 10.32628/CSEIT24106196.
