Wide Area Analytics : Efficient Analytics For A Geo Distributed Datacenters

Authors

  • Alapati Janardhana Rao, MCA Department, Vignan's Lara Institute of Technology and Science, Vadlamudi, Guntur, Andhra Pradesh, India
  • Bellamkonda Naresh, MCA Department, Vignan's Lara Institute of Technology and Science, Vadlamudi, Guntur, Andhra Pradesh, India

Keywords:

Big Data, Analytics, Geo-Distributed Datacenters

Abstract

Large organizations today operate data centers around the globe, where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed and mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is neither supported by currently deployed systems nor sufficiently studied in the literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that threaten the very viability of centralized solutions. To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers a 257x reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.

Published

2018-04-30

Section

Research Articles

How to Cite

[1]
Alapati Janardhana Rao, Bellamkonda Naresh, "Wide Area Analytics : Efficient Analytics For A Geo Distributed Datacenters", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 4, Issue 2, pp. 08-19, March-April 2018.