Various Platforms and Machine Learning Techniques for Big Data Analytics: A Technological Survey

Shahid Mohammad Ganie; Majid Bashir Malik; Tasleem Arif

doi:10.32628/CSEIT217242

Authors

Shahid Mohammad Ganie, Department of Computer Sciences, BGSB University, Rajouri, J&K, India
Majid Bashir Malik, Department of Computer Sciences, BGSB University, Rajouri, J&K, India
Tasleem Arif, Department of Information Technology, BGSB University, Rajouri, J&K, India

Keywords:

Big data, Big data platforms, Hadoop, Spark, HPC, GPU, Machine learning

Abstract

Data is growing rapidly every day, and storing, analysing and interpreting it has become a difficult task. Big data is a term that describes large volumes of high-velocity, complex and variable data that cannot be stored and processed using traditional approaches. Big data analytics requires advanced tools and techniques to capture, store, distribute, manage and analyse the data. Because of the complexity and heterogeneity of big data, various data mining and machine learning techniques are being used for big data analytics in order to develop better expert systems for real-world problems. In this paper, we survey the state of the art in platforms (software as well as hardware) for big data analytics, such as the Hadoop ecosystem, Spark, high-performance clusters (HPC) and graphics processing units (GPUs), which are used together to collect, store, process and analyse big data. The paper also highlights some machine learning techniques that must be taken into account when dealing with the big data lifecycle.

