Survey and Performance Analysis of Machine Learning Based Security Threats Detection Approaches in Cloud Computing

Cloud computing is gaining a lot of attention; however, security remains a major obstacle to its widespread adoption. Users of cloud services are concerned about data loss, security threats, and availability problems. Recently, machine learning-based threat detection methods have been gaining popularity in the literature. Therefore, the study and analysis of threat detection and prevention strategies are a necessity for cloud protection. Threat detection makes it possible to identify and report both normal and inappropriate user activities, so there is a need to develop an effective threat detection system using machine learning techniques in the cloud computing environment. In this paper, we present a survey and a comparative analysis of the effectiveness of machine learning-based methods for detecting threats in a cloud computing environment. The performance of these methods is assessed through experiments on the UNSW-NB15 dataset. In this work, we analyse machine learning models that include Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), and K-Nearest Neighbours (KNN). Additionally, we use the most important performance indicators, namely accuracy, precision, recall, and F1 score, to test the effectiveness of these methods.

Instead of buying and maintaining their own infrastructure, users simply access or rent hardware or software and pay only for what they use. This pay-as-you-go model, widely offered by cloud hosting providers, is gaining popularity as a business-computing model [1].
Although cloud computing is seen as a significant infrastructure change, more security work is still needed to reduce its failures. Since a significant amount of personal and corporate information is stored in cloud data centres, security and vulnerability issues need to be identified and prevented. Because cloud infrastructure uses standard Internet protocols and virtualisation techniques, it may be vulnerable to attack. Such attacks can come from traditional sources such as Address Resolution Protocol (ARP) spoofing, IP spoofing, and Denial of Service (DoS) [2], [3].
The traditional techniques used for detection and prevention do not manage such attacks well enough, particularly when processing large data flows.
Machine learning (ML) techniques are helpful in detecting attacks. Several solutions based on machine learning have been suggested to detect cloud attacks.
A major challenge in machine learning-based solutions is to detect these attacks with high accuracy.
The main purpose of this paper is to provide a comparative study and performance analysis of various machine learning techniques in cloud computing. We analyse the following machine learning techniques: Random Forest (RF), K-Nearest Neighbours (KNN), Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM). For the analysis, UNSW-NB15 [4], [5] is used as the dataset and Python is used as the programming language.
The rest of the paper is structured as follows: Section II presents a literature review of the latest techniques used to detect threats. Section III discusses machine learning methods, and Section IV discusses the dataset, implementation, and test results. Finally, the conclusion of the study is provided in Section V.
Osanaiye et al. [19] proposed an ensemble-based multi-filter feature selection method. This method achieves a good feature selection by combining the output of four filter methods. The proposed method targets cloud computing environments and is used to detect DDoS attacks; it was evaluated through extensive experiments on an intrusion detection dataset.
Mobilio et al. [9] introduced cloud-based anomaly detection as a service, which applies the as-a-service paradigm to anomaly detection in cloud systems. Their first results with lightweight detectors show a promising solution for better controlling anomaly detection. They also recommend building a framework that supports the as-a-service paradigm and can work in conjunction with any monitoring system that stores data as time series; preliminary testing of the as-a-service approach showed encouraging results.
Fernandez and Xu [24] presented a case study using a deep neural network (DNN) to detect threats. The authors reported excellent results in detecting network threats. They also showed that using only the first three octets of IP addresses can be effective in handling dynamically assigned (DHCP) IP addresses. This approach showed that autoencoders can be used to detect anomalies when trained on the expected traffic flow.

Naïve Bayes algorithm
The Naïve Bayes algorithm is used to perform classification, which is based on the Bayes theorem.
This algorithm works on the assumption that all input attributes are conditionally independent given the class.
The steps of the Naïve Bayes algorithm are as follows: • Step 1: Given a training set S, calculate the prior probability p(vj) of each class vj.
• Step 2: Given a training set S, for each value ai of each attribute a, calculate the conditional probability p(ai|vj).
• Step 3: Given an unknown instance X', classify X' by choosing the class with the highest posterior probability.
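The three steps above can be sketched in pure Python (the paper's implementation language). This is a minimal illustration on toy categorical data; the (protocol, size) features and labels are invented for the example, not drawn from UNSW-NB15:

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Steps 1 and 2: estimate class priors p(vj) and conditionals p(ai|vj) by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    # cond[class][attribute_index][attribute_value] -> count
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, v in zip(samples, labels):
        for i, a in enumerate(x):
            cond[v][i][a] += 1
    def p_ai_given_vj(i, a, v):
        return cond[v][i][a] / class_counts[v]
    return priors, p_ai_given_vj

def classify(x, priors, p_ai_given_vj):
    """Step 3: pick the class with the highest posterior score (prior times conditionals)."""
    scores = {}
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(x):
            score *= p_ai_given_vj(i, a, v)
        scores[v] = score
    return max(scores, key=scores.get)

# Toy traffic-like data: features are (protocol, packet size category).
X = [("tcp", "small"), ("tcp", "large"), ("udp", "small"), ("udp", "large")]
y = ["normal", "attack", "normal", "attack"]
priors, cond_prob = train_naive_bayes(X, y)
print(classify(("tcp", "large"), priors, cond_prob))  # prints "attack"
```

A production implementation would typically add Laplace smoothing to avoid zero probabilities for unseen attribute values.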

Decision Tree algorithm
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
The working steps of the Decision Tree algorithm are given below: • Step 1: Place the best attribute from the dataset at the root of the tree, chosen using a mathematical measure such as information gain.
• Step 2: Divide the training dataset into subsets such that each subset contains data with the same value for the chosen attribute.
• Step 3: Repeat Steps 1 and 2 on each subset until leaf nodes are reached in all branches of the tree.
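Step 1's attribute selection can be sketched in pure Python using entropy-based information gain. The toy rows below are an illustrative assumption (attribute 0 perfectly separates the classes, attribute 1 is noise):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on attribute index `attr`."""
    gain = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def best_attribute(rows, labels):
    """Step 1: pick the attribute with the highest information gain for the current node."""
    return max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))

rows = [("tcp", "x"), ("tcp", "y"), ("udp", "x"), ("udp", "y")]
labels = ["attack", "attack", "normal", "normal"]
print(best_attribute(rows, labels))  # prints 0: protocol fully determines the class here
```

Steps 2 and 3 would then partition the rows by the chosen attribute's values and recurse on each subset until the labels in a subset are pure.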

Random forests algorithm
Random forests are an ensemble learning method for classification or regression that constructs multiple decision trees, each built from "K" data points sampled from the dataset, and then merges their outputs to get a more accurate and stable prediction. Each tree produces its own prediction; the forest averages these predictions (for regression) or takes a majority vote (for classification).
The steps for the Random Forest algorithm are as follows: • Step 1: Randomly select "i" features from the entire set of "j" features, with the condition i << j.
• Step 2: Using the best split point, calculate the node "n" from the "i" features.
• Step 3: Again using the best split, split node "n" into daughter nodes.
• Step 4: Repeat Steps 1-3 until "l" nodes have been reached. • Step 5: Build the forest by repeating Steps 1-4 "k" times to create "k" trees.
• Step 6: To predict the target, take the test features, apply the rules of each randomly created decision tree, and store the predicted targets.

• Step 7: Finally, count the votes for each predicted target and take the most voted prediction as the final result.

K-Nearest Neighbours algorithm
The K-Nearest Neighbours (KNN) algorithm classifies a query instance based on the classes of its K closest training samples. The steps of the K-Nearest Neighbours algorithm are given below: • Step 1: Decide the value of K.
• Step 2: Calculate the distance between the query instance and all the training samples.
• Step 3: Sort the distances in ascending order and determine the nearest neighbours based on the K-th minimum distance.
• Step 4: Based on the majority of the class of nearest neighbours, assign the prediction value of the query instance.
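The four KNN steps above can be sketched in pure Python with Euclidean distance and a majority vote. The two-feature training points and labels are illustrative assumptions, not UNSW-NB15 records:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(query, X_train, y_train, k=3):
    # Step 2: compute the distance from the query instance to every training sample.
    distances = [(dist(query, x), label) for x, label in zip(X_train, y_train)]
    # Step 3: sort ascending and keep the K nearest neighbours.
    neighbours = sorted(distances, key=lambda t: t[0])[:k]
    # Step 4: assign the majority class among the neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

X_train = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (1.1, 1.1)]
y_train = ["normal", "normal", "attack", "attack", "attack"]
print(knn_predict((1.0, 1.0), X_train, y_train, k=3))   # prints "attack"
print(knn_predict((0.05, 0.1), X_train, y_train, k=3))  # prints "normal"
```

In practice features should be scaled before computing distances, since KNN is sensitive to attributes with large numeric ranges.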

Support Vector Machine (SVM)
The SVM classifier is used for classification and regression. In SVM, the data is split by a hyperplane, which is used to determine the class of a data point. The distance from the decision boundary to the nearest data point is called the margin, and the data points that lie closest to the classification boundary are called support vectors. When working with SVM, we assume two things: 1) the margin should be as large as possible, and 2) the support vectors are the most useful data points because they are the ones most likely to be incorrectly classified.
The working steps for SVM are as follows: • Step 1: Define the optimal hyperplane: maximise the margin. • Step 2: Extend the definition in Step 1 to non-linearly separable problems: add a penalty term for misclassifications.
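These two steps can be sketched as a linear SVM trained by sub-gradient descent on the regularised hinge loss; this is a simplified pure-Python illustration (the toy clusters, learning rate, and regularisation constant are assumptions), not the solver used in the experiments:

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Primal linear SVM via sub-gradient descent.
    The lam * ||w||^2 term keeps the margin large (Step 1);
    the hinge term penalises misclassified/in-margin points (Step 2)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    data = list(zip(X, y))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, label in data:  # labels must be +1 or -1
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified: hinge sub-gradient
                w = [wi - lr * (2 * lam * wi - label * xi) for wi, xi in zip(w, x)]
                b += lr * label
            else:           # correctly classified with margin: only shrink w
                w = [wi - lr * 2 * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: class +1 clusters around (1, 1), class -1 around (-1, -1).
X = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (-1.0, -1.1), (-0.9, -1.0), (-1.2, -0.8)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(svm_predict(w, b, (1.0, 1.0)), svm_predict(w, b, (-1.0, -1.0)))
```

Non-linear problems are usually handled with kernel functions rather than this primal formulation.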

Security threat detection methodology used in experimentation
The details of the threat detection methodology used in experimentation are illustrated in Fig. 1.
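The evaluation reports accuracy, precision, recall, and F1 score. A minimal pure-Python sketch of how these four indicators are derived from a model's binary predictions (treating "attack" as the positive class; the sample labels below are illustrative only):

```python
def evaluation_metrics(y_true, y_pred, positive="attack"):
    """Compute accuracy, precision, recall, and F1 from true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["attack", "attack", "normal", "normal", "attack"]
y_pred = ["attack", "normal", "normal", "attack", "attack"]
acc, prec, rec, f1 = evaluation_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.6  0.666...  0.666...  0.666...
```

The same computation is applied to each classifier's predictions on the UNSW-NB15 test split to produce the comparison in Section IV.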