Predicting Cervical Cancer Cases Resulting in Biopsies Using Machine Learning Techniques

There are various algorithms and methodologies used for automated screening of cervical cancer by segmenting and classifying cervical cancer cells into different categories. This study presents a critical review of published research papers that integrated ML methods into cervical cancer screening via different approaches, analyzed in terms of typical metrics such as dataset size, drawbacks, and accuracy. An attempt has been made to furnish the reader with an insight into machine learning algorithms such as SVM (Support Vector Machine), k-NN (k-Nearest Neighbors), and Random Forest for feature extraction and classification. This paper also covers the publicly available datasets related to cervical cancer and presents a holistic review of the computational methods that have evolved over time for the detection of malignant cells. We train models using various machine learning techniques and compare all the resulting models in terms of accuracy, precision, and recall.

One of the reviewed works states that in a cervigram, the cervix region occupies about half of the raw image, while the other parts contain inconsequential information; this irrelevant information can muddle automatic identification of the tissues within the cervix. Asselin et al. [5] discuss the imaging methods available to provide appropriate biomarkers of tumor structure and function using selective regions of interest (ROI), cluster analysis, and histogram analysis. Torheim et al. [6] present texture analysis methods and classification using SVM to distinguish cured and relapsed images. S. Jagadeeswari and S. Malarkhodi [7] presented classification using an Artificial Neural Network to identify normal and abnormal tumor images with a Fourier transform and a Gaussian low-pass filter. Rupinderpal and Rajneet presented a noise removal method using the discrete wavelet transform.
Here, we have tried to detect cancer using ML techniques along with ensemble methods.

II. Literature Review

Supervised Learning Techniques:
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
The supervised machine learning techniques used here are Logistic Regression, Random Forest, and SVM.
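Before any of the classifiers below can be trained, the data must be split into training and testing sets. The paper's own dataset is not reproduced here, so the following sketch uses a synthetic binary-classification dataset (label 1 standing in for a positive biopsy) purely as an illustrative assumption:

```python
# Minimal sketch of the train/test setup, assuming a synthetic stand-in
# dataset in place of the paper's cervical cancer data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 500 synthetic patients, 10 numeric features, binary target (0 = negative, 1 = positive)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% for testing; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)  # (350, 10) (150, 10)
```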

Random Forest:
Random Forest is an ensemble algorithm that builds many decision trees (a forest), each trained on a different subset of the dataset, producing multiple classification results. The Random Forest classifier makes its final prediction through a voting system: each tree votes, and the class with the most votes is chosen. An alternative is weighted voting, where each tree's influence is set by its error rate, with high-error trees receiving low weights and vice versa. In this scheme, trees with low error rates have a higher impact on the final classification decision.
Each tree's training subset is drawn by sampling with replacement, which leaves approximately one-third of the data out of the sample; this remainder can be used for testing the classifier. Before applying the classifier to the data, you must determine how many trees the forest should contain and the minimum number of samples a node requires in order to split. Advantages: 1. It works well with noisy data and reduces overfitting: since the end result is an average or majority vote of multiple classification results, the classifier has a significantly lower chance of overfitting the data.
2. Since there are multiple trees, not every tree is necessarily affected by noisy data. • Tree ensembles work well with high-dimensional data, mitigating the curse of dimensionality, and still work in cases where there are more dimensions than samples. Disadvantage: • They are harder to analyze than a single tree and do not directly give out a calibrated probability score.
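The two forest parameters named above (number of trees, minimum samples for a split) and the roughly one-third held-out sample can be sketched as follows; the dataset is again a synthetic stand-in, and `oob_score=True` is scikit-learn's built-in way of scoring on the out-of-bag samples:

```python
# Sketch of a Random Forest with the parameters discussed in the text,
# assuming a synthetic dataset in place of the real one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # how many trees the forest contains
    min_samples_split=2,   # minimum samples a node needs before it may split
    oob_score=True,        # score on the ~1/3 out-of-bag samples of each tree
    random_state=0)
rf.fit(X_tr, y_tr)

print("OOB accuracy:", round(rf.oob_score_, 3))
print("Test accuracy:", round(rf.score(X_te, y_te), 3))
```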

Logistic Regression:
Logistic Regression is a supervised learning technique employed when the output of the training data takes the form of discrete groups called classes.
In our model, the output is 0 if the patient is negative for cancer and 1 if she is diagnosed positive. Advantages of Logistic Regression: • Logistic Regression performs well when the dataset is linearly separable.
• Logistic regression is less prone to over-fitting but it can overfit in high dimensional datasets. You should consider Regularization (L1 and L2) techniques to avoid over-fitting in these scenarios.

Disadvantages:
• The main limitation of Logistic Regression is its assumption of a linear relationship between the log-odds of the dependent variable and the independent variables. In the real world, data is rarely linearly separable; most of the time it is a jumbled mess.
• Logistic Regression can only be used to predict discrete outcomes, so its dependent variable is restricted to a discrete set of classes. This restriction itself is problematic, as it prohibits the prediction of continuous values.
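The L1/L2 regularization mentioned above can be sketched in a few lines; the dataset is a synthetic stand-in, and the `liblinear` solver is chosen only because it supports both penalties:

```python
# Sketch comparing L2 and L1 regularized Logistic Regression,
# assuming a synthetic high-dimensional dataset as a stand-in.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for penalty in ("l2", "l1"):
    # C is the inverse regularization strength; smaller C = stronger penalty
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
    clf.fit(X_tr, y_tr)
    print(penalty, "accuracy:", round(clf.score(X_te, y_te), 3))
```

The L1 penalty additionally drives some coefficients to exactly zero, which acts as a rough feature selector in high-dimensional data.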
The unsupervised technique used is K-means clustering.

K-Means Clustering:
It is an unsupervised approach that groups data into clusters of similar points. Each data point is assigned to the nearest of k randomly initialized centroids (using Euclidean distance). The number of clusters k must be defined at the beginning. After the initial centroids are selected, the distances are computed, each point is assigned to its nearest centroid, and the centroids are recomputed; this repeats until the centroids stop moving.
K-means performs well and makes the data easy to visualize and understand. However, k-means does not perform well when the clusters have different sizes or densities, and it has problems when the data contains outliers. Advantages: • Relatively simple to implement.
• Scales to large data sets.
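The assign-then-recompute loop described above can be sketched with scikit-learn on two artificial, well-separated blobs (an assumed toy dataset, not the paper's data); `n_clusters` is the k that must be fixed up front:

```python
# Sketch of K-means on two synthetic, well-separated blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 50 points around (0, 0) and 50 points around (5, 5)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# k = 2 chosen up front; n_init restarts from several random centroid sets
# and keeps the best run, since the result depends on the initial centroids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# each blob should end up entirely in one cluster
print(set(labels[:50].tolist()), set(labels[50:].tolist()))
```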

SVM:
The SVM kernel is set to "rbf" with C=100. Once all the models are trained and tested, the decision tree model has the highest accuracy of all the models.
As the decision trees used here avoid overfitting, we do not see that problem in our model. Parameter tuning: • The logistic regression model set with the 'l2' penalty has higher accuracy than with 'l1'. • The decision tree and random forest have the same accuracy because the parameters of both models are set the same:
• it makes no difference here whether we take one tree or a forest.
• When the number of jobs is increased, the accuracy of the random forest goes down, so the optimum value is set to 10.
• The accuracy of the perceptron is unaffected by changes to the learning rate.
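The SVM configuration named in the text (RBF kernel, C=100) and the decision tree it is compared against can be sketched side by side; the dataset is a synthetic stand-in, so the accuracies printed here do not reproduce the paper's results:

```python
# Sketch comparing the text's SVM settings (rbf kernel, C=100)
# against a decision tree, on an assumed synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "svm_rbf_C100": SVC(kernel="rbf", C=100),          # settings from the text
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))
```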

V. CONCLUSION
In conjunction with more accurate diagnostics, AI has the potential to bring down the cost of unwanted interventions in cervical cancer screening. Early detection promises a better prognosis for patients, especially in the case of non-invasive cancer.
The papers discussed above used independent data sources; consequently, a common basis for comparing the algorithms on a single scale was hard to define.