K-means clustering and its real use cases in the security domain

Dharmika D
3 min read · Aug 12, 2021


K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

To understand it, we first need to know what unsupervised learning and clustering are.

Unsupervised learning is a machine learning technique in which the training data has no labels. The algorithm tries to learn the underlying patterns or distributions that govern the data on its own.

Clustering: A cluster is a collection of data points grouped together because of certain similarities. Clustering is the process of dividing the entire dataset into groups (also known as clusters) based on the patterns in the data.

Now let’s see how this K-means algorithm works

K-means is an algorithm that identifies K centroids and then allocates every data point to the nearest centroid, while keeping each cluster as compact as possible, that is, keeping the distances between the points and their centroid small.

Here “K” refers to the number of centroids (and therefore clusters) we want in the dataset. A centroid is the imaginary or real location representing the center of a cluster.
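To make "center of a cluster" and "compact" concrete, here is a minimal sketch. It assumes the centroid is the arithmetic mean of the points and that compactness is measured by the sum of squared Euclidean distances; the sample points and the function names `centroid` and `within_cluster_ss` are made up for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_ss(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, center))
        for p in points
    )

# Made-up sample cluster with two features per point
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]

center = centroid(cluster)
print(center)                              # (2.0, 3.0)
print(within_cluster_ss(cluster, center))  # 4.0
```

The smaller the second number, the tighter the points sit around their centroid.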

Working flow of K-means

The goal of the K-means algorithm is to find clusters in the given input data, and for that we first have to choose K. There are a couple of ways to do this. We can use trial and error: specify a value of K (e.g., 3, 4, 5) and keep changing it until we get the best clusters.

Another method is the Elbow technique: run K-means for a range of K values, plot the within-cluster sum of squared distances against K, and pick the K at the bend of the curve, where adding more clusters brings little further improvement.

Once we have the value of K, the system places that many centroids randomly and measures the distance of each data point from these centroids. Each data point is assigned to the centroid closest to it, which gives us K initial clusters. Each centroid is then recomputed as the mean of the points assigned to it, and the assignment and update steps repeat until the assignments stop changing.
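The flow described here can be sketched in plain Python. This is only a minimal sketch, not a production implementation: the sample points, the function names (`kmeans`, `sq_dist`, `mean`), the fixed seeds, and the use of squared Euclidean distance are all assumptions for illustration. After the first assignment it recomputes each centroid as the mean of its points and repeats until nothing changes, then prints the within-cluster sum of squares for several K values, which is the quantity usually plotted for the Elbow technique.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Pick k distinct data points as the random initial centroids
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = updated
    # Within-cluster sum of squares for the final centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, clusters, inertia

# Made-up data: three loose groups of 2-D points
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.0),
]

# Try several K values, keeping the best of a few random starts for each
inertias = []
for k in range(1, 6):
    best = min(kmeans(points, k, seed=s)[2] for s in range(10))
    inertias.append(best)
    print(k, round(best, 2))
```

On data like this, the printed values should fall steeply at first and then level off; the K where the drop flattens is the "elbow". Running a few random starts per K is a common precaution, because a single random initialization can settle on a poor grouping.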

Real use cases in the security domain

Cyber profiling:

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene.

With the K-means clustering algorithm, the data can be grouped by the number of websites visited. This grouping aims to show which websites the user accesses frequently.
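As a rough illustration of grouping by visit counts, here is a small one-dimensional sketch. Everything in it is an assumption: the site names and visit counts are invented, `kmeans_1d` is a hypothetical helper, K is fixed at 3, and the starting centroids are simply the minimum, median, and maximum count so that the result is repeatable.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """K-means on plain numbers, starting from the given centroids."""
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Recompute each centroid as the mean of its group
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups

# Invented visit counts for one user (site name -> number of visits)
visits = {
    "news.example": 120, "mail.example": 95,
    "shop.example": 40, "forum.example": 35,
    "bank.example": 5, "blog.example": 3, "wiki.example": 2,
}

counts = sorted(visits.values())
start = [counts[0], counts[len(counts) // 2], counts[-1]]  # min, median, max
centroids, groups = kmeans_1d(list(visits.values()), start)

# Centroids stay in ascending order here, so they map to low/medium/high
names = ["low", "medium", "high"]
labels = {}
for site, n in visits.items():
    nearest = min(range(len(centroids)), key=lambda i: abs(n - centroids[i]))
    labels[site] = names[nearest]
    print(site, n, labels[site])
```

The sites labelled "high" are the ones this user accesses most often, which is the kind of pattern a profiling analysis would look at.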

Insurance fraud detection:

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
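The "proximity to clusters" idea can be sketched as a nearest-centroid check. This is a simplified sketch under stated assumptions: the two centroids, their labels, the two features (scaled to the 0 to 1 range), and the claim IDs are all hypothetical, standing in for centroids that would come from clustering real historical claims.

```python
import math

# Hypothetical centroids learned from past claims.
# Features (both scaled to 0..1): claim amount, policy age at claim time.
centroids = {
    "typical": (0.20, 0.70),
    "fraud_like": (0.90, 0.10),
}

def nearest_cluster(claim):
    """Return the name of the centroid closest to the claim."""
    return min(centroids, key=lambda name: math.dist(claim, centroids[name]))

# Hypothetical new claims to screen
new_claims = {
    "claim_A": (0.25, 0.65),
    "claim_B": (0.85, 0.15),
}

for claim_id, features in new_claims.items():
    cluster = nearest_cluster(features)
    action = "send for review" if cluster == "fraud_like" else "process normally"
    print(claim_id, cluster, action)
```

A claim that lands near the fraud-like cluster is not proof of fraud; in a setup like this it would only be a signal to route the claim to a human investigator.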

THANK YOU FOR READING 😊
