K-Means Clustering

As a marketer, you often face the dilemma of where to invest your marketing dollars, and specifically which customers to invest in. Do you invest in your high-value customers or your high-growth customers? How do you recognize “growth” customers? How do you recognize “diminishing” customers? Each of these questions warrants an extensive discussion, which I will cover in future blogs.

But we can all agree that customer segmentation is a good investment framework. Several advanced analytics techniques can be used for customer segmentation, the most popular of which is the k-means clustering algorithm. In this blog, I’ll explain what k-means clustering is and how to create a k-means cluster segmentation using R.


What is k-means clustering and how does it work?

The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids. Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the “cluster assignment” step.

After the assignment step, the algorithm computes the new mean value of each cluster. This step is called the “centroid update” step. Now that the centers have been recalculated, every observation is checked again to see whether it is closer to a different cluster, and all the objects are reassigned using the updated cluster means.

The cluster assignment and centroid update steps are repeated until the cluster assignments stop changing, that is, until the clusters formed in the current iteration are the same as those obtained in the previous iteration. Now you have customer segments that are homogeneous within each segment and heterogeneous across segments.
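The two steps above can be sketched in a few lines of base R. This is only an illustration of the assignment and update loop, not the implementation behind R's built-in kmeans(); the function name simple_kmeans and the synthetic demo data are assumptions made for the example.

```r
# Minimal k-means sketch (illustrative only; use stats::kmeans() in practice).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly pick k distinct rows as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Cluster assignment step: squared Euclidean distance from every row to
    # every centroid (squaring does not change which centroid is closest)
    dists <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    }, numeric(nrow(x)))
    new_assignment <- max.col(-dists, ties.method = "first")

    # Stop when no observation changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Centroid update step: each centroid becomes the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Synthetic demo data: two well-separated groups of 20 points each
set.seed(42)
demo <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
              matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(demo, k = 2)
table(fit$cluster)  # number of points in each cluster
fit$centers         # final cluster means
```

Because the starting centroids are random, the cluster labels (1 or 2) can swap between runs even when the grouping itself is the same.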

Here’s how to run k-means clustering in R.

1. Import R Libraries

As a first step, you need to load the following libraries for data formatting and plotting before you run the cluster analysis. All you need to do is type in the commands:
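The original commands are not reproduced here. A minimal sketch, assuming the two packages are factoextra (which provides fviz_nbclust(), used in step 3) and ggplot2 (which provides ggplot(), used in step 6), and that both are already installed:

```r
# Assumption: these are the packages the post relies on.
# Run once if they are not installed yet:
# install.packages(c("factoextra", "ggplot2"))

library(factoextra)  # fviz_nbclust() for choosing the number of clusters
library(ggplot2)     # ggplot(), geom_point(), geom_vline() for plotting
```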


And, you are ready to go.

2. Load the Data Frame

Next, load the data frame using the command below.
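The original command is not reproduced here. A minimal sketch using read.csv() from base R follows; the file name customers.csv and the column names are hypothetical, and the first half of the block only writes a small stand-in file so that the example runs on its own.

```r
# Stand-in data so the sketch is runnable; in practice, skip this part and
# point csv_path at your own file.
csv_path <- file.path(tempdir(), "customers.csv")  # hypothetical file name
set.seed(1)
demo <- data.frame(
  customer_id = 1:8,
  region      = rep(c("East", "West"), 4),
  segment     = rep(c("A", "B"), each = 4),
  tenure      = sample(1:10, 8, replace = TRUE),
  spend       = round(runif(8, 100, 1000)),       # column 5
  visits      = sample(1:30, 8, replace = TRUE)   # column 6
)
write.csv(demo, csv_path, row.names = FALSE)

# Load the .csv file into a data frame
df <- read.csv(csv_path, header = TRUE)
str(df)          # check column names and types
head(df[, 5:6])  # the two columns used for clustering later in the post
```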


This loads the data from your .csv file into R, and you are ready to run the cluster analysis.

3. Estimate the Optimal Number of Clusters

fviz_nbclust(df[, 5:6], kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)

This command plots the total within-cluster sum of squares (WSS) for different values of k. For this blog, you can ignore the actual sum-of-squares values and focus on the “point of inflection”, often called the elbow, where adding another cluster stops reducing the WSS by much. In the graph shown below, the point of inflection is at 4, so 4 is the optimal number of clusters for this data.
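To see what the curve is made of, the same quantity can be computed by hand with base R. This is a sketch on synthetic data standing in for df[, 5:6] (four groups, so the elbow lands at 4 by construction); it is not how fviz_nbclust() is implemented internally.

```r
set.seed(123)
# Synthetic stand-in for df[, 5:6]: four groups of 50 points each
true_centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
x <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(50, true_centers[i, 1], 0.5),
        rnorm(50, true_centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k where the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
abline(v = 4, lty = 2)  # dashed line at the elbow, like geom_vline() above
```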


Another way to determine the right number of clusters is by “trial and error”. You run the cluster analysis for 4, 5, 6, or other values of k, depending on your business problem, and select the number whose clusters tell you the most useful story.
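A rough sketch of the trial-and-error approach: fit several candidate values of k and compare the cluster sizes and cluster means side by side. The data and the column names spend and visits are made up for illustration.

```r
set.seed(123)
# Hypothetical customer data: two numeric columns
x <- data.frame(
  spend  = c(rnorm(40, 200, 30), rnorm(40, 500, 30), rnorm(40, 900, 30)),
  visits = c(rnorm(40, 5, 1),    rnorm(40, 12, 1),   rnorm(40, 25, 1))
)

# Fit k-means for each candidate k and keep the results
candidate_k <- 4:6
fits <- lapply(candidate_k, function(k) kmeans(x, centers = k, nstart = 25))

# Compare what each solution says about the customers
for (i in seq_along(fits)) {
  cat("k =", candidate_k[i], "| cluster sizes:", fits[[i]]$size, "\n")
  print(round(fits[[i]]$centers, 1))
}
```

Very small clusters, or clusters whose means barely differ, are a hint that k may be too large for the business question at hand.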

4. Run Cluster Analysis

Once you decide the number of clusters, use this code to run cluster analysis.

clusterdata <- kmeans(df[, 5:6], centers = 4, nstart = 25)

The simplified form of the call is:

kmeans(x, centers, nstart = 1)

  • x: the numeric data to cluster, here “df[,5:6]”, where 5 and 6 are the columns that hold the data
  • centers: either the number of clusters (k) or a set of initial (distinct) cluster centers. In this case it is 4, so kmeans chooses 4 random rows of the data as the initial centers.
  • nstart: the number of random starting partitions to try when centers is a number. Trying nstart > 1 is often recommended. In this example, it is set to 25.
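As an illustration of these arguments, the sketch below fits kmeans() on made-up data (the column names spend and visits are assumptions) and looks at the main parts of the result. It also shows the second form of centers, where a matrix of starting centers is passed instead of a number.

```r
set.seed(123)
# Hypothetical data: two clearly separated customer groups
x <- data.frame(
  spend  = c(rnorm(30, 200, 20), rnorm(30, 800, 20)),
  visits = c(rnorm(30, 5, 1),    rnorm(30, 20, 1))
)

# centers as a number: 2 clusters, best of 25 random starts
fit <- kmeans(x, centers = 2, nstart = 25)

fit$cluster       # cluster number assigned to each row
fit$centers       # one row per cluster mean
fit$size          # number of rows in each cluster
fit$tot.withinss  # total within-cluster sum of squares

# centers as a matrix: rows 1 and 31 are used as the initial centers
init <- as.matrix(x[c(1, 31), ])
fit2 <- kmeans(x, centers = init)
fit2$size
```

Because k-means relies on Euclidean distance, columns measured on very different scales are often standardized first, for example with scale(x), so that one column does not dominate the distance.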

5. Save Cluster Membership

When you run a cluster analysis, it’s important to save the cluster membership for each record (customer) in the analysis. The code below will assign a cluster membership to each customer.
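The original code is not reproduced here. A minimal sketch, assuming the membership is stored in a column named Cluster (the name the plotting code in step 6 refers to) and using made-up data in place of the real customer file:

```r
set.seed(123)
# Hypothetical customer data standing in for the real df
df <- data.frame(
  customer_id = 1:60,
  spend       = c(rnorm(30, 200, 20), rnorm(30, 800, 20)),
  visits      = c(rnorm(30, 5, 1),    rnorm(30, 20, 1))
)
clusterdata <- kmeans(df[, c("spend", "visits")], centers = 2, nstart = 25)

# kmeans() returns one cluster number per row, in the same row order as the
# input, so it can be attached directly as a new column
df$Cluster <- as.factor(clusterdata$cluster)

head(df)           # each customer now carries a cluster label
table(df$Cluster)  # number of customers per cluster
```

Storing the label as a factor means plots and summaries treat it as a category rather than as a number.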


6. Plot the Data to See the Clusters

Finally, plot the data to see how the clusters are formed, and eyeball the plot to make sure that the clusters make sense.

cluster_plot <- ggplot(df, aes(V1, V2, color = factor(Cluster))) + geom_point()
print(cluster_plot)
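A self-contained version of this plot, with the cluster means drawn on top, might look like the sketch below. The data is synthetic, the column names V1 and V2 simply mirror the ones used above, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

set.seed(123)
# Synthetic two-column data standing in for the clustered columns
df <- data.frame(
  V1 = c(rnorm(30, 0), rnorm(30, 6)),
  V2 = c(rnorm(30, 0), rnorm(30, 6))
)
clusterdata <- kmeans(df[, c("V1", "V2")], centers = 2, nstart = 25)
df$Cluster <- factor(clusterdata$cluster)

# Cluster means as a data frame so they can be added as their own layer
centers <- as.data.frame(clusterdata$centers)

cluster_plot <- ggplot(df, aes(V1, V2, color = Cluster)) +
  geom_point() +
  # Black crosses mark the cluster means
  geom_point(data = centers, aes(V1, V2), inherit.aes = FALSE,
             color = "black", shape = 4, size = 4) +
  labs(color = "Cluster")
print(cluster_plot)
```

Mapping color to a factor gives each cluster a distinct color instead of a continuous gradient.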

To conclude, k-means cluster analysis is a practical method for segmenting customers on continuous variables. I’ll try to cover CHAID using R in the near future for segmentation with discrete or categorical variables.