CLUSTERING IN R

Description of Clustering

Clustering is a type of unsupervised learning. An unsupervised learning method draws inferences from datasets consisting of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In other words, it groups objects on the basis of the similarity and dissimilarity between them.
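
To make the idea of "more similar within a group than between groups" concrete, the following minimal sketch compares average pairwise distances. The two groups are made-up synthetic points, and the names group_a, group_b, mean_within and mean_between are illustrative only.

```r
# Two hypothetical, well-separated groups of 2-D points (synthetic data)
set.seed(1)
group_a <- matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2)
group_b <- matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2)
points  <- rbind(group_a, group_b)
labels  <- rep(c("a", "b"), each = 10)

d    <- as.matrix(dist(points))          # pairwise Euclidean distances
same <- outer(labels, labels, "==")      # TRUE where both points share a group
diag(same) <- NA                         # ignore each point's distance to itself

mean_within  <- mean(d[which(same)])     # average distance inside a group
mean_between <- mean(d[which(!same)])    # average distance across groups

print(c(within = mean_within, between = mean_between))
```

For a good grouping, the within-group average should be clearly smaller than the between-group average, which is the property a clustering algorithm tries to achieve without being given the labels.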

Role / Importance

The application of cluster analysis in data mining has two main aspects. First, it can be used as a pre-processing step for other algorithms, such as feature selection and classification algorithms, and for further correlation analysis. Second, it can be used as a stand-alone tool to examine the distribution of the data, observe the features of each cluster, and then focus on a specific cluster for further analysis. Cluster analysis is applied in market segmentation, customer targeting, performance assessment, the classification of biological species, and similar areas.

PROBLEM - Bird Data Set

Source Code

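# Load the bird measurements (adjust the path to your copy of the file)
bird <- read.csv('C:/sk/bird.csv')

# Drop rows with missing measurements, if any; kmeans() stops on NA values
bird <- na.omit(bird)

names(bird)

# Keep the ten numeric measurement columns (columns 2 to 11)
bird.new <- bird[, 2:11]

# Set the known class label aside; it is not used for clustering
bird.class <- bird[, "type"]

bird.new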

# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
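
As a quick check of what min-max normalization does, here is a small worked example. The vector lengths_mm is a hypothetical set of values, not taken from the bird data.

```r
# Same min-max formula as in the listing
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

lengths_mm <- c(20, 35, 50, 80)          # hypothetical measurements
scaled <- normalize(lengths_mm)
print(scaled)                            # 0.00 0.25 0.50 1.00

# A single missing value turns the whole result into NA,
# because min() and max() return NA unless na.rm = TRUE is set
with_gap <- normalize(c(20, NA, 80))
print(with_gap)                          # NA NA NA
```

The smallest value maps to 0 and the largest to 1, so every variable contributes on the same scale when distances are computed. Missing values need to be handled before normalizing.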

# Normalize every column of bird.new
# (huml, humw, ulnal, ulnaw, feml, femw, tibl, tibw, tarl, tarw)
bird.new[] <- lapply(bird.new, normalize)

head(bird.new)

set.seed(123) # k-means starts from random centers; fix the seed for reproducible results

result <- kmeans(bird.new, centers = 3) # apply k-means with k = 3 clusters

result$size # number of records in each cluster

result # full summary of the fit

result$centers # coordinates of the cluster centers (3 rows for k = 3)

result$cluster # cluster assignment of each record
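
The listing sets the known bird type aside in bird.class. One plausible use for such a label vector is to cross-tabulate it against the cluster assignments. The sketch below shows the idea on R's built-in iris data as a stand-in, since the bird file is not available here. The names features, fit and agreement are illustrative, and nstart = 25 is an assumption that is not part of the original listing.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(iris[, 1:4], normalize))

set.seed(42)
# nstart = 25 repeats k-means from 25 random starts and keeps the best fit
fit <- kmeans(features, centers = 3, nstart = 25)

# Rows: known class, columns: cluster number assigned by k-means
agreement <- table(species = iris$Species, cluster = fit$cluster)
print(agreement)

# Total within-cluster sum of squares; lower means tighter clusters
print(fit$tot.withinss)
```

Cluster numbers are arbitrary, so the table is read by looking for rows that are concentrated in a single column rather than by matching numbers to classes.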

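library(cluster) # provides clusplot()

# Plot the normalized features that were actually clustered
clusplot(bird.new, result$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)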
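
For data with more than two variables, clusplot() draws the points in a two-dimensional projection based on principal components, as described in its documentation. The sketch below reproduces that idea by hand with prcomp() and base graphics. It uses the built-in iris data as a stand-in, and the names pca, scores and explained are illustrative.

```r
# Stand-in data: four numeric columns of the built-in iris dataset
features <- iris[, 1:4]

set.seed(42)
fit <- kmeans(features, centers = 3, nstart = 25)

# Project the observations onto the first two principal components
pca    <- prcomp(features)
scores <- pca$x[, 1:2]

# Share of total variance captured by the two plotted components
explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)

plot(scores, col = fit$cluster, pch = 19,
     xlab = "Component 1", ylab = "Component 2",
     main = sprintf("k-means clusters (%.1f%% of variance shown)",
                    100 * explained))
```

If the two components explain only a small share of the variance, clusters that look overlapping in the plot may still be well separated in the full feature space.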

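# Hierarchical clustering on columns 2 and 3, using complete linkage (the hclust default)
clusters <- hclust(dist(bird[, 2:3]))

plot(clusters) # dendrogram

# Same data with average linkage
clusters <- hclust(dist(bird[, 2:3]), method = "average")

plot(clusters) # dendrogram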
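
A dendrogram shows the merge order but not a final grouping. A minimal sketch of turning the tree into cluster labels with cutree() follows. It again uses the first two columns of the built-in iris data as a stand-in, and k = 3 is an assumption chosen to mirror the k-means step.

```r
# Stand-in data: first two numeric columns of the built-in iris dataset
d <- dist(iris[, 1:2])

tree_complete <- hclust(d)                      # complete linkage (default)
tree_average  <- hclust(d, method = "average")  # average linkage

# Cut each tree so that exactly three groups remain
groups_complete <- cutree(tree_complete, k = 3)
groups_average  <- cutree(tree_average, k = 3)

# Compare how the two linkage methods partition the same observations
comparison <- table(complete = groups_complete, average = groups_average)
print(comparison)
```

The two linkage methods can assign the same observation to different groups, so the table is a simple way to see how sensitive the grouping is to the choice of method.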

library(ggplot2)