CLUSTERING IN R - Bird Data Set

     

 CLUSTERING IN R



Description of Clustering

It is basically a type of unsupervised learning method . An unsupervised learning method is a method in which we draw references from datasets consisting of input data without labeled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them.


Role / Importance

The application of cluster analysis in data mining has two main aspects: first, clustering analysis can be used as a pre-processing step for the other algorithms such as features and classification algorithm, and also can be used for further correlation analysis. Second, it can be used as a stand-alone tool in order to get the data distribution, to observe each cluster features, then focus on a specific cluster for some further analysis. Cluster analysis can be available in market segmentation, target customer orientation, performance assessment, biological species etc.


PROBLEM - Bird Data Set

Source Code

bird <- read.csv('C:/sk/bird.csv')

names(bird)


bird.new<- bird[,c(2,3,4,5,6,7,8,9,10,11)]

bird.class<- bird[,"type"]

bird.new


normalize <- function(x){

    return ((x-min(x))/(max(x)-min(x)))

}


bird.new$huml<- normalize(bird.new$huml)

bird.new$humw<- normalize(bird.new$humw)

bird.new$ulnal<- normalize(bird.new$ulnal)

bird.new$ulnaw<- normalize(bird.new$ulnaw)

bird.new$feml<- normalize(bird.new$feml)

bird.new$femw<- normalize(bird.new$femw)

bird.new$tibl<- normalize(bird.new$tibl)

bird.new$tibw<- normalize(bird.new$tibw)

bird.new$tarl<- normalize(bird.new$tarl)

bird.new$tarw<- normalize(bird.new$tarw)

head(bird.new)


result<- kmeans(bird.new,3) #apply k-means algorithm with no. of centroids(k)=3


result$size # gives no. of records in each cluster


result


result$centers # gives value of cluster center datapoint value(3 centers for k=3)


result$cluster #gives cluster vector showing the custer where each record falls


clusplot(bird,result$cluster,color = TRUE,shade = TRUE,labels = 2,lines = 0)

clusters <- hclust(dist(bird[,2:3]))

plot(clusters)


clusters <- hclust(dist(bird[,2:3]),method = "average")

plot(clusters)


library(ggplot2)


result$cluster <- as.factor(result$cluster)

ggplot(bird, aes(huml, humw, color = bird.class)) + geom_point()


result$cluster <- as.factor(result$cluster)

ggplot(bird, aes(ulnal, ulnaw, color = bird.class)) + geom_point()


result$cluster <- as.factor(result$cluster)

ggplot(bird, aes(feml, femw, color = bird.class)) + geom_point()


result$cluster <- as.factor(result$cluster)

ggplot(bird, aes(tibl, tibw, color = bird.class)) + geom_point()

result$cluster <- as.factor(result$cluster)

ggplot(bird, aes(tarl, tarw, color = bird.class)) + geom_point()

table(result$cluster,bird.class)

OUTPUT


Preprocess the dataset

Since clustering is a type of Unsupervised Learning, we would not require Class Label(type) during execution of our algorithm. We will, therefore, remove Class Attribute “type” and store it in another variable.



Normalisation is used to eliminate redundant data and ensures that good quality clusters are generated which can improve the efficiency of clustering algorithms.


Apply k-means clustering algorithm

One measurement is Within Cluster Sum of Squares (WCSS), which measures the squared average distance of all the points within a cluster to the cluster centroid.

Another measurement is Between Clusters Sum of Squares (BCSS), which measures the squared average distance between all centroids. BCSS measures the variation between all clusters.


Using 3 groups (K = 3) we had 75.1% of well-grouped data. 







Plot to see how huml and humw data points have been distributed in clusters as per bird types   


Most Swimming birds and wading birds have the longest humerus bone and other birds have short humerus bone











Most Swimming birds and wading birds have the longest ulna bone and other birds have short ulna bone

Most Swimming birds and wading birds have the longest femur bone and other birds have short femur bone


Most Swimming birds and wading birds have the longest Tibiotarsus bone and other birds have short Tibiotarsus bone


Most Swimming birds and wading birds have the medium sized Tarsometatarsus bone and other birds have short Tarsometatarsus bone



Comments