The Research Mining Technology

Saturday 19 October 2013

K-means clustering

K-means Is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The k-means algorithm takes the inputs parameter, k, and partitions a set of n objects into k cluster so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Given K, the k-means algorithm is implemented in four steps:
  • Partition objects into k nonempty subsets
  • Compute seed points as the centroids of the clusters of the current partition(the centroid is the center, i.e., mean point, of the cluster)
  • Assign each object to the cluster with the nearest seed point
  • Go back to step 2, stop when no more new assignment 
Finally this algorithm aims at minimizing an objective function, in this case a mean squared error function is calculated as:
\[E=\sum\limits_{i=0}^{k}{{{\sum }_{p\in {{c}_{i}}}}\left| p-{{m}_{i}} \right|}\]
where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional).
Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
K: the number of clusters,  
D: a data set containing n objects.
Output: A set of k cluster
Advantages: with a large number of variables, k-means may be computationally faster than hierarchical clustering (if k is small). K-means may produce tighter cluster than hierarchical clustering, especially if the cluster are globular.
Disadvantages:
·         Difficult in comparing the quality of the clusters produced.
·         Applicable only when mean is defined.
·         Need to specify k, the number of clusters, in advance.
·         Unable to handle noisy data and outliers.  
·         Not suitable to discover clusters with non-convex shapes. 
Share:

0 comments:

Post a Comment

Comment

BTemplates.com

Search This Blog

Powered by Blogger.

Translate

About Me

My photo
Tirunelveli, Tamil Nadu, India

Featured post

Mahalanobis Distance using R code

Mahalanobis distance is one of the standardized distance measure in statistics. It is a unit less distance measure introduced by P. C. Mah...

Weekly