K-means Is one of
the simplest unsupervised learning algorithms that solve the well known
clustering problem. The k-means algorithm takes the inputs parameter, k, and
partitions a set of n objects into k cluster so that the resulting intracluster similarity is high but the intercluster
similarity is low. Cluster similarity is measured in regard to the mean value
of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Given K, the
k-means algorithm is implemented in four steps:
- Partition objects into k nonempty subsets
- Compute seed points as the centroids of the clusters of the current partition(the centroid is the center, i.e., mean point, of the cluster)
- Assign each object to the cluster with the nearest seed point
- Go back to step 2, stop when no more new assignment
Finally this
algorithm aims at minimizing an objective function, in this case a mean squared
error function is calculated as:
\[E=\sum\limits_{i=0}^{k}{{{\sum }_{p\in {{c}_{i}}}}\left| p-{{m}_{i}} \right|}\]
\[E=\sum\limits_{i=0}^{k}{{{\sum }_{p\in {{c}_{i}}}}\left| p-{{m}_{i}} \right|}\]
where
E is the sum of the square error for all objects in the data set; p is the
point in space representing a given object; and mi is the
mean of cluster Ci (both
p and mi are
multidimensional).
Algorithm: k-means.
The
k-means algorithm for partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
K: the
number of clusters,
D: a
data set containing n objects.
Output: A set of k cluster
Advantages: with a large number of variables, k-means may be computationally
faster than hierarchical clustering (if k is small). K-means may produce
tighter cluster than hierarchical clustering, especially if the cluster are
globular.
Disadvantages:
·
Difficult
in comparing the quality of the clusters produced.
·
Applicable
only when mean is defined.
·
Need to
specify k, the number of clusters, in advance.
·
Unable
to handle noisy data and outliers.
·
Not
suitable to discover clusters with non-convex shapes.