The k-means algorithm is an algorithm to cluster n objects, based on their attributes, into k partitions, where k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, that is, the squared error function

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2

where there are k clusters S_i, i = 1, 2, ..., k, and \mu_i is the centroid, or mean point, of all the points x_j \in S_i.
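As a minimal sketch of the squared error function above, the following Python computes V for a partition that is already given. The helper names and the sample points are illustrative assumptions, not part of any particular library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def total_intra_cluster_variance(clusters):
    """V: sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(squared_distance(x, mu) for x in cluster)
    return total


# Hypothetical data: two clusters of two points each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(5.0, 0.0), (5.0, 2.0)]]
print(total_intra_cluster_variance(clusters))  # 4.0
```

Each point here lies at distance 1 from its cluster centroid, so the four squared distances sum to 4.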

The most common form of the algorithm uses an iterative refinement heuristic known as Lloyd's algorithm. Lloyd's algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm repeats these two steps in alternation until convergence, which is reached when the points no longer switch clusters (or alternatively when the centroids no longer change).
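The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name lloyd, the seed and max_iter parameters, the round-robin random initial partition, and the rule of keeping the old centroid when a cluster becomes empty are all choices made here for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until no point switches cluster."""
    rng = random.Random(seed)
    # Random initial partition; round-robin labels keep every set non-empty (needs k <= n).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: recompute the centroid of each non-empty cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # assumption: an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Assignment step: associate each point with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no point switched clusters
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = lloyd(points, k=2)
print(labels, centroids)
```

The cluster numbering depends on the random initial partition, so only the grouping of the points is meaningful, not which group is labelled 0 or 1.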

Lloyd's algorithm and k-means are often used synonymously, but Lloyd's algorithm is in fact a heuristic for solving the k-means problem[1]. With certain combinations of starting points and centroids, Lloyd's algorithm can converge to a suboptimal answer (i.e., a different solution with a lower value of the minimization function above exists).
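A small worked example makes the suboptimal convergence concrete. The sketch below assumes four points at the corners of a 4-by-1 rectangle and k = 2; the function names are hypothetical. Started from the top/bottom partition, the iteration is already at a fixed point with squared error 16, although the left/right partition has squared error 1.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_from_partition(points, labels, k, max_iter=100):
    """Lloyd iteration from a given initial partition (every cluster must start non-empty)."""
    labels = list(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = centroid(members)
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def cost(points, labels, centroids):
    """Squared error V of a clustering."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Corners of a 4-by-1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start from bottom row vs. top row: stable, but far from optimal.
stuck_labels, stuck_centroids = lloyd_from_partition(points, [0, 1, 0, 1], 2)
# Start from left side vs. right side: the better solution.
good_labels, good_centroids = lloyd_from_partition(points, [0, 0, 1, 1], 2)

print(cost(points, stuck_labels, stuck_centroids))  # 16.0
print(cost(points, good_labels, good_centroids))    # 1.0
```

In the top/bottom start, every point is at squared distance 4 from its own centroid and 5 from the other one, so no point ever switches cluster.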

Other variations exist[2], but Lloyd's algorithm has remained popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points. Recently, however, David Arthur and Sergei Vassilvitskii showed that there exist certain point sets on which k-means takes superpolynomial time, 2^{\Omega(\sqrt{n})}, to converge.[3]

Approximate k-means algorithms have been designed that make use of coresets: small subsets of the original data.

In terms of performance the algorithm is not guaranteed to return a global optimum. The quality of the final solution depends largely on the initial set of clusters, and may, in practice, be much poorer than the global optimum. Since the algorithm is extremely fast, a common method is to run the algorithm several times and return the best clustering found.
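The restart strategy can be sketched as below. This is an illustration under assumptions: the names lloyd and best_of_restarts, the restarts and seed parameters, and the random round-robin initial partition are choices made for the example, and the compact Lloyd iteration is included only to keep the block self-contained.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, k, rng, max_iter=100):
    """One run from a random initial partition; returns (cost, labels)."""
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    cost = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return cost, labels


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run Lloyd's algorithm several times and keep the lowest-cost clustering."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


# Hypothetical data: corners of a 4-by-1 rectangle, where some starts get stuck.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
best_cost, best_labels = best_of_restarts(points, k=2)
print(best_cost, best_labels)
```

Restarts only improve the odds of finding a good clustering; they do not guarantee the global optimum.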

A drawback of the k-means algorithm is that the number of clusters k is an input parameter. An inappropriate choice of k may yield poor results. The algorithm also assumes that the variance is an appropriate measure of cluster scatter.

Demonstration of the algorithm

The following images demonstrate the k-means clustering algorithm in action, for the two-dimensional case. The initial centres are generated randomly to demonstrate the stages in more detail.

Applications of the algorithm

Image Segmentation

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.[4]
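One simple form such a weighted measure could take is a per-feature weighted squared distance over a vector of pixel coordinates and RGB values. The sketch below is an assumption made for illustration: the feature layout (x, y, r, g, b), the weight values, and the function names are hypothetical, and texture features are left out.

```python
def weighted_squared_distance(p, q, weights):
    """Sum of w_i * (p_i - q_i)^2 over all feature dimensions."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q))


def nearest_center(pixel, centers, weights):
    """Index of the center closest to the pixel under the weighted distance."""
    return min(
        range(len(centers)),
        key=lambda i: weighted_squared_distance(pixel, centers[i], weights),
    )


# Features are (x, y, r, g, b); all values are made up for the example.
pixel = (10, 12, 200, 30, 30)
centers = [
    (60, 60, 210, 25, 35),  # similar color, far away in the image
    (11, 12, 150, 30, 30),  # adjacent pixel position, different color
]

unweighted = (1.0, 1.0, 1.0, 1.0, 1.0)
color_heavy = (0.1, 0.1, 1.0, 1.0, 1.0)  # hypothetical: position counts for less

print(nearest_center(pixel, centers, unweighted))   # 1
print(nearest_center(pixel, centers, color_heavy))  # 0
```

Down-weighting the coordinates changes the assignment from the spatially adjacent center to the one with the more similar color. Weighting a squared distance this way is equivalent to rescaling each feature by the square root of its weight before running ordinary k-means.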

Relation to PCA

It has been shown recently[5][6] that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by the PCA (principal component analysis) principal components, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix.

Enhancements

In 2006 a new way of choosing the initial centers, dubbed "k-means++", was proposed[1]. The idea is to select centers in such a way that they are already initially close to large quantities of points. The authors use the L2 norm in selecting the centers, but a general Ln norm may be used to tune the aggressiveness of the seeding.

This seeding method yields considerable improvements in the final error of k-means. Although the initial selection takes considerable time, k-means itself converges very quickly after this seeding, and thus the seeding actually lowers the computation time too. The authors tested their method with real and synthetic datasets and typically obtained 2-fold to 10-fold improvements in speed, and for certain datasets close to 1000-fold improvements in error. Their tests almost always showed the new method to be at least as good as vanilla k-means in both speed and error.

Additionally, the authors calculate an approximation ratio for their algorithm. This has not been done for vanilla k-means, although it has been done for several variations of it. k-means++ guarantees an approximation ratio of O(log k), where k is the number of clusters used.
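The seeding step of this section can be sketched as follows. The sketch assumes the rule described in the cited paper[1]: the first center is chosen uniformly at random, and each further center is drawn with probability proportional to the squared distance from a point to its nearest already-chosen center. The function name kmeans_pp_seeds and the seed parameter are hypothetical, and the input is assumed to contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, seed=0):
    """Choose k initial centers from the input points by squared-distance weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        # Points that are already centers get weight 0 and cannot be drawn again.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(kmeans_pp_seeds(points, k=2))
```

The returned centers would then serve as the starting centroids for Lloyd's algorithm in place of a random initial partition.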

Variations

The set of squared error minimizing cluster functions also includes the K-medoids algorithm, an approach which forces the center point of each cluster to be one of the actual points.
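The difference between a centroid and a medoid can be illustrated for a single cluster. This sketch covers only the choice of center, not a full K-medoids procedure, and it assumes the squared Euclidean distance as the dissimilarity; the function names and points are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Mean point; generally not one of the input points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def medoid(cluster):
    """The actual point with the smallest total squared distance to all members."""
    return min(
        cluster,
        key=lambda candidate: sum(squared_distance(candidate, x) for x in cluster),
    )


# Hypothetical cluster with one distant point.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(centroid(cluster))  # (1.75, 1.5)
print(medoid(cluster))    # (2.0, 0.0)
```

The centroid falls between the data points, while the medoid is constrained to be one of them.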

References

  1. D. Arthur, S. Vassilvitskii. "k-means++: The Advantages of Careful Seeding". 2007 Symposium on Discrete Algorithms (SODA).
  2. T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. "An efficient k-means clustering algorithm: Analysis and implementation". IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (2002), 881-892.
  3. David Arthur & Sergei Vassilvitskii (2006). "How Slow is the k-means Method?". Proceedings of the 2006 Symposium on Computational Geometry (SoCG).
  4. Shapiro, Linda G. & Stockman, George C. (2001). Computer Vision. Upper Saddle River, NJ: Prentice Hall.
  5. H. Zha, C. Ding, M. Gu, X. He and H. D. Simon. "Spectral Relaxation for K-means Clustering". Neural Information Processing Systems vol. 14 (NIPS 2001), pp. 1057-1064. Vancouver, Canada, Dec. 2001.
  6. Chris Ding and Xiaofeng He. "K-means Clustering via Principal Component Analysis". Proc. of Int'l Conf. Machine Learning (ICML 2004), pp. 225-232. July 2004.

See also

Clustering: the classification of objects into different groups, or more precisely the partitioning of a data set into subsets (clusters).
Linde-Buzo-Gray algorithm: a vector quantization algorithm to derive a good codebook.
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org