Newsgroups: sci.stat.edu Date: Mon, 5 Aug 1996 18:13:29 -0400 Message-ID: <DvoqMw.1z7@unx.sas.com> From: saswss@unx.sas.com (Warren Sarle) Subject: Re: "best" number of clusters? Lines: 443 In article <v01540b00ae2bfff45c99@[129.71.130.223]>, perdue@wvsvax.wvnet.edu (Charles Perdue) writes: |> When performing cluster analysis (K-means or hierarchical), is there a |> commonly accepted criterion for choosing the "best-fitting" solution? I am |> somewhat familiar with the methods used for choosing the appropriate |> dimensionality in PCA (scree, Kaiser, ML, etc) and MDS, but am not sure |> what to look for when comparing the adequacy of various clusterings. |> |> Maybe another way to frame this question is in terms of "confirmatory" |> cluster analysis: How would you determine if a given set of clusters was |> an "acceptable" fit to the data? Adapted Mar 16, 1996, from the SAS/STAT User's Guide (1990) and Sarle and Kuo (1993). Copyright 1996 by SAS Institute Inc, Cary, NC, USA. The Number of Clusters ---------------------- There are no completely satisfactory methods for determining the number of population clusters for any type of cluster analysis (Everitt 1979, 1980; Hartigan 1985; Bock 1985). If your purpose in clustering is dissection, that is, to summarize the data without trying to uncover real clusters, it may suffice to look at R^2 for each variable and pooled over all variables. Plots of R^2 against the number of clusters are useful. It is always a good idea to look at your data graphically. If you have only two or three variables, use PROC PLOT to make scatterplots identifying the clusters. With more variables, use PROC CANDISC to compute canonical variables for plotting. Ordinary significance tests, such as analysis-of-variance F tests, are not valid for testing differences between clusters. Since clustering methods attempt to maximize the separation between clusters, the assumptions of the usual significance tests, parametric or nonparametric, are drastically violated. For example, if you take a sample of 100 observations from a single univariate normal distribution, have PROC FASTCLUS divide it into two clusters, and run a t test between the clusters, you usually obtain a probability level of less than 0.0001. For the same reason, methods that purport to test for clusters against the null hypothesis that objects are assigned randomly to clusters (McClain and Rao 1975; Klastorin 1983) are useless. Most valid tests for clusters either have intractable sampling distributions or involve null hypotheses for which rejection is uninformative. For clustering methods based on distance matrices, a popular null hypothesis is that all permutations of the values in the distance matrix are equally likely (Ling 1973; Hubert 1974). Using this null hypothesis, you can do a permutation test or a rank test. The trouble with the permutation hypothesis is that with any real data, the null hypothesis is implausible even if the data do not contain clusters. Rejecting the null hypothesis does not provide any useful information (Hubert and Baker 1977). Another common null hypothesis is that the data are a random sample from a multivariate normal distribution (Wolfe 1970, 1978; Duda and Hart 1973; Lee 1979). The multivariate normal null hypothesis is better than the permutation null hypothesis, but it is not satisfactory because there is typically a high probability of rejection if the data are sampled from a distribution with lower kurtosis than a normal distribution, such as a uniform distribution. The tables in Englemann and Hartigan (1969), for example, generally lead to rejection of the null hypothesis when the data are sampled from a uniform distribution. Hawkins, Muller, and ten Krooden (1982,337-340) discuss a highly conservative Bonferroni method for hypothesis testing. The conservativeness of this approach may compensate to some extent for the liberalness exhibited by tests based on normal distributions when the population is uniform. Perhaps a better null hypothesis is that the data are sampled from a uniform distribution (Hartigan 1978; Arnold 1979; Sarle 1983). The uniform null hypothesis leads to conservative error rates when the data are sampled from a strongly unimodal distribution such as the normal. However, in two or more dimensions and depending on the test statistic, the results can be very sensitive to the shape of the region of support of the uniform distribution. Sarle (1983) suggests using a hyperbox with sides proportional in length to the singular values of the centered coordinate matrix. Given that the uniform distribution provides an appropriate null hypothesis, there are still serious difficulties in obtaining sampling distributions. Some asymptotic results are available (Hartigan 1978, 1985; Pollard 1981; Bock 1985) for the within-cluster sum of squares, the criterion that PROC FASTCLUS and Ward's minimum variance method attempt to optimize. No distributional theory for finite sample sizes has yet appeared. Currently, the only practical way to obtain sampling distributions for realistic sample sizes is by computer simulation. Arnold (1979) used simulation to derive tables of the distribution of a criterion based on the determinant of the within-cluster sum of squares matrix |W|. Both normal and uniform null distributions were used. Having obtained clusters with either FASTCLUS or CLUSTER, you can compute Arnold's criterion with the ANOVA or CANDISC procedure. Arnold's tables provide a conservative test because FASTCLUS and CLUSTER attempt to minimize the trace of W rather than the determinant. Marriott (1971, 1975) also gives useful information on |W| as a criterion for the number of clusters. Sarle (1983) used extensive simulations to develop the cubic clustering criterion (CCC), which can be used for crude hypothesis testing and estimating the number of population clusters. The CCC is based on the assumption that a uniform distribution on a hyperrectangle will be divided into clusters shaped roughly like hypercubes. In large samples that can be divided into the appropriate number of hypercubes, this assumption gives very accurate results. In other cases the approximation is generally conservative. For details about the interpretation of the CCC, consult Sarle (1983). Milligan and Cooper (1985) and Cooper and Milligan (1984) compared thirty methods for estimating the number of population clusters using four hierarchical clustering methods. The three criteria that performed best in these simulation studies with a high degree of error in the data were a pseudo F statistic developed by Calinski and Harabasz (1974), a statistic referred to as J_e(2)/J_e(1) by Duda and Hart (1973) that can be transformed into a pseudo t^2 statistic, and the cubic clustering criterion. The pseudo F statistic and the CCC are printed by FASTCLUS; these two statistics and the pseudo t^2 statistic, which can be applied only to hierarchical methods, are printed by CLUSTER. It may be advisable to look for consensus among the three statistics, that is, local peaks of the CCC and pseudo F statistic combined with a small value of the pseudo t^2 statistic and a larger pseudo t^2 for the next cluster fusion. It must be emphasized that these criteria are appropriate only for compact or slightly elongated clusters, preferably clusters that are roughly multivariate normal. Mixture models, usually Gaussian, provide a useful statistical model for cluster analysis. Hypothesis testing requires bootstrapping except in special cases (Titterington, Smith, and Makov 1985; McLachlan, and Basford 1988); The Bayesian approach is promising for a variety of mixture models, both Gaussian and non-Gaussian. See Binder (1978, 1981) and Banfield and Raftery (1993). Recent research has tended to deemphasize mixture models in favor of nonparametric models in which clusters correspond to modes in the probability density function. Hartigan and Hartigan (1985) and Hartigan (1985) developed a test of unimodality vs. bimodality in the univariate case. Nonparametric tests for the number of clusters can also be based on nonparametric density estimates. This approach requires much weaker assumptions than mixture models, namely, that the observations are sampled independently and that the distribution can be estimated nonparametrically. Silverman (1986) describes a bootstrap test for the number of modes using a Gaussian kernel density estimate, but problems have been reported with this method under the uniform null distribution (Marron, S., personal communication). Further developments in nonparametric methods are given by Mueller and Sawitzki (1991), Minnotte (1992), and Polonik (1993). All of these methods suffer from heavy computational requirements. One useful descriptive approach to the number-of-clusters problem is provided by Wong and Schaack (1982), based on a kth-nearest-neighbor density estimate. The kth-nearest-neighbor clustering method developed by Wong and Lane (1983) is applied with varying values of k. Each value of k yields an estimate of the number of modal clusters. If the estimated number of modal clusters is constant for a wide range of k values, there is strong evidence of at least that many modes in the population. A plot of the estimated number of modes against k can be highly informative. Attempts to derive a formal hypothesis test from this diagnostic plot have met with difficulties, but a simulation approach similar to Silverman's (1986) does seem to work (Girman 1994). The simulation, of course, requires considerable computer time. Sarle and Kuo (1993) document a less expensive approximate nonparametric test for the number of clusters that has been implemented in the MODECLUS procedure in the SAS/STAT product. This test sacrifices statistical efficiency for computational efficiency. The method for conducting significance tests is as follows: * Estimate densities using fixed-radius uniform kernels. * Obtain preliminary clusters by a "valley-seeking" method. Other clustering methods could be used but would yield less power. * Compute an approximate p-value for each cluster by comparing the estimated maximum density in the cluster with the estimated maximum density on the cluster boundary. * Repeatedly join the least significant cluster with a neighboring cluster until all remaining clusters are significant. * Estimate the number of population clusters as the number of significant sample clusters. * The proceeding steps can be repeated for any number of different radii, and the estimate of the number of population clusters can be taken to be the maximum number of significant sample clusters for any radius. This method has the following useful features: * No distributional assumptions are required. * The choice of smoothing parameter is not critical since you can try any number of different values. * The data can be coordinates or distances. * Time and space requirements for the significance tests are no worse than those for obtaining the clusters. * The power is high enough to be useful for practical purposes. The method for computing the p-values is based on a series of plausible approximations. There are as yet no rigorous proofs that the method is infallible. Neither are there any asymptotic results. However, simulations for sample sizes ranging from 20 to 2000 indicate that the p-values are almost always conservative. The only case discovered so far in which the p-values are liberal is a uniform distribution in one dimension for which the simulated error rates exceed the nominal significance level only slightly for a limited range of sample sizes. References ---------- Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc. Arnold, S.J. (1979), "A Test for Clusters," Journal of Marketing Research, 16,545-551. Art, D., Gnanadesikan, R., and Kettenring, R. (1982), "Data-based Metrics for Cluster Analysis," Utilitas Mathematica, 21A, 75-99. Banfield, J.D. and Raftery, A.E. (1993), "Model-Based Gaussian and Non- Gaussian Clustering", Biometrics, 49, 803-821. Barnett, V., ed. (1981), Interpreting Multivariate Data, New York: John Wiley & Sons, Inc. Binder, D.A. (1978), "Bayesian Cluster Analysis," Biometrika, 65, 31-38. Binder, D.A. (1981), "Approximations to Bayesian Clustering Rules," Biometrika, 68, 275-285. Blashfield, R.K. and Aldenderfer, M.S. (1978), "The Literature on Cluster Analysis," Multivariate Behavioral Research, 13, 271-295. Bock, H.H. (1985), "On Some Significance Tests in Cluster Analysis," Journal of Classification, 2, 77-108. Calinski, T. and Harabasz, J. (1974), "A Dendrite Method for Cluster Analysis," Communications in Statistics, 3, 1-27. Cooper, M.C. and Milligan, G.W. (1988), "The Effect of Error on Determining the Number of Clusters," Proceedings of the International Workshop on Data Analysis, Decision Support and Expert Knowledge Representation in Marketing and Related Areas of Research, 319-328. Duda, R.O. and Hart, P.E. (1973), Pattern Classification and Scene Analysis, New York: John Wiley & Sons, Inc. Duran, B.S. and Odell, P.L. (1974), Cluster Analysis, New York: Springer-Verlag. Englemann, L. and Hartigan, J.A. (1969), "Percentage Points of a Test for Clusters," Journal of the American Statistical Association, 64, 1647-1648. Everitt, B.S. (1979), "Unresolved Problems in Cluster Analysis," Biometrics, 35, 169-181. Everitt, B.S. and Hand, D.J. (1981), Finite Mixture Distributions, New York: Chapman and Hall. Girman, C.J. (1994), "Cluster Analysis and Classification Tree Methodologyas an Aid to Improve Understandinh of Benign Prostatic Hyperplasia," Ph.D. thesis, Chapel Hill, NC: Department of Biostatistics, University of North Carolina. Gitman, I. (1973), "An Algorithm for Nonsupervised Pattern Classification,` IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, 66-74. Good, I.J. (1977), "The Botryology of Botryology," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc. Harman, H.H. (1976), Modern Factor Analysis, 3d Edition, Chicago: University of Chicago Press. Hartigan, J.A. (1975), Clustering Algorithms, New York: John Wiley & Sons, Inc. Hartigan, J.A. (1977), "Distribution Problems in Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc. Hartigan, J.A. (1978), "Asymptotic Distributions for Clustering Criteria," Annals of Statistics, 6, 117-131. Hartigan, J.A. (1981), "Consistency of Single Linkage for High-Density Clusters," Journal of the American Statistical Association, 76, 388-394. Hartigan, J.A. (1985), "Statistical Theory in Clustering," Journal of Classification, 2, 63-76. Hartigan, J.A. and Hartigan, P.M. (1985), "The Dip Test of Unimodality,` Annals of Statistics, 13, 70-84. Hartigan, P.M. (1985), "Computation of the Dip Statistic to Test for Unimodality," Applied Statistics, 34, 320-325. Hawkins, D.M., Muller, M.W., and ten Krooden, J.A. (1982), "Cluster Analysis," in Topics in Applied Multivariate Analysis, ed. D.M. Hawkins, Cambridge: Cambridge University Press. Hubert, L. (1974), "Approximate Evaluation Techniques for the Single-Link and Complete-Link Hierarchical Clustering Procedures," Journal of the American Statistical Association, 69, 698-704. Hubert, L.J. and Baker, F.B. (1977), "An Empirical Comparison of Baseline Models for Goodness-of-Fit in r-Diameter Hierarchical Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc. Huizinga, D. H. (1978), "A Natural or Mode Seeking Cluster Analysis Algorithm,` Technical Report 78-1, Behavioral Research Institute, 2305 Canyon Blvd., Boulder, Colorado 80302. Kaufmann, L. and Rousseeuw, P.J. (1990), Finding Groups in Data, New York: John Wiley & Sons, Inc. Klastorin, T.D. (1983), "Assessing Cluster Analysis Results," Journal of Marketing Research, 20, 92-98. Koontz, W.L.G. and Fukunaga, K. (1972a), "A Nonparametric Valley-Seeking Technique for Cluster Analysis,` IEEE Transactions on Computers, C-21, 171-178. Koontz, W.L.G. and Fukunaga, K. (1972b), "Asymptotic Analysis of a Nonparametric Clustering Technique,` IEEE Transactions on Computers, C-21, 967-974. Koontz, W.L.G., Narendra, P.M., and Fukunaga, K. (1976), "A Graph-Theoretic Approach to Nonparametric Cluster Analysis,` IEEE Transactions on Computers, C-25, 936-944. Lee, K.L. (1979), "Multivariate Tests for Clusters," Journal of the American Statistical Association, 74, 708-714. Ling, R.F (1973), "A Probability Theory of Cluster Analysis," Journal of the American Statistical Association, 68, 159-169. MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297. Marriott, F.H.C. (1971), "Practical Problems in a Method of Cluster Analysis," Biometrics, 27, 501-514. Marriott, F.H.C. (1975), "Separating Mixtures of Normal Distributions," Biometrics, 31, 767-769. Massart, D.L. and Kaufman, L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York: John Wiley & Sons, Inc. McClain, J.O. and Rao, V.R. (1975), "CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects," Journal of Marketing Research, 12, 456-460. McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, New York: Marcel Dekker, Inc. Mezzich, J.E and Solomon, H. (1980), Taxonomy and Behavioral Science, New York: Academic Press, Inc. Milligan, G.W. (1980), "An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms," Psychometrika, 45, 325-342. Milligan, G.W. (1981), "A Review of Monte Carlo Tests of Cluster Analysis," Multivariate Behavioral Research, 16, 379-407. Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures for Determining the Number of Clusters in a Data Set," Psychometrika, 50, 159-179. Minnotte, M.C. (1992), "A Test of Mode Existence with Applications to Multimodality,` Ph.D. thesis, Rice University, Department of Statistics. Mizoguchi, R. and Shimura, M. (1980), "A Nonparametric Algorithm for Detecting Clusters Using Hierarchical Structure,` IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2, 292-300. Mueller, D.W. and Sawitzki, G. (1991), "Excess mass estimates and tests for multimodality,` JASA 86, 738-746. Pollard, D. (1981), "Strong Consistency of k-Means Clustering," Annals of Statistics, 9, 135-140. Polonik, W. (1993), "Measuring Mass Concentrations and Estimating Density Contour Clusters--An Excess Mass Approach," Technical Report, Beitraege zur Statistik Nr. 7, Universitaet Heidelberg. Sarle, W.S. (1982), "Cluster Analysis by Least Squares," Proceedings of the Seventh Annual SAS Users Group International Conference, 651-653. Sarle, W.S. (1983), Cubic Clustering Criterion, SAS Technical Report A-108, Cary, NC: SAS Institute Inc. Sarle, W.S and Kuo, An-Hsiang (1993), The MODECLUS Procedure, SAS Technical Report P-256, Cary, NC: SAS Institute Inc. Scott, A.J. and Symons, M.J. (1971), "Clustering Methods Based on Likelihood Ratio Criteria," Biometrics, 27, 387-397. Silverman, B.W. (1986), Density Estimation, New York: Chapman and Hall. Sneath, P.H.A. and Sokal, R.R. (1973), Numerical Taxonomy, San Francisco: W.H. Freeman. Spath, H. (1980), Cluster Analysis Algorithms, Chichester, England: Ellis Horwood. Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 35-43. Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: John Wiley & Sons, Inc. Tukey, P.A. and Tukey, J.W. (1981), "Data-Driven View Selection; Agglomeration and Sharpening,` in Barnett (1981). Ward, J.H. (1963), "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 58, 236-244. Wolfe, J.H. (1970), "Pattern Clustering by Multivariate Mixture Analysis," Multivariate Behavioral Research, 5, 329-350. Wolfe, J.H. (1978), "Comparative Cluster Analysis of Patterns of Vocational Interest," Multivariate Behavioral Research, 13, 33-44. Wong, M.A. (1982), "A Hybrid Clustering Method for Identifying High-Density Clusters," Journal of the American Statistical Association, 77, 841-847. Wong, M.A. and Lane, T. (1983), "A kth Nearest Neighbor Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368. Wong, M.A. and Schaack, C. (1982), "Using the kth Nearest Neighbor Clustering Procedure to Determine the Number of Subpopulations," American Statistical Association 1982 Proceedings of the Statistical Computing Section, 40-48. -- Warren S. Sarle SAS Institute Inc. The opinions expressed here saswss@unx.sas.com SAS Campus Drive are mine and not necessarily (919) 677-8000 Cary, NC 27513, USA those of SAS Institute. * * * * * * * * * * * * * * * * * * * * * * * * *
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • FAQ top.
  • Ulrich home page.
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html