FAQ - Chap. 8, data reduction: cluster, factor, data reduction
  • Cluster analysis texts
  • ======================Warren Sarle, 12 Jun 1996========ssm From: saswss@hotellng.unx.sas.com (Warren Sarle) Subject: Re: Cluster Analysis book? Message-ID: <DswqrH.K8E@unx.sas.com> In article <romano-1206961056250001@puma.fub.it>, romano@fub.it (Gianni Romano) writes: |> Can anybody recommend a recent book on cluster analysis? We seem to be getting an awful lot of requests for cluster references, so here's my bibliography again: Massart and Kaufman (1983) is the best elementary introduction to cluster analysis. Other important texts are Anderberg (1973), Sneath and Sokal (1973), Duran and Odell (1974), Hartigan (1975), Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), and Kaufman and Rousseeuw (1990). Hartigan (1975) and Spath (1980) give numerous FORTRAN programs for clustering. Any prospective user of cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985), and Cooper and Milligan (1984). Essential references on the statistical aspects of clustering include MacQueen (1967), Wolfe (1970), Scott and Symons (1971), Hartigan (1977; 1978; 1981; 1985), Binder (1978; 1981), Symons (1981), Wong and Schaack (1982), Wong and Lane (1983), Sarle (1983), Bock (1985), Banfield and Raftery (1993), and SAS Institute (1993). For fuzzy clustering, see Bezdek (1981) and Bezdek and Pal (1992). See Blashfield and Aldenderfer (1978) for a discussion of the fragmented state of the literature on cluster analysis. Avoid articles in the Journal of Marketing Research. There is a separate list of references at the end on nonparametric clustering methods, which define a cluster as a mode in the probability density function; these nonparametric methods have major advantages over all traditional methods. << see cluster FAQ for many additional references>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • Correspondence analysis
  • =====================Carl Ramm, 16 Jan 1996========ssc Message-ID: <STAT-L%96011608144960@VM1.MCGILL.CA> From: "Carl W. Ramm" <14762cwr@MSU.EDU> Jean Thioulouse <Jean.Thioulouse@BIOMSERV.UNIV-LYON1.FR> wrote: >Look at the following ref: >ter Braak, C.J.F. (1986). Canonical correspondence analysis: a new >eigenvector method for multivariate direct gradient analysis. >Ecology, 67, 1167-1179. See also: Palmer, M.W. 1993. Putting things in even better order: the advantages of canonical correspondence analysis. Ecology 74(8): 2215-2230. This paper follows the thread comparing and contrasting correspondence analysis and detrended correspondence analysis (Wartenberg et al. 1987, Peet et al. 1988 in the American Naturalist). Also see: Jongman, R.H.G., C.J.F. ter Braak, and O.F.R. van Tongeren. 1987. Data analysis in community and landscape ecology. Pudoc, Wageningen. 299 p. ===============Jean Thioulouse, 22 Jan 1996========sse From: Jean.Thioulouse@biomserv.univ-lyon1.fr (Jean Thioulouse) Message-ID: <Jean.Thioulouse-2201960921100001@macjt.univ-lyon1.fr> See chapter four in this book: Lebart, L., Morineau, A. and Warwick, K.M. (1984). Multivariate descriptive analysis: correspondence analysis and related techniques for large matrices. New York: John Wiley and Sons. And also this paper: Tenenhaus, M. and Young, F.W. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50, 91-119. -- Jean Thioulouse - Laboratoire de Biometrie - Universite Lyon 1 * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • ... find Procrustean Factor analysis?
  • =====================Alan Zonderman, 18 Apr 1996========(spss) From: abz@lpc.grc.nia.nih.gov (Alan Zonderman) Subject: Re: procrustes rotation in spss Message-ID: <abz-1704962337380001@dsvg30.grc.nia.nih.gov> ... you can go direct to the source: http://lpcwww.grc.nia.nih.gov/WWW/WWW/LPCpub/INPRESS/CFA.html Alan Zonderman abz@lpc.grc.nia.nih.gov Laboratory of Personality & Cognition, Nat'l Institute on Aging, NIH * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • How many clusters? (Sarle)
  • =====================Warren Sarle, 16 Mar 1996========ssc From: saswss@hotellng.unx.sas.com (Warren Sarle) Subject: Re: Number of clusters [long] Message-ID: <DoDqHz.nL@unx.sas.com> In article <4idhg2$aqb@epervier.CC.UMontreal.CA>, mcduffp@ERE.UMontreal.CA (Mcduff Pierre) writes: |> I looked into the archives of sci.stat.consult but couldn't find THE real |> answer to my problem. I'm using 5 measures of anger for both woman and |> man in 200 couples (5 measures X 2 sexes). I'm trying to do a cluster |> analysis but could not decide if i keep the 3 groups or the 4 groups |> solution. Adapted Mar 16, 1996, from the SAS/STAT User's Guide (1990) and Sarle and Kuo (1993). Copyright 1996 by SAS Institute Inc, Cary, NC, USA. The Number of Clusters ---------------------- There are no completely satisfactory methods for determining the number of population clusters for any type of cluster analysis (Everitt 1979, 1980; Hartigan 1985; Bock 1985). If your purpose in clustering is dissection, that is, to summarize the data without trying to uncover real clusters, it may suffice to look at R^2 for each variable and pooled over all variables. Plots of R^2 against the number of clusters are useful. It is always a good idea to look at your data graphically. If you have only two or three variables, use PROC PLOT to make scatterplots identifying the clusters. With more variables, use PROC CANDISC to compute canonical variables for plotting. Ordinary significance tests, such as analysis-of-variance F tests, are not valid for testing differences between clusters. Since clustering methods attempt to maximize the separation between clusters, the assumptions of the usual significance tests, parametric or nonparametric, are drastically violated.
For example, if you take a sample of 100 observations from a single univariate normal distribution, have PROC FASTCLUS divide it into two clusters, and run a t test between the clusters, you usually obtain a probability level of less than 0.0001. For the same reason, methods that purport to test for clusters against the null hypothesis that objects are assigned randomly to clusters (McClain and Rao 1975; Klastorin 1983) are useless. Most valid tests for clusters either have intractable sampling distributions or involve null hypotheses for which rejection is uninformative. .... Earlier note, Clusters. (Sarle). =====================Warren Sarle, 16 Jan 1996========ssc Message-ID: <DLArwM.JGD@unx.sas.com> From: Warren Sarle <saswss@UNX.SAS.COM> Subject: Re: clustering - cutting the dendogram |> On Fri, 5 Jan 1996, James Richard Soltys wrote: |> |> > I have read a few books about hierarchical clustering, but it |> > seems none of them explain a method for determining where to |> > cut the dendogram to get the best number of partitions. If any |> > one knows where I can get this answer, or knows the answer, |> > please let me know. |> |> Have you checked out: |> |> Mojena, R. (1977). Hierarchical grouping methods and stopping rules - |> an evaluation. _Computer_ _Journal_, 20, 359-363. Mojena is too out-of-date to be of any use. The best studies on this subject are: Cooper, M.C. and Milligan, G.W. (1988), "The Effect of Error on Determining the Number of Clusters," Proceedings of the International Workshop on Data Analysis, Decision Support and Expert Knowledge Representation in Marketing and Related Areas of Research, 319-328. Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures for Determining the Number of Clusters in a Data Set," Psychometrika, 50, 159-179. -- * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
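Sarle's warning about running a t test between clusters can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses plain Python rather than SAS, the function names (`best_two_cluster_split`, `pooled_t`, `demo`) are hypothetical, an exhaustive search for the best two-cluster split of univariate data stands in for PROC FASTCLUS, and the p-value uses a normal approximation instead of the t distribution.

```python
import math
import random
from statistics import NormalDist, mean, variance


def best_two_cluster_split(values):
    """Split univariate data into two clusters with minimum within-cluster SS.

    For one variable the optimal clusters are contiguous in sorted order,
    so trying every cut point finds the best two-cluster partition.
    """
    xs = sorted(values)
    best_wss, best_k = math.inf, None
    for k in range(2, len(xs) - 1):  # keep at least two points per cluster
        left, right = xs[:k], xs[k:]
        m1, m2 = mean(left), mean(right)
        wss = sum((x - m1) ** 2 for x in left) + sum((x - m2) ** 2 for x in right)
        if wss < best_wss:
            best_wss, best_k = wss, k
    return xs[:best_k], xs[best_k:]


def pooled_t(a, b):
    """Ordinary two-sample t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def demo(n=100, seed=0):
    """Cluster one normal sample into two groups, then (wrongly) t-test them."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one population, no real clusters
    low, high = best_two_cluster_split(sample)
    t = pooled_t(low, high)
    # Assumption: normal approximation to the two-sided p-value, used only to
    # avoid needing a t distribution; adequate for illustration with 98 df.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p, len(low), len(high)


if __name__ == "__main__":
    t, p, n_low, n_high = demo()
    print(f"cluster sizes {n_low} and {n_high}, t = {t:.2f}, approximate p = {p:.2g}")
```

The test looks overwhelmingly "significant" although the data contain no clusters at all, because the clustering step has already chosen the split that maximizes the separation the t test then measures.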
  • What is MDS (and refs)?
  • ======================Warren Sarle, 13 Jun 1996=========sse Message-ID: <DsyHss.MIE@unx.sas.com> Reply-To: saswss@unx.sas.com From: saswss@unx.sas.com (Warren Sarle) Subject: Re: Multidimensional scaling In article <Pine.SUN.3.91N2x.960613171856.4776A-100000@iplab02.nada.kth.se>, Bo Schenkman <bosch@nada.kth.se> writes: |> I would be glad to receive a recommendation for a good |> textbook (or instructive papers) on multidimensional scaling. Ideally, it |> should aimed at scientists and not specifically to professional |> statisticians. From Technical Report P-229, _SAS/STAT Software: Changes and Enhancements_, release 6.07.03: Multidimensional scaling (MDS) is a class of methods for estimating the coordinates of a set of objects in a space of specified dimensionality from data measuring the distances between pairs of objects. A variety of models can be used involving different ways of computing distances and various functions relating the distances to the actual data. The MDS procedure fits two- and three-way, metric and nonmetric multidimensional scaling models. ... The data for the MDS procedure consist of one or more square symmetric or asymmetric matrices of similarities or dissimilarities between _objects_ or _stimuli_ (Kruskal and Wish 1978, 7-11). Such data are also called _proximity_ data. ... In psychometric applications, each matrix typically corresponds to a _subject_, and models that fit different parameters for each subject are called _individual difference_ models. ... For an introduction to multidimensional scaling, see Kruskal and Wish (1978) and Arabie, Carroll, and DeSarbo (1987). A more advanced treatment is given by Young (1987). Many practical issues of data collection and analysis are discussed in Schiffman, Reynolds, and Young (1981). The fundamentals of psychological measurement, including both unidimensional and multidimensional scaling, are expounded by Torgerson (1958).
Nonlinear least-squares estimation of MDS models is discussed in Null and Sarle (1982). ... References: Arabie, P., Carroll, J.D., and DeSarbo, W.S. (1987), _Three-Way Scaling and Clustering_, Sage University Paper series on Quantitative Applications in the Social Sciences, 07-065. Beverly Hills and London: Sage Publications. Carroll, J.D. and Chang, J.J. (1970), "Analysis of Individual Differences in Multidimensional Scaling via an N-way Generalization of the 'Eckart-Young' Decomposition," _Psychometrika_, 35, 283-319. Davison, M.L. (1983), _Multidimensional Scaling_, New York: John Wiley & Sons. Heiser, W.J. (1981), _Unfolding Analysis of Proximity Data_, Leiden: Department of Datatheory, University of Leiden. Jacobowitz, D. (1975), _The Acquisition of Semantic Structures_, Doctoral dissertation, University of North Carolina at Chapel Hill. Krantz, D.H., Luce, R.D., Suppes, P., and Tversky, A. (1971), _Foundations of Measurement_, New York: Academic Press. Kruskal, J.B. and Wish, M. (1978), _Multidimensional Scaling_, Sage University Paper series on Quantitative Applications in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications. Null, C.H. and Sarle, W.S. (1982), "Multidimensional Scaling by Least Squares" in SAS Institute Inc., _Proceedings of the Seventh Annual SAS Users Group International Conference_, Cary, NC: SAS Institute Inc. Rabinowitz, G. (1976), "A Procedure for Ordering Object Pairs Consistent with the Multidimensional Unfolding Model," _Psychometrika_, 41, 349-373. Ramsay, J.O. (1986), "The MLSCALE Procedure" in SAS Institute Inc., _SUGI Supplemental Library User's Guide, Version 5 Edition_, Cary, NC: SAS Institute Inc. Schiffman, S.S., Reynolds, M.L., and Young, F.W. (1981), _Introduction to Multidimensional Scaling_, New York: Academic Press. Torgerson, W.S. (1958) _Theory and Methods of Scaling_, New York: John Wiley & Sons. Young, F.W. 
(1982), "Enhancements in ALSCAL-82" in SAS Institute Inc., _Proceedings of the Seventh Annual SAS Users Group International Conference_, Cary, NC: SAS Institute Inc. Young, F.W. (1987), _Multidimensional Scaling: History, Theory, and Applications_, edited by R.M. Hamer. Hillsdale, NJ: Lawrence Erlbaum Associates. Young, F.W., Lewyckyj, R., and Takane, Y. (1986), "The ALSCAL Procedure" in SAS Institute Inc., _SUGI Supplemental Library User's Guide, Version 5 Edition_, Cary, NC: SAS Institute Inc. * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
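The basic idea in the description above, estimating coordinates from a matrix of pairwise distances, can be illustrated with classical (Torgerson) metric scaling. This is a minimal sketch under stated assumptions, not the algorithm used by the SAS MDS procedure: it handles a single symmetric matrix of Euclidean-like distances, the function names (`pairwise_distances`, `classical_mds`) are hypothetical, and it extracts the leading eigenvectors by simple power iteration with deflation to stay within the Python standard library.

```python
import math


def pairwise_distances(points):
    """Square symmetric matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]


def classical_mds(dist, dims=2, iters=500):
    """Coordinates in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double centering turns squared distances into a matrix of scalar products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic, arbitrary start vector for the power iteration.
        v = [math.sin(i + d + 1.0) for i in range(n)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient = eigenvalue
        if lam <= 1e-12:
            break
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next-largest eigenvalue.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


if __name__ == "__main__":
    corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
    for row in classical_mds(pairwise_distances(corners)):
        print([round(x, 3) for x in row])
```

The recovered configuration reproduces the input distances but is determined only up to rotation, reflection, and translation, which is why MDS solutions are usually interpreted by the relative positions of the objects rather than by the axes themselves.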
  • ... choose number of factors?
  • ======================Yvonnick Noel, 05 Jun 1996========sse Message-ID: <31B63763.228@reseau.galeode.fr> From: Yvonnick Noel <yvonnick.noel@reseau.galeode.fr> Kristian Sundström (mg24650@gaia.swipnet.se) wrote: > Questions: > 1. When the following three criteria give different results concerning the > number of factors you should contain in your model, which one are supposed > to trust the most? 1)Eigenvalue>1 2)75% explanation of variance 3)scree > plot As far as I know, recent simulation studies seem to have shown Horn's parallel analysis to be the best method for determining the "right" number of components. See the following references. Horn's original paper : --------------------- Horn J.L. (1965) A Rationale and Test for the Number of Factors in Factor Analysis, Psychometrika, 30, 179-185. Recent developments and simulations (among others) : -------------------------------------------------- Zwick W.R., Velicer W.F. (1986) Comparison of Five Rules for Determining the Number of Components to Retain, Psychological Bulletin, 99(3), 432-442. Cota A.A., Longman R.S., Holden R.R., Fekken G.C. (1993) Comparing Different Methods for Implementing Parallel Analysis: A Practical Index of Accuracy, Educational and Psychological Measurement, 53, 865-876. Cota A.A., Longman R.S., Holden R.R., Fekken G.C., Xinaris S. (1993) Interpolating 95th Percentile Eigenvalues from Random Data: An Empirical Example, Educational and Psychological Measurement, 53, 585-596. Glorfeld L.W. (1995) An Improvement on Horn's Parallel Analysis Methodology For Selecting the Correct Number of Factors to Retain, Educational and Psychological Measurement, 55(3), 377-393. Hope this helps. Yvonnick. * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
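For readers who have not met parallel analysis, the idea is to keep only those components whose eigenvalues exceed what uncorrelated random data of the same size would produce. The sketch below is a rough illustration under stated assumptions, not code from any of the cited papers: all function names are hypothetical, the reference values are mean eigenvalues over simulated normal data (a 95th percentile could be substituted, as some of the later papers discuss), eigenvalues come from a small Jacobi routine so that only the Python standard library is needed, and the demonstration data are simulated with two underlying factors.

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    cent = [[x - sum(c) / n for x in c] for c in cols]
    norms = [math.sqrt(sum(x * x for x in c)) for c in cent]
    return [[sum(x * y for x, y in zip(cent[i], cent[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(rows, reps=100, seed=0):
    """Return (number of components to keep, observed eigenvalues, reference eigenvalues)."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    observed = symmetric_eigenvalues(correlation_matrix(rows))
    sums = [0.0] * p
    for _ in range(reps):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        for i, ev in enumerate(symmetric_eigenvalues(correlation_matrix(noise))):
            sums[i] += ev
    reference = [s / reps for s in sums]
    keep = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:  # stop at the first component that random data can match
            break
        keep += 1
    return keep, observed, reference


def two_factor_data(n=300, seed=1):
    """Simulated data: six variables, three loading on each of two factors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        f1, f2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append([f1 + rng.gauss(0.0, 1.0) for _ in range(3)]
                    + [f2 + rng.gauss(0.0, 1.0) for _ in range(3)])
    return rows


if __name__ == "__main__":
    keep, observed, reference = parallel_analysis(two_factor_data())
    print("components to retain:", keep)
```

Unlike the eigenvalue-greater-than-1 rule, the cutoff here depends on the sample size and the number of variables, since both determine how large eigenvalues of pure noise tend to be.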
  • Nonlinear canonical corr, etc.
  • =====================Gaston Hilkhuysen, 21 Mar 1996========ssc Message-ID: <33CC1D360A6@otb.tudelft.nl> From: Gaston Hilkhuysen <hilkhuysen@OTB.TUDELFT.NL> Subject: Non-linear Canonical Analysis -- refs wanted Mike Richman wrote: >Can anyone point me to any literature on a non-linear equivalent to >canonical correlation analysis [an oxymoron, I realize]? We have a >problem in which we are looking to use either a fully non-linear >approach or a piecewise linear fit (fitting positive and negative values independently in a single analysis). > >Thanks in advance, > Gifi (1990) discusses Non-linear Canonical Correlation Analysis in chapter 6. The department of Data-theory at Leiden University in the Netherlands has a Fortran program, CANALS that does the trick. The program comes with a manual that explains the procedure and gives some examples. Gifi, A., (1990), Non-linear Multivariate Analysis, Chichester(GB): Wiley & Sons. van der Burg, E., (1983), CANALS, Leiden:DSWO-press. -Gaston. * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html