FAQ - Chap. 8, data reduction
******************* cluster, factor, data reduction *******
Cluster analysis texts
======================Warren Sarle, 12 Jun 1996========ssm
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Cluster Analysis book?
Message-ID:
In article , romano@fub.it (Gianni Romano) writes:
|> Can anybody recommend a recent book on cluster analysis?
We seem to be getting an awful lot of requests for cluster references,
so here's my bibliography again:
Massart and Kaufman (1983) is the best elementary introduction to
cluster analysis. Other important texts are Anderberg (1973), Sneath
and Sokal (1973), Duran and Odell (1974), Hartigan (1975), Titterington,
Smith, and Makov (1985), McLachlan and Basford (1988), and Kaufman and
Rousseeuw (1990). Hartigan (1975) and Spath (1980) give numerous
FORTRAN programs for clustering. Any prospective user of cluster
analysis should study the Monte Carlo results of Milligan (1980),
Milligan and Cooper (1985), and Cooper and Milligan (1984). Essential
references on the statistical aspects of clustering include MacQueen
(1967), Wolfe (1970), Scott and Symons (1971), Hartigan (1977; 1978;
1981; 1985), Binder (1978; 1981), Symons (1981), Wong and Schaack
(1982), Wong and Lane (1983), Sarle (1983), Bock (1985), Banfield and
Raftery (1993), and SAS Institute (1993). For fuzzy clustering, see
Bezdek (1981) and Bezdek and Pal (1992). See Blashfield and Aldenderfer
(1978) for a discussion of the fragmented state of the literature on
cluster analysis. Avoid articles in the Journal of Marketing Research.
There is a separate list of references at the end on nonparametric
clustering methods, which define a cluster as a mode in the probability
density function; these nonparametric methods have major advantages over
all traditional methods.
<< see cluster FAQ for many additional references>>
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Correspondence analysis
=====================Carl Ramm, 16 Jan 1996========ssc
Message-ID:
From: "Carl W. Ramm" <14762cwr@MSU.EDU>
Jean Thioulouse wrote:
>Look at the following ref:
>ter Braak, C.J.F. (1986). Canonical correspondence analysis: a new
>eigenvector method for multivariate direct gradient analysis.
>Ecology, 67, 1167-1179.
See also:
Palmer, M.W. Putting things in even better order: the advantages
of canonical correspondence analysis. Ecology 74(8): 2215-2230.
This paper follows the thread comparing and contrasting correspondence
analysis and detrended correspondence analysis (Wartenberg et al. 1987,
Peet et al. 1988 in the American Naturalist).
Also see:
Jongman, R.H.G., C.J.F. ter Braak, and O.F.R. van Tongeren. 1987. Data
analysis in community and landscape ecology. Pudoc, Wageningen. 299 p.
===============Jean Thioulouse, 22 Jan 1996========sse
From: Jean.Thioulouse@biomserv.univ-lyon1.fr (Jean Thioulouse)
Message-ID:
See chapter four in this book:
Lebart, L., Morineau, L. and Warwick, K.M. (1984). Multivariate descriptive
analysis: correspondence analysis and related techniques for large matrices.
New York: John Wiley and Sons.
And also this paper:
Tenenhaus, M. and Young, F.W. (1985). An analysis and synthesis of multiple
correspondence analysis, optimal scaling, dual scaling, homogeneity analysis
and other methods for quantifying categorical multivariate data.
Psychometrika 50, 91-119.
--
Jean Thioulouse - Laboratoire de Biometrie - Universite Lyon 1
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
... find Procrustean Factor analysis?
=====================Alan Zonderman, 18 Apr 1996========(spss)
From: abz@lpc.grc.nia.nih.gov (Alan Zonderman)
Subject: Re: procrustes rotation in spss
Message-ID:
... you can go direct to the source:
http://lpcwww.grc.nia.nih.gov/WWW/WWW/LPCpub/INPRESS/CFA.html
Alan Zonderman abz@lpc.grc.nia.nih.gov
Laboratory of Personality & Cognition, Nat'l Institute on Aging, NIH
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
How many clusters? (Sarle)
=====================Warren Sarle, 16 Mar 1996========ssc
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Number of clusters [long]
Message-ID:
In article <4idhg2$aqb@epervier.CC.UMontreal.CA>,
mcduffp@ERE.UMontreal.CA (Mcduff Pierre) writes:
|> I looked into the archives of sci.stat.consult but couldn't find THE real
|> answer to my problem. I'm using 5 measures of anger for both woman and
|> man in 200 couples (5 measures X 2 sexes). I'm trying to do a cluster
|> analysis but could not decide if i keep the 3 groups or the 4 groups
|> solution.
Adapted Mar 16, 1996, from the SAS/STAT User's Guide (1990) and Sarle
and Kuo (1993). Copyright 1996 by SAS Institute Inc, Cary, NC, USA.
The Number of Clusters
----------------------
There are no completely satisfactory methods for determining the number
of population clusters for any type of cluster analysis (Everitt 1979,
1980; Hartigan 1985; Bock 1985).
If your purpose in clustering is dissection, that is, to summarize the
data without trying to uncover real clusters, it may suffice to look at
R^2 for each variable and pooled over all variables. Plots of R^2
against the number of clusters are useful.
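A minimal sketch of the R^2 computation described above, in Python
rather than SAS. The function names and the toy data are hypothetical.
R^2 is taken here as 1 minus the within-cluster sum of squares divided
by the total sum of squares, and the pooled value as the same ratio with
both sums accumulated over all variables; that definition is an
assumption, not something stated in the note.

```python
from statistics import fmean

def sums_of_squares(values, labels):
    """Return (within-cluster SS, total SS) for one variable."""
    grand = fmean(values)
    total = sum((v - grand) ** 2 for v in values)
    within = 0.0
    for c in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == c]
        centre = fmean(members)
        within += sum((v - centre) ** 2 for v in members)
    return within, total

def r_squared(columns, labels):
    """Per-variable R^2 list and pooled R^2 for one clustering."""
    parts = [sums_of_squares(col, labels) for col in columns]
    per_var = [1 - w / t for w, t in parts]
    pooled = 1 - sum(w for w, _ in parts) / sum(t for _, t in parts)
    return per_var, pooled

# Hypothetical toy data: two variables, six observations, two clusters.
x = [1, 2, 3, 11, 12, 13]      # separates the clusters well
y = [5, 4, 6, 5, 6, 4]         # carries no cluster information
labels = [0, 0, 0, 1, 1, 1]
per_var, pooled = r_squared([x, y], labels)
print(per_var, pooled)
```

Repeating the call for solutions with 2, 3, 4, ... clusters and plotting
the pooled value against the number of clusters gives the kind of plot
mentioned above.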
It is always a good idea to look at your data graphically. If you have
only two or three variables, use PROC PLOT to make scatterplots
identifying the clusters. With more variables, use PROC CANDISC to
compute canonical variables for plotting.
Ordinary significance tests, such as analysis-of-variance F tests, are
not valid for testing differences between clusters. Since clustering
methods attempt to maximize the separation between clusters, the
assumptions of the usual significance tests, parametric or
nonparametric, are drastically violated. For example, if you take a
sample of 100 observations from a single univariate normal distribution,
have PROC FASTCLUS divide it into two clusters, and run a t test between
the clusters, you usually obtain a probability level of less than
0.0001. For the same reason, methods that purport to test for clusters
against the null hypothesis that objects are assigned randomly to
clusters (McClain and Rao 1975; Klastorin 1983) are useless.
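The t-test example above can be reproduced with a small simulation. This
is a sketch under stated assumptions: a plain two-cluster k-means on one
variable stands in for PROC FASTCLUS, the function names are
hypothetical, and only the pooled t statistic is computed. With 98
degrees of freedom, a |t| above roughly 4.1 corresponds to a two-sided
probability level below 0.0001.

```python
import math
import random
from statistics import fmean, variance

def two_means_1d(xs, max_iter=100):
    """Split one variable into two clusters with Lloyd's k-means (k=2)."""
    lo, hi = min(xs), max(xs)          # initial centres: the two extremes
    for _ in range(max_iter):
        cut = (lo + hi) / 2            # nearest-centre boundary in one dimension
        left = [x for x in xs if x <= cut]
        right = [x for x in xs if x > cut]
        new_lo, new_hi = fmean(left), fmean(right)
        if (new_lo, new_hi) == (lo, hi):
            break                      # centres stopped moving
        lo, hi = new_lo, new_hi
    return left, right

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (b minus a)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(b) - fmean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def simulate(reps=200, n=100, seed=1996):
    """Cluster samples from ONE normal population, then t-test the clusters."""
    rng = random.Random(seed)
    ts = []
    for _ in range(reps):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        ts.append(pooled_t(*two_means_1d(sample)))
    return ts

t_values = simulate()
share = sum(t > 4.1 for t in t_values) / len(t_values)
print("smallest t:", min(t_values), " share with t > 4.1:", share)
```

Every sample comes from a single population, so there are no real
clusters to find; the large t values are produced by the clustering
step itself.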
Most valid tests for clusters either have intractable sampling
distributions or involve null hypotheses for which rejection is
uninformative. ....
Earlier note, Clusters. (Sarle).
=====================Warren Sarle, 16 Jan 1996========ssc
Message-ID:
From: Warren Sarle
Subject: Re: clustering - cutting the dendrogram
|> On Fri, 5 Jan 1996, James Richard Soltys wrote:
|>
|> > I have read a few books about hierarchical clustering, but it
|> > seems none of them explain a method for determining where to
|> > cut the dendrogram to get the best number of partitions. If any
|> > one knows where I can get this answer, or knows the answer,
|> > please let me know.
|>
|> Have you checked out:
|>
|> Mojena, R. (1977). Hierarchical grouping methods and stopping rules -
|> an evaluation. _Computer_ _Journal_, 20, 359-363.
Mojena is too out-of-date to be of any use. The best studies on this
subject are:
Cooper, M.C. and Milligan, G.W. (1988), "The Effect of
Error on Determining the Number of Clusters," Proceedings
of the International Workshop on Data Analysis, Decision
Support and Expert Knowledge Representation in Marketing and
Related Areas of Research, 319-328.
Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures
for Determining the Number of Clusters in a Data Set,"
Psychometrika, 50, 159-179.
--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
What is MDS (and refs)?
======================Warren Sarle, 13 Jun 1996=========sse
Message-ID:
Reply-To: saswss@unx.sas.com
From: saswss@unx.sas.com (Warren Sarle)
Subject: Re: Multidimensional scaling
In article , Bo
Schenkman writes:
|> I would be glad to receive a recommendation for a good
|> textbook (or instructive papers) on multidimensional scaling. Ideally, it
|> should be aimed at scientists and not specifically to professional
|> statisticians.
From Technical Report P-229, _SAS/STAT Software: Changes and Enhancements_,
release 6.07.03:
Multidimensional scaling (MDS) is a class of methods for estimating the
coordinates of a set of objects in a space of specified dimensionality
from data measuring the distances between pairs of objects. A variety
of models can be used involving different ways of computing distances
and various functions relating the distances to the actual data. The
MDS procedure fits two- and three-way, metric and nonmetric
multidimensional scaling models. ...
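To make the definition concrete, here is a minimal sketch of classical
(Torgerson) metric scaling, the simplest method in this class. It is not
the algorithm of the SAS MDS procedure. The function name is
hypothetical, the input is assumed to be a symmetric matrix of
distances, and a plain power iteration with deflation is used only to
keep the example free of external libraries.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Estimate coordinates from a symmetric distance matrix (classical scaling)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to get a scalar-product matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iters):         # power iteration for the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        for i in range(n):
            coords[i][k] = math.sqrt(max(lam, 0.0)) * v[i]
        for i in range(n):             # deflate so the next pass finds the next axis
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Hypothetical example: four corners of a 3-by-4 rectangle.
points = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
for row in coords:
    print([round(c, 3) for c in row])
```

The recovered coordinates match the original points only up to
rotation, reflection, and translation, since distances carry no
information about orientation or origin.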
The data for the MDS procedure consist of one or more square symmetric
or asymmetric matrices of similarities or dissimilarities between
_objects_ or _stimuli_ (Kruskal and Wish 1978, 7-11). Such data are
also called _proximity_ data. ...
In psychometric applications, each matrix typically corresponds
to a _subject_, and models that fit different parameters for
each subject are called _individual difference_ models. ...
For an introduction to multidimensional scaling, see Kruskal and Wish
(1978) and Arabie, Carroll, and DeSarbo (1987). A more advanced
treatment is given by Young (1987). Many practical issues of data
collection and analysis are discussed in Schiffman, Reynolds, and Young
(1981). The fundamentals of psychological measurement, including both
unidimensional and multidimensional scaling, are expounded by Torgerson
(1958). Nonlinear least-squares estimation of MDS models is discussed
in Null and Sarle (1982). ...
References:
Arabie, P., Carroll, J.D., and DeSarbo, W.S. (1987), _Three-Way
Scaling and Clustering_, Sage University Paper series on Quantitative
Applications in the Social Sciences, 07-065. Beverly Hills and London:
Sage Publications.
Carroll, J.D. and Chang, J.J. (1970), "Analysis of Individual Differences in
Multidimensional Scaling via an N-way Generalization of the 'Eckart-Young'
Decomposition," _Psychometrika_, 35, 283-319.
Davison, M.L. (1983), _Multidimensional Scaling_, New York: John Wiley & Sons.
Heiser, W.J. (1981), _Unfolding Analysis of Proximity Data_, Leiden:
Department of Datatheory, University of Leiden.
Jacobowitz, D. (1975), _The Acquisition of Semantic Structures_, Doctoral
dissertation, University of North Carolina at Chapel Hill.
Krantz, D.H., Luce, R.D., Suppes, P., and Tversky, A. (1971),
_Foundations of Measurement_, New York: Academic Press.
Kruskal, J.B. and Wish, M. (1978), _Multidimensional Scaling_,
Sage University Paper series on Quantitative Applications in the Social
Sciences, 07-011. Beverly Hills and London: Sage Publications.
Null, C.H. and Sarle, W.S. (1982), "Multidimensional Scaling by Least
Squares" in SAS Institute Inc., _Proceedings of the Seventh Annual SAS
Users Group International Conference_, Cary, NC: SAS Institute Inc.
Rabinowitz, G. (1976), "A Procedure for Ordering Object Pairs Consistent with
the Multidimensional Unfolding Model," _Psychometrika_, 41, 349-373.
Ramsay, J.O. (1986), "The MLSCALE Procedure"
in SAS Institute Inc., _SUGI Supplemental Library User's Guide, Version 5
Edition_, Cary, NC: SAS Institute Inc.
Schiffman, S.S., Reynolds, M.L., and Young, F.W. (1981), _Introduction
to Multidimensional Scaling_, New York: Academic Press.
Torgerson, W.S. (1958), _Theory and Methods of Scaling_, New York: John Wiley &
Sons.
Young, F.W. (1982), "Enhancements in ALSCAL-82" in SAS Institute Inc.,
_Proceedings of the Seventh Annual SAS Users Group International
Conference_, Cary, NC: SAS Institute Inc.
Young, F.W. (1987), _Multidimensional Scaling: History, Theory, and
Applications_, edited by R.M. Hamer. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Young, F.W., Lewyckyj, R., and Takane, Y. (1986), "The ALSCAL Procedure"
in SAS Institute Inc., _SUGI Supplemental Library User's Guide, Version 5
Edition_, Cary, NC: SAS Institute Inc.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
... choose number of factors?
======================Yvonnick Noel, 05 Jun 1996========sse
Message-ID: <31B63763.228@reseau.galeode.fr>
From: Yvonnick Noel
Kristian Sundström (mg24650@gaia.swipnet.se) wrote:
> Questions:
> 1. When the following three criteria give different results concerning the
> number of factors you should contain in your model, which one are supposed
> to trust the most? 1)Eigenvalue>1 2)75% explanation of variance 3)scree
> plot
As far as I know, recent simulation studies seem to have shown Horn's
parallel analysis to be the best method for determining the "right" number of
components. See the following references.
Horn's original paper :
---------------------
Horn J.L. (1965) A Rationale and Test for the Number of Factors in Factor
Analysis, Psychometrika, 30, 179-185.
Recent developments and simulations (among others) :
--------------------------------------------------
Zwick W.R., Velicer W.F. (1986) Comparison of Five Rules for Determining the
Number of Components to Retain, Psychological Bulletin, 99(3), 432-442.
Cota A.A., Longman R.S., Holden R.R., Fekken G.C. (1993) Comparing Different
Methods for Implementing Parallel Analysis: A Practical Index of Accuracy,
Educational and Psychological Measurement, 53, 865-876.
Cota A.A., Longman R.S., Holden R.R., Fekken G.C., Xinaris S. (1993)
Interpolating 95th Percentile Eigenvalues from Random Data: An Empirical
Example, Educational and Psychological Measurement, 53, 585-596.
Glorfeld L.W. (1995) An Improvement on Horn's Parallel Analysis Methodology
For Selecting the Correct Number of Factors to Retain, Educational and
Psychological Measurement, 55(3), 377-393.
Hope this helps.
Yvonnick.
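A minimal sketch of parallel analysis, for readers who have not met the
method: eigenvalues of the observed correlation matrix are compared with
eigenvalues obtained from random normal data of the same size, and
components are retained only while the observed eigenvalue exceeds the
random reference. All function names are hypothetical, the simulated
data set is made up, and the 95th-percentile reference is one variant
(Horn's 1965 paper used the average of the random eigenvalues). A small
Jacobi routine supplies the eigenvalues so that nothing beyond the
standard library is needed.

```python
import math
import random

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of a list-of-rows table."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    dev = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(d[i] * d[j] for d in dev) for j in range(p)] for i in range(p)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]

def sym_eigenvalues(a, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, descending."""
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1 / math.sqrt(t * t + 1)
                s = t * c
                for k in range(n):     # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):     # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def parallel_analysis(data, n_sims=50, percentile=0.95, seed=0):
    """Return (number retained, observed eigenvalues, random reference values)."""
    n, p = len(data), len(data[0])
    observed = sym_eigenvalues(correlation_matrix(data))
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        noise = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        sims.append(sym_eigenvalues(correlation_matrix(noise)))
    idx = min(n_sims - 1, int(percentile * n_sims))   # crude percentile index
    reference = [sorted(s[k] for s in sims)[idx] for k in range(p)]
    keep = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:
            break
        keep += 1
    return keep, observed, reference

# Hypothetical data: 200 cases, 6 variables built from 2 underlying factors.
rng = random.Random(1)
data = []
for _ in range(200):
    f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append([0.8 * f + 0.6 * rng.gauss(0, 1) for f in (f1, f1, f1, f2, f2, f2)])
keep, observed, reference = parallel_analysis(data)
print("retain", keep, "components")
```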
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Nonlinear canonical corr, etc.
=====================Gaston Hilkhuysen, 21 Mar 1996========ssc
Message-ID: <33CC1D360A6@otb.tudelft.nl>
From: Gaston Hilkhuysen
Subject: Non-linear Canonical Analysis -- refs wanted
Mike Richman wrote:
>Can anyone point me to any literature on a non-linear equivalent to
>canonical correlation analysis [an oxymoron, I realize]? We have a
>problem in which we are looking to use either a fully non-linear
>approach or a piecewise linear fit (fitting positive and negative
>values independently in a single analysis).
>
>Thanks in advance,
>
Gifi (1990) discusses Non-linear Canonical Correlation Analysis in
chapter 6. The department of Data-theory at Leiden University in the
Netherlands has a Fortran program, CANALS, that does the trick. The
program comes with a manual that explains the procedure and gives
some examples.
Gifi, A., (1990), Non-linear Multivariate Analysis, Chichester(GB):
Wiley & Sons.
van der Burg, E., (1983), CANALS, Leiden:DSWO-press.
-Gaston.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html