FAQ - Chap. 10, other analyses. MISC.
******************* CHAID, CART, SEM, Neural nets**********
CHAID, CART
===========================Rob Hughes, 09 Jun 1996======ssc
From: rhughesmd@aol.com (RHughesMD)
Subject: Re: CART
Message-ID: <4pev5i$qqi@newsbf02.news.aol.com>
KnowledgeSeeker from Angoss does CART & CHAID, SPlus has CART, SPSS sells
CHAID, check out http://info.gte.com/%7Ekdd/ for a list of other
classification packages.
=============================Tony Dusoir, 10 Jun 1996=======ssc
Message-ID: <009A3A33.8F96C87C.3@ujvax.ulst.ac.uk>
From: AE.Dusoir@ULST.AC.UK
SC (Statistical Calculator) does CART. It allows you to use just
about any definable function to do the splitting (you can use
exact tests, for example), and includes options for multiplicity-
correction. It produces postscript tree-diagrams, among other things.
Details of SC are on the web site below, or e-mail me.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
What is CHAID?
======================Paul Thompson, 3 Oct 1995=========sse
Message-ID: <199510031555.LAA04187@owl.INS.CWRU.Edu>
From: pat@po.CWRU.Edu (Paul A. Thompson)
Subject: Re: CHAID
Reply to message from bgunter@pluto.njcc.com of Tue, 03 Oct
>
>maugx@csv.warwick.ac.uk (Gavin A Jordan) wrote:
>
>> Does anyone have any information about CHAID. It may be some kind of
>
>Just wanted to note for public consumption that CHAID (the last 3
>letters stand for "automatic interaction detector", I believe; first
>two?) is SPSS's implementation and evolution of something known as
I have a serious problem with this. In fact, it makes a serious confusion
between an ALGORITHM and an IMPLEMENTATION. SPSS has an IMPLEMENTATION of
the CHAID (CHi-squared Automatic Interaction Detection) ALGORITHM. The
ALGORITHM is what is called CHAID. Now, in this case, SPSS has used the
name of the ALGORITHM for their IMPLEMENTATION. I dislike that, but I
can't do anything about it. The name CHAID and the associated ALGORITHM are
in the public domain, because they were published before SPSS's use of them.
Having worked with CHAID software myself, and done some consulting on it, I
believe that I am in an informed position regarding it. Essentially, what
goes on in CHAID is that you 1) have a dependent classification (e.g.,
voting for Clinton), and 2) a bunch of other variables (e.g., race, sex,
SES, etc.); 3) the algorithm finds splits in the population which are as
different as possible. It usually works stepwise, and due to the problem
of multiplicity, goes forward. You find the most diverse split, and then
work each of the resulting groups to find more diverse splits. You stop when
splits no longer are significant; i.e., that group is homogeneous with
respect to variables not yet used.
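A minimal sketch of the split-finding step just described, not any
vendor's implementation. It scores each candidate predictor by the Pearson
chi-squared statistic of its cross-tabulation with the dependent
classification and takes the largest. CHAID proper also merges predictor
categories and compares multiplicity-adjusted p-values rather than raw
statistics; both are omitted here, so the comparison is only fair because
every predictor in this made-up data has two categories. The names
chi_squared and best_split and the toy records are assumptions for
illustration.

```python
from collections import Counter

def chi_squared(xs, ys):
    """Pearson chi-squared statistic for the cross-tabulation of xs by ys."""
    n = len(xs)
    cell = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = cell.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def best_split(records, target, predictors):
    """Return (predictor, statistic) with the largest chi-squared against target."""
    ys = [rec[target] for rec in records]
    scores = {p: chi_squared([rec[p] for rec in records], ys) for p in predictors}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy data: vote depends strongly on "urban" and not on "sex".
records = (
    [{"urban": "y", "sex": "f", "vote": "A"}] * 30
    + [{"urban": "y", "sex": "m", "vote": "A"}] * 30
    + [{"urban": "y", "sex": "f", "vote": "B"}] * 10
    + [{"urban": "y", "sex": "m", "vote": "B"}] * 10
    + [{"urban": "n", "sex": "f", "vote": "A"}] * 10
    + [{"urban": "n", "sex": "m", "vote": "A"}] * 10
    + [{"urban": "n", "sex": "f", "vote": "B"}] * 30
    + [{"urban": "n", "sex": "m", "vote": "B"}] * 30
)
name, stat = best_split(records, "vote", ["urban", "sex"])
```

A full tree would apply best_split again within each resulting group,
using only the predictors not yet used, and stop when no split is
significant.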
CHAID is a stepwise method. I would be real cautious about using it
when your population is small; i.e., less than 10,000. When using it,
I would do a 2/3 - 1/3 split, and replicate the exploration with the
confirmatory group.
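One way to make the 2/3 - 1/3 split, as a sketch: shuffle the cases once
with a fixed seed and cut the shuffled list. The function name, the seed,
and the stand-in case numbers are assumptions for illustration.

```python
import random

def split_two_thirds(records, seed=0):
    """Shuffle a copy of records; return (exploratory, confirmatory) parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Stand-in for 30,000 cases; in practice these would be the data records.
explore, confirm = split_two_thirds(range(30000), seed=1996)
```

The tree is grown on the exploratory part only, and the splits it found
are then checked against the confirmatory part.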
>software and also in the S-Plus statistical software package. As with
>all such approaches, there are situations where it is quite useful and
>others where it is not. Some background knowledge is therefore
>required for effective application.
>
I agree with this.
--
Paul Thompson, Ph.D. |
Department of Psychiatry |
Case Western Reserve Univ|
Cleveland, OH 44106 |
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
SEM features of stat packs?
====================James H. Steiger, 09 Jun 1996========ssc
From: steiger@unixg.ubc.ca (James H. Steiger)
Subject: Re: Statistica any good?
Message-ID: <31bb2490.12604793@news.ucs.ubc.ca>
Diana Kornbrot wrote:
>i also find statistica's advertising offensive.
>
>however i might change (to support their software engineers rather than
>their marketing folk) if i knew of any particular analysis that can be
>done in statistica that CANNOT be done in SPSS or SAS or S-PLUS
>specific egs are:
>1. ANCOVA that gives separate slopes (and intercepts) for each
> level of categorical variables (not obvious in SPSS)
>2. confirmatory factor analysis
>3. robust inferential statistics to go with robust descriptives that
> are in most packages
>4. easily available power calculations
>5. easily specifiable multi-level models
>initially my dept had statistica because it calculates Cronbach's alpha
>which is very useful for psychologists devising psychometric tests.
>however SPSS now has this.
>
I'm moderately knowledgeable in the field of Structural Equation
Modeling and factor analysis, Diana, so, to clarify to the best of my
knowledge...
a. Base SAS cannot do confirmatory factor analysis.
However, the Advanced Stats part of SAS has CALIS, a module that can
do confirmatory factor analysis. You have to pay an additional fee to
get it, and it only works on a single sample. CALIS was written by
Wolfgang Hartmann.
b. SPSS base and advanced offers no confirmatory factor analysis. SPSS
recently announced it will sell AMOS (price not yet established) as an
add-on. AMOS was written by James Arbuckle of Temple University. It
features a very nice, user-friendly graphical interface for doing
SEM and confirmatory factor analysis. SPSS was offering LISREL as an
add-on at very substantial cost, but is discontinuing it. I am
assuming AMOS will be priced as an add-on.
c. Systat 6 for DOS offers RAMONA, which can do confirmatory factor
analysis and SEM on single samples. RAMONA is part of the base Systat
package, and does not cost you anything extra. RAMONA offers a
user-friendly interface, and was written by Michael Browne, a widely
known expert in the field of structural modeling, and Gerhard Mels, a
student.
d. Statistica for Windows includes SEPATH, my module, which does
confirmatory factor analysis and structural equation modeling on one
or more groups (up to a maximum of 10 independent groups). SEPATH is
part of the base Statistica package, and does not cost anything extra.
It features an extremely fast user interface, and was reviewed in a
recent issue of Structural Equation Modeling, a journal published by
Erlbaum.
So, in the realm of confirmatory factor analysis and structural
equation modeling, Statistica offers the most standard capability,
Systat offers excellent capability for single samples, SAS offers
excellent capability at an additional price. SPSS will offer an option
sometime soon, at an additional price.
Here is the summary table, which seems to indicate that
Statistica has an advantage in this domain of confirmatory
factor analysis and structural equation modeling.
Summary - SEM and Confirmatory Factor Analysis
--------------------------------------------------------------------
                              SAS   SPSS   Systat   Statistica
Is Module Available?           X     a       X          X
Free w. Base Package?                        X          X
Multiple Sample Capability?          X                  X

a - available later this year      X - available now
--------------------------------------------------------------------
I apologize in advance if any aspect of this is not correct, but I
believe the information is correct and up-to-date. Nonetheless, I am
donning my flame-retardant clothing.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Neural nets - SPSS claims? SAS varieties?
weight decay? Sarle.
=====================Warren Sarle, 07 Oct 1995========ssc
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: shareware packages + neural nets
Message-ID:
In article <453r80$7gp@netsrv2.spss.com>,
nichols@spss.com (David Nichols) writes:
|> SPSS has just begun distributing a product
|> called Neural Connection that fits several kinds of neural net models
|> (radial basis functions, Kohonen networks and multilayer perceptrons).
The unsolicited brochure that SPSS sent me has some interesting claims
in it. The brochure says that "Neural nets can model data faster and
more closely than traditional statistics. ... neural networks do not
assume linearity or homogeneity of variance." The "highlights" include
the claim, "NO ASSUMPTIONS REQUIRED. Neural nets do not require
assumptions about the form of the data to analyze it."
It is true that a variety of neural nets can fit flexible nonlinear
functions, much like splines or kernel regression, and thus do not
"assume linearity." But I am very curious to know exactly what kinds
of neural nets are free of assumptions about homogeneity of variance,
independence of errors, and error distributions. Neural networks
such as MLPs and RBF nets are usually trained by maximum likelihood
assuming a normal or Bernoulli distribution (although the neural net
literature does not use that terminology). What marvelous new
estimation methods does the SPSS product use that evade the usual
distributional assumptions?
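To make the maximum-likelihood point concrete, here is a small numerical
check, offered as a sketch: for independent normal errors with a common
standard deviation, the negative log-likelihood is a constant plus the
error sum of squares divided by 2*sigma^2, so least-squares training and
normal-theory maximum likelihood pick the same fit. The data, the two
candidate fits, and the value of sigma are made up.

```python
import math

def sse(targets, preds):
    """Sum of squared errors, the usual least-squares training criterion."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds))

def normal_nll(targets, preds, sigma):
    """Negative log-likelihood for independent N(pred, sigma^2) errors."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (t - p) ** 2 / (2 * sigma ** 2)
        for t, p in zip(targets, preds)
    )

targets = [1.0, 2.0, 3.5, 5.0]    # made-up data
fit_a = [1.1, 1.9, 3.4, 5.2]      # predictions from one candidate model
fit_b = [0.5, 2.5, 3.0, 4.0]      # predictions from another
sigma = 0.7                       # any fixed common error s.d.

const = 0.5 * len(targets) * math.log(2 * math.pi * sigma ** 2)
for fit in (fit_a, fit_b):
    # NLL = constant + SSE / (2 sigma^2), so both criteria rank fits identically.
    assert math.isclose(normal_nll(targets, fit, sigma),
                        const + sse(targets, fit) / (2 * sigma ** 2))
```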
--
================Warren Sarle, 10 Nov 1995========ssm
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: neural network and cross-sectional data
Message-ID:
In article <47q0nc$s93@clipperton.bondy.orstom.fr>,
traissac@orstom.fr (Traissac Pierre) writes:
|> In article bcr@hydra.unm.edu, bohara@unm.edu (Alok Bohara ECONOMICS) writes:
|> >
|> > What would be a good software to do a neural network modeling? How
|> >much does it cost? (I am thinking in terms of environmental economics,
|> >or finance applications etc.).
|>
|> SAS Institute has developed a neural network module that has been
|> successfully tested in the fields of application you suggest.
|> From what I remember it is not cheap.
Our point-and-click neural net application is indeed "not cheap". But
the macros that are used behind that front-end are free. If you have
SAS/OR (6.08 or later) and don't mind typing percent signs (our macro
language uses lots of percent signs), then you can try the macros to see
if you can muddle through without the GUI. The macros are available by
anonymous ftp from ftp.sas.com (Internet gateway IP 192.35.83.8) in the
directory /pub/sugi19/neural. Read the README file.
There will be a new version of the TNN macro by the end of the year
that will include RBF nets and other neat stuff.
Alok Bohara also asks:
|> Could anyone give me a nice reference that shows a nonlinear (for
|> example) model estimated using the neural network method that shows some
|> interpretation of the estimated parameters? Some applications I have
|> seen in finance simply shows the predictive power of the neural network
|> over other methods such as ARIMA etc.
The parameter estimates are rarely interpretable. In most cases,
neural nets are so ill-conditioned that you can't even get accurate
parameter estimates unless you have a huge sample size (the predictions
may be quite accurate even if the parameter estimates are not).
Even then, it is common for the parameters to be underidentified.
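One simple source of non-uniqueness can be shown directly. In this
sketch of a one-input net with tanh hidden units, flipping the signs of
all weights attached to a hidden unit, or reordering the hidden units,
leaves every prediction unchanged, so the data cannot distinguish the two
parameter sets. This illustrates only that symmetry, not every way a net
can be underidentified; the weights and the function name mlp are made up.

```python
import math

def mlp(x, hidden, out_bias):
    """One-input MLP: hidden is a list of (in_weight, in_bias, out_weight)."""
    return out_bias + sum(v * math.tanh(w * x + b) for w, b, v in hidden)

hidden = [(1.5, -0.3, 2.0), (-0.7, 0.8, -1.2)]    # made-up weights

# Flip the signs on unit 0 (tanh is an odd function) and swap the unit order.
w, b, v = hidden[0]
equivalent = [hidden[1], (-w, -b, -v)]

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
same = all(
    math.isclose(mlp(x, hidden, 0.4), mlp(x, equivalent, 0.4), abs_tol=1e-12)
    for x in xs
)
```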
For more information, check out the comp.ai.neural-nets newsgroup
and get the FAQ from:
http://wwwipd.ira.uka.de/~prechelt/FAQ/neural-net-faq.html
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Neural-net, weight decay. (refs). Sarle.
=====================Warren Sarle, 27 Mar 1996========ssm, can-n
Newsgroups: comp.ai.neural-nets,sci.stat.math
Subject: Re: weight decay
Message-ID:
In article <4j5i74$sta@dfw-ixnews4.ix.netcom.com>,
jdadson@ix.netcom.com(Jive Dadson ) writes:
|> I've added weight decay to my NN software. Now I need to learn how
|> to use it. :-) I'm not even sure I've done it right. Should the output
|> layer have the same weight penalties as hidden layers? Does the type
|> of layer matter? For example, should a softmax output layer have
|> a different weight penalty than hidden tanh layers in front of it?
I'm working on weight decay for the comp.ai.neural-nets FAQ. Here's
what I've got so far.
Weight decay adds a penalty term to the error function. The usual
penalty is the sum of squared weights times a decay constant. In a
linear model, this form of weight decay is equivalent to ridge
regression. See "What is jitter?" for more explanation of ridge
regression.
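A minimal sketch of the ridge equivalence for the simplest linear model,
one weight and no intercept: gradient descent on the error sum of squares
plus decay times the squared weight ends up at the closed-form ridge
estimate sum(x*y) / (sum(x^2) + decay). The data, the learning rate, and
the number of steps are made up.

```python
def penalized_error(b, xs, ys, decay):
    """Error sum of squares plus decay * (sum of squared weights)."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) + decay * b ** 2

def train(xs, ys, decay, rate=0.01, steps=2000):
    """Plain gradient descent on penalized_error for the single weight b."""
    b = 0.0
    for _ in range(steps):
        grad = sum(-2 * x * (y - b * x) for x, y in zip(xs, ys)) + 2 * decay * b
        b -= rate * grad
    return b

xs = [1.0, 2.0, 3.0, 4.0]    # made-up data
ys = [2.1, 3.9, 6.2, 7.8]
decay = 5.0

b_decay = train(xs, ys, decay)
b_plain = train(xs, ys, 0.0)
# Closed-form ridge estimate for a one-weight model with no intercept.
b_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + decay)
```

Here b_decay agrees with b_ridge and is smaller in absolute value than
the unpenalized b_plain.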
The penalty term causes the weights to converge to smaller absolute
values than they otherwise would. Large weights can hurt generalization
in two different ways. Excessively large weights leading to hidden
units can cause the output function to be too rough, possibly with near
discontinuities. Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output
activation function is not bounded to the same range as the data.
Other penalty terms besides the sum of squared weights are sometimes
used. Weight elimination (Weigend, Rumelhart, and Huberman 1991)
uses:
               (w_i)^2
      sum   -------------
       i    (w_i)^2 + c^2
where w_i is the ith weight and c is a user-specified constant. Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to
shrink the small coefficients more, and is therefore more useful for
suggesting subset models (pruning).
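The difference in shrinkage can be seen from the derivatives of the two
penalties, sketched here. The derivative of w^2 is 2w, which grows with
the weight; the derivative of w^2 / (w^2 + c^2) is 2 w c^2 / (w^2 + c^2)^2,
which dies away once |w| is well above c. The weights 0.5 and 5.0 and
c = 1 are made up.

```python
def elimination_penalty(weights, c):
    """Weight-elimination penalty: sum of w^2 / (w^2 + c^2)."""
    return sum(w * w / (w * w + c * c) for w in weights)

def decay_pull(w):
    """Derivative of w^2: the shrinking force under ordinary weight decay."""
    return 2 * w

def elimination_pull(w, c):
    """Derivative of w^2 / (w^2 + c^2) with respect to w."""
    return 2 * w * c * c / (w * w + c * c) ** 2

c = 1.0
small, large = 0.5, 5.0    # made-up weights, one below c and one well above
pulls = {
    "decay": (decay_pull(small), decay_pull(large)),
    "elimination": (elimination_pull(small, c), elimination_pull(large, c)),
}
```

With these values ordinary decay pulls harder on the large weight, while
weight elimination pulls harder on the small one.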
The generalization ability of the network can depend crucially on the
decay constant, especially with small training sets. One approach to
choosing the decay constant is to train several networks with different
amounts of decay and estimate the generalization error for each; then
choose the decay constant that minimizes the estimated generalization
error. Weigend, Rumelhart, and Huberman (1991) iteratively update the
decay constant during training.
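The first approach can be sketched as a loop over candidate decay
constants. To keep the example short it uses a linear model rather than a
network, mean rather than summed squared error, and a separate validation
sample as the estimate of generalization error; all of these, along with
the simulated data and the candidate values, are assumptions for
illustration.

```python
import random

def predict(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def mse(w, data):
    """Mean squared prediction error of weights w on (x, y) pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def fit(train, decay, rate=0.02, steps=1500):
    """Fixed-length gradient descent on mse + decay * sum of squared weights."""
    w = [0.0] * len(train[0][0])
    for _ in range(steps):
        grad = [2 * decay * wj for wj in w]
        for x, y in train:
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / len(train)
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w

def simulate(n, rng, k=8):
    """Made-up data: k standard-normal inputs, only the first one matters."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(k)]
        data.append((x, x[0] + rng.gauss(0, 1)))
    return data

rng = random.Random(0)
train, valid = simulate(12, rng), simulate(200, rng)

candidates = [0.0, 0.01, 0.1, 0.3, 1.0]
valid_error = {d: mse(fit(train, d), valid) for d in candidates}
best_decay = min(valid_error, key=valid_error.get)
```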
There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or
adjust the penalty term for the standard deviations of all the inputs
and targets. It is usually a good idea to omit the biases from the
penalty term.
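A sketch of both points, with made-up data and hypothetical function
names. Standardizing removes the unit of measurement, so an input
recorded in dollars and the same input recorded in thousands of dollars
become identical and a given decay constant means the same thing for
either; the penalty function sums squared weights only and leaves the
biases out.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def penalty(weights, biases, decay):
    """decay * sum of squared weights; biases are accepted but not penalized."""
    return decay * sum(w * w for w in weights)

income_dollars = [21000.0, 35000.0, 48000.0, 90000.0]    # made-up input
income_thousands = [v / 1000 for v in income_dollars]

# After standardizing, the two versions of the input coincide.
z_dollars = standardize(income_dollars)
z_thousands = standardize(income_thousands)
```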
A fundamental problem with weight decay is that different types of
weights in the network will usually require different decay constants
for good generalization. At the very least, you need three different
decay constants for input-to-hidden, hidden-to-hidden, and
hidden-to-output weights. Adjusting all these decay constants to produce
the best estimated generalization error often requires vast amounts of
computation.
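A sketch of a penalty with a separate decay constant for each type of
weight, and of why tuning them is expensive: with five candidate values
for each of three constants, an exhaustive grid already means 125
networks to train. The group names, weights, and candidate values are
made up.

```python
from itertools import product

def grouped_penalty(weights, decay):
    """Sum over groups g of decay[g] * (sum of squared weights in group g)."""
    return sum(decay[g] * sum(w * w for w in ws) for g, ws in weights.items())

weights = {    # made-up weights; biases are not included
    "input_to_hidden": [0.5, -1.0, 2.0],
    "hidden_to_hidden": [0.1, 0.2],
    "hidden_to_output": [3.0],
}
decay = {"input_to_hidden": 0.01, "hidden_to_hidden": 0.1, "hidden_to_output": 0.001}
penalty = grouped_penalty(weights, decay)

# A grid search multiplies quickly: 5 candidate values for each of 3 constants.
grid = list(product([0.0, 0.001, 0.01, 0.1, 1.0], repeat=3))
networks_to_train = len(grid)
```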
Fortunately, there is a superior alternative to weight decay:
hierarchical Bayesian estimation. Bayesian estimation makes it possible
to estimate efficiently numerous decay constants. See "What is Bayesian
estimation?" [Unfortunately, I haven't written the answer to that
question yet].
References:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition,
Oxford: Oxford University Press.
Ripley, B.D. (1996) Pattern Recognition and Neural
Networks, Cambridge: Cambridge University Press.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
Generalization by weight-elimination with application to forecasting.
In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.),
Advances in Neural Information Processing Systems 3,
San Mateo, CA: Morgan Kaufmann.
|> The books I have don't have sufficient explanations of the theoretical
|> motivation or the practical application. Can anyone recommend one?
|>
|> I haven't really come to grips with error criteria in general. I want
|> to get thoroughly familiar with the Bayesian significance of the
|> error penalty as it applies both to weights and to the difference
|> between estimated values and the corresponding training values.
Bishop (1995) and Ripley (1996) are, of course, excellent sources on
weight decay and Bayesian issues. The best textbook I've seen on
Bayesian inference is Gelman, Carlin, Stern, and Rubin (1995). O'Hagan
(1985) is an excellent explanation of some of the odd things that can
happen with MAP estimation. MacKay and Neal have done the most work on
Bayesian methods for neural nets. I have had trouble getting some of
MacKay's methods to work; my own efforts are described too briefly
(there was a 10 page limit!) in Sarle (1995).
Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
(1985), Bayesian Statistics 2, Amsterdam: Elsevier
Science Publishers B.V. (North-Holland).
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995),
Bayesian Data Analysis, London: Chapman & Hall,
ISBN 0-412-03991-5.
MacKay, D.J.C. (1992), "A practical Bayesian framework for
backpropagation networks," Neural Computation, 4, 448-472.
MacKay, D.J.C. (199?), "Probable networks and plausible
predictions--a review of practical Bayesian methods for supervised
neural networks," ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z.
Neal, R.M. (1995), Bayesian Learning for Neural Networks,
Ph.D. thesis, University of Toronto,
ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z.
O'Hagan, A. (1985), "Shoulders in hierarchical models,"
in Bernardo et al. (1985), 697-710.
Sarle, W.S. (1995), "Stopped Training and Other
Remedies for Overfitting," to appear in Proceedings of
the 27th Symposium on the Interface,
ftp://ftp.sas.com/pub/neural/inter95.ps.Z
(this is a very large compressed postscript file, 747K, 10 pages)
====================Warren Sarle, 16 Feb 1996===============ssc
Message-ID:
From: Warren Sarle
Subject: Re: stepwise
In article <4fq9k4$dfn@news.duke.edu>, Frank Harrell writes:
|> I don't know when stepwise methods would be appropriate (unless you
|> use Tibshirani's new "lasso" method in a recent JRSS article to
|> shrink the regression coefficients).
The lasso is somewhat similar to certain weight decay methods in the
neural network (NN) literature. "Weight decay" is the NN term for
regularization methods such as ridge regression. Ridge regression
can be done by minimizing an objective function equal to the error
sum of squares plus a penalty term given by the sum of the squared
regression coefficients, sum(b_i)^2, times the ridge value. Instead
of sum(b_i)^2, the NN folks sometimes use:
               (b_i)^2
      sum   -------------
       i    (b_i)^2 + c^2
which is also called "weight elimination". c is a user-specified
constant. Whereas ridge regression tends to shrink the large
coefficients more than the small ones, weight elimination tends to
shrink the small coefficients more, and is therefore more useful for
suggesting subset models. How well this really works, no one seems to
know.
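For comparison, a sketch that evaluates the three penalties mentioned in
this exchange on the same coefficients. The lasso penalty is taken to be
the sum of absolute coefficients; the coefficient values and c = 1 are
made up. Each coefficient contributes at most 1 to the weight-elimination
penalty, while the ridge penalty is dominated by the largest coefficient.

```python
def ridge_penalty(b):
    """Sum of squared coefficients."""
    return sum(bi ** 2 for bi in b)

def lasso_penalty(b):
    """Sum of absolute coefficients."""
    return sum(abs(bi) for bi in b)

def weight_elimination_penalty(b, c):
    """Sum of b_i^2 / (b_i^2 + c^2); each term lies between 0 and 1."""
    return sum(bi ** 2 / (bi ** 2 + c ** 2) for bi in b)

b = [0.1, -0.5, 4.0]    # made-up coefficients
c = 1.0

penalties = {
    "ridge": ridge_penalty(b),
    "lasso": lasso_penalty(b),
    "weight elimination": weight_elimination_penalty(b, c),
}
```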
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html