[Message dealing with CART-type methods:]

Newsgroups: sci.stat.math,sci.stat.edu,sci.stat.consult
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Date: Sun, 11 Sep 1994 18:35:20 GMT

In "CART- Classification and Regression Trees", sci.stat.math article
<34t1t0$m2i@search01.news.aol.com>, ajhorovitz@aol.com (AJHorovitz) writes:
|> CART-Classification and Regression Trees (Algorithms produced by
|> California Statistical Software (Breiman, et al, 1984) and Interface by
|> ...
|> CART is a new tree structured statistical analysis program that can
|> automatically search for and find the hidden structure in your data.   Based
|> on the original work of some of the world's leading statisticians, CART is
|> the only "stand-alone" tree-based program that can give you statistically
|> valid results.
Since the task of distributing information on empirical decision tree
methodology seems to have fallen on me, I feel I should correct the
misinformation in the post quoted above.
I asked AJHorovitz whether he intended to say that FIRM, Knowledge
Seeker, and Data Splits (to name but a few '"stand-alone" tree-based
programs') are statistically invalid. The gist of his reply was that
only cross-validation yields statistically valid results.
FIRM and Knowledge Seeker do multiplicity-adjusted significance tests.
While some statisticians have philosophical objections to significance
tests, branding significance tests as invalid in advertising literature
strikes me as being as misleading as anything in the Systat/Statistica debate.
Even if we grant that significance tests are statistically invalid, we
are left with the fact that Data Splits and IND both do the same kind of
cross-validation as CART does. So the claim that 'CART is the only
"stand-alone" tree-based program that can give you statistically valid
results' is clearly incorrect.
I set follow-ups to sci.stat.edu, since that is where the recent
debate on statistical software marketing has been going on.
Here is my summary of empirical decision tree software. I updated the
information on CART to give Salford Systems' address.
There are many algorithms and programs for computing empirical 
trees.  Several families can be identified with typical characteristics
as listed below:
   The CART family: CART, tree (S), etc.
      Motivation: statistical prediction.
      Exactly two branches from each nonterminal node.
      Cross-validation and pruning are used to determine size of tree.
      Response variable can be quantitative or nominal.
      Predictor variables can be nominal or ordinal, and continuous
         predictors are supported.
   The CLS family: CLS, ID3, C4.5, etc.
      Motivation: concept learning.
      Number of branches equals number of categories of predictor.
      Only nominal response and predictor variables are supported
         in early versions, although I'm told that the latest version
         of C4.5 supports ordinal predictors.
   The AID family: AID, THAID, CHAID, etc.
      Motivation: detecting complex statistical relationships.
      Number of branches varies from two to the number of categories
         of predictor.
      Statistical significance tests (with multiplicity adjustments
         in the later versions) are used to determine size of tree.
      AID, MAID, and XAID are for quantitative responses.
      THAID, CHAID, and TREEDISC are for nominal responses, although
         the version of CHAID from Statistical Innovations, distributed
         by SPSS, can handle a quantitative categorical response.
      FIRM comes in two varieties for categorical or continuous response.
      Predictors can be nominal or ordinal and there is usually
         provision for a missing-value category.
      Some versions can handle continuous predictors, others cannot.
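To make the CART-style binary split concrete, here is a minimal sketch
(my own toy code, not taken from any of the packages above) that searches
every threshold on a single continuous predictor and keeps the one with
the largest decrease in Gini impurity, the split criterion commonly
associated with CART:

```python
# Toy illustration of a CART-style binary split: one continuous
# predictor, nominal response, exhaustive threshold search.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(x, y):
    """Try every midpoint between distinct sorted x values; return
    (threshold, impurity decrease) for the best split."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    parent = gini([lab for _, lab in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal predictor values
        thresh = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [lab for val, lab in pairs if val <= thresh]
        right = [lab for val, lab in pairs if val > thresh]
        decrease = (parent
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
        if decrease > best[1]:
            best = (thresh, decrease)
    return best

x = [1, 2, 3, 10, 11, 12]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(best_binary_split(x, y))  # prints (6.5, 0.5): a clean split
```

A real CART implementation repeats this search over all predictors at
every node and then prunes the resulting tree; this sketch shows only
the single-split step.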
There are also a variety of methods that do splits on linear
combinations rather than single predictors. I have not yet constructed
a taxonomy for such methods.
Some programs combine two or more families. For example, IND combines
methods from CART and C4 as well as Bayesian and minimum encoding
methods. Knowledge Seeker combines methods from CHAID and ID3 
with a novel multiplicity adjustment.
There are numerous unresolved statistical issues regarding these
methods.  Perhaps the most important is: how big should the tree be?
CART supporters claim that its pruning method using cross-validation is
superior to the significance testing method used in the AID family.
However, pruning is very easy and quick to do in the AID family since
the p-values are computed while growing the tree and no cross-validation
is required for pruning. The validity of CART cross-validation is
suspect because CART seems to produce much smaller trees than the AID
family, even when the latter uses very conservative significance
levels, which one would expect to validate well; empirical evidence,
however, is scarce. I have not seen any published comparison of CART
and AID methods. This would make an excellent topic for a thesis.
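To make the AID-family approach concrete, here is a toy sketch (my own
illustration; CHAID and FIRM use more elaborate adjustments than plain
Bonferroni) of how a multiplicity-adjusted p-value controls tree size:
a split is kept only if its adjusted p-value beats the significance
level, so pruning at a stricter level just means re-reading the stored
p-values, with no cross-validation required:

```python
# Toy sketch of significance-based tree sizing in the AID family.
# The Bonferroni adjustment here is an assumption for illustration;
# the adjustments in CHAID/FIRM are more elaborate.

def bonferroni(p_raw, n_tests):
    """Simple Bonferroni multiplicity adjustment."""
    return min(1.0, p_raw * n_tests)

def keep_split(p_raw, n_candidate_splits, alpha=0.05):
    """Keep a split only if its adjusted p-value beats alpha."""
    return bonferroni(p_raw, n_candidate_splits) < alpha

# The same raw p-value of 0.01 survives 3 candidate splits
# (adjusted p ~ 0.03) but not 10 (adjusted p ~ 0.10):
print(keep_split(0.01, 3))   # prints True
print(keep_split(0.01, 10))  # prints False
```

The point is that the multiplicity adjustment grows with the number of
candidate splits examined, so nodes offering many possible splits must
clear a higher bar, which is how these methods limit tree size without
a cross-validation pass.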
Some references:
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984),
_Classification and Regression Trees_, Wadsworth: Belmont, CA.
Chambers, J.M. and Hastie, T.J. (1992), _Statistical Models in S_,
Wadsworth & Brooks /Cole: Pacific Grove, CA.
Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection",
in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_,
267-302, Cambridge Univ Press: Cambridge.
Morgan & Messenger (1973) _THAID--a sequential analysis
program for the analysis of nominal scale dependent variables_,
Survey Research Center, U of Michigan.
Morgan & Sonquist (1963) "Problems in the analysis of survey data
and a proposal", JASA, 58, 415-434. (Original AID)
Morton, S.C. (1992) "New advances in statistical dendrology", Chance,
5, 76-79. See also letter to editor in volume 6 no. 1.
Quinlan, J.R. (1993), _C4.5: Programs for Machine Learning_, Morgan
Kaufmann: San Mateo, CA.
The following information on software sources has been culled from
previous posts and may be out of date or inaccurate:
C4.5
   C source code for a new, improved decision tree algorithm known as
   C4.5 is in the new book by Ross Quinlan (of ID3 fame), "C4.5:
   Programs for Machine Learning", Morgan Kaufmann, 1993. It goes for
   $44.95.  With accompanying software on magnetic media it runs for
   $69.95.  ISBN 1-55860-238-0.
CART
   Salford Systems, 341 N44th Street #711, Lincoln NE 68503, USA.
   Academic price is $399.00 (US).
   SYSTAT Corporation distributes a PC version of CART. They can be
   reached at SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201,
   USA.  Phone: (708) 864-5670, FAX: (708) 492-3567.
CHAID
   PC version from SPSS (800) 543-5831.
   Mainframe version from Statistical Innovations Inc., 375 Concord
   Avenue, Belmont, Mass. 02178.
Data Splits
   From Padraic Neville (510) 787-3452, $10 for preliminary release.
FIRM
   `FIRM: Formal Inference-based Recursive Modeling', University of
   Minnesota School of Statistics Technical Report #546, 1992.  The
   writeup and a diskette containing executables are available from the
   U of M bookstore for $17.50. Incredible bargain!
IND
   Version 2.0 should be available soon at a modest price from NASA's
   COSMIC center in Georgia, USA. Enquiries should be directed to
   customer support: e-mail service@cossack.cosmic.uga.edu, phone
   (706) 542-3265 (ask for customer support), FAX (706) 542-4807.
Knowledge Seeker
   Phone (613) 723-8020.
   Also available from Austin Data Management Associates,
   P.O. Box 4358, Austin, TX 78765, (512) 320-0935.  It runs on IBM
   and compatible personal computers with 512K of memory, and costs
   $495.  A free demo version of the program is available upon
   request: phone (800) 462-8146.
TREEDISC
   SAS macro using SAS/IML and SAS/OR, available free from SAS Institute
   technical support (919) 677-8000.
Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.