[Message dealing with CART-type methods:] Newsgroups: sci.stat.math,sci.stat.edu,sci.stat.consult From: saswss@hotellng.unx.sas.com (Warren Sarle) Date: Sun, 11 Sep 1994 18:35:20 GMT In "CART- Classification and Regression Trees", sci.stat.math article <34t1t0$m2i@search01.news.aol.com>, ajhorovitz@aol.com (AJHorovitz) writes: |> |> CART-Classification and Regression Trees (Algorithms produced by |> California Statistical Software (Breiman, et al, 1984) and Interface by |> SALFORD SYSTEMS) |> ... |> CART is a new tree structured statistical analysis program that can |> automatically search for and find the hidden structure in your data. Based |> on the original work of some of the world's leading statisticians, CART is |> the only "stand-alone" tree-based program that can give you statistically |> valid results. Since the task of distributing information on empirical decision tree methodology seems to have fallen on me, I feel I should correct the misinformation in the post quoted above. I asked AJHorovitz whether he intended to say that FIRM, Knowledge Seeker, and Data Splits (to name but a few '"stand-alone" tree-based programs') are statistically invalid. The gist of his reply was that only cross-validation yields statistically valid results. FIRM and Knowledge Seeker do multiplicity-adjusted significance tests. While some statisticians have philosophical objections to significance tests, branding significance tests as invalid in advertising literature strikes me as misleading as anything in the Systat/Statistica debate. Even if we grant that significance tests are statistically invalid, we are left with the fact that Data Splits and IND both do the same kind of cross-validation as CART does. So the claim that 'CART is the only "stand-alone" tree-based program that can give you statistically valid results' is clearly incorrect. I set follow-ups to sci.stat.edu, since that is where the recent debate on statistical software marketing has been going on. Here is my summary of empirical decision tree software. I updated the information on CART to give Salford Systems address. .................................................................. There are many algorithms and programs for computing empirical decision trees. Several families can be identified with typical characteristics as listed below: The CART family: CART, tree (S), etc. Motivation: statistical prediction. Exactly two branches from each nonterminal node. Cross-validation and pruning are used to determine size of tree. Response variable can be quantitative or nominal. Predictor variables can be nominal or ordinal, and continuous predictors are supported. The CLS family: CLS, ID3, C4.5, etc. Motivation: concept learning. Number of branches equals number of categories of predictor. Only nominal response and predictor variables are supported in early versions, although I'm told that the latest version of C4.5 supports ordinal predictors The AID family: AID, THAID, CHAID, MAID, XAID, FIRM, TREEDISC, etc. Motivation: detecting complex statistical relationships. Number of branches varies from two to the number of categories of predictor. Statistical significance tests (with multiplicity adjustments in the later versions) are used to determine size of tree. AID, MAID, and XAID are for quantitative responses. THAID, CHAID, and TREEDISC are for nominal responses, although the version of CHAID from Statistical Innovations, distributed by SPSS, can handle a quantitative categorical response. FIRM comes in two varieties for categorical or continuous response. Predictors can be nominal or ordinal and there is usually provision for a missing-value category. Some versions can handle continuous predictors, others cannot. There are also a variety of methods that do splits on linear combinations rather than single predictors. I have not yet constructed a taxonomy for such methods. Some programs combine two or more families. For example, IND combines methods from CART and C4 as well as Bayesian and minimum encoding methods. Knowledge Seeker combines methods from CHAID and ID3 with a novel multiplicity adjustment. There are numerous unresolved statistical issues regarding these methods. Perhaps the most important is how big should the tree be? CART supporters claim that its pruning method using cross-validation is superior to the significance testing method used in the AID family. However, pruning is very easy and quick to do in the AID family since the p-values are computed while growing the tree and no cross-validation is required for pruning. The validity of CART cross-validation is suspect because CART seems to produce much smaller trees than the AID family, even using very conservative significance levels for the latter, which one would expect to validate well although empirical evidence is scarce. I have not seen any published comparison of CART and AID methods. This would make an excellent topic for a thesis. Some references: Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984), _Classification and Regression Trees_, Wadsworth: Belmont, CA. Chambers, J.M. amd Hastie, T.J. (1992), _Statistical Models in S_, Wadsworth & Brooks /Cole: Pacific Grove, CA. Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection", in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_, 267-302, Cambridge Univ Press: Cambridge. Morgan & Messenger (1973) _THAID--a sequential analysis program for the analysis of nominal scale dependent variables_, Survey Research Center, U of Michigan. Morgan & Sonquist (1963) "Problems in the analysis of survey data and a proposal", JASA, 58, 415-434. (Original AID) Morton, S.C. (1992) "New advances in statistical dendrology", Chance, 5, 76-79. See also letter to editor in volume 6 no. 1. Quinlan, J.R. (1993), _C4.5: Programs for Machine Learning_, Morgan Kaufman: San Mateo, CA. The following information on software sources has been culled from previous posts and may be out of date or inaccurate: C4.5 C source code for a new, improved decision tree algorithm known as C4.5 is in the new book by Ross Quinlan (of ID3 fame). "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992. It goes for $44.95. With accompanying software on magnetic media it runs for $69.95. ISBN # 1-55860-238-0 CART Salford Systems, 341 N44th Street #711, Lincoln NE 68503, USA. Academic price is $399.00 (US). SYSTAT Corporation distributes a PC version of CART. They can be reached at SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201, USA. Phone: (708) 864-5670, FAX: (708) 492-3567. CHAID PC version from SPSS (800) 543-5831. Mainframe version from Statistical Innovations Inc., 375 Concord Avenue Belmont, Mass. 02178 Data Splits From Padraic Neville (510) 787-3452, $10 for preliminary release. FIRM `FIRM Formal Inference-based Recursive Modeling', University of Minnesota School of Statistics Technical Report #546, 1992. The writeup and a diskette containing executables is available from the U of M bookstore for $17.50. Incredible bargain! IND Version 2.0 should be available soon at a modest price from NASAs COSMIC center in Georgia, USA. Enquiries should be directed to: mail (to customer support): service@cossack.cosmic.uga.edu Phone: (706) 542-3265 and ask for customer support FAX: (706) 542-4807. Knowledge Seeker Phone 613 723 8020. PC-Group is available from Austin Data Management Associates, P.O. Box 4358, Austin, TX 78765, (512) 320-0935. It runs on IBM and compatible personal computers with 512K of memory, and costs $495. A free demo version of the program is available upon request. tree S: phone 800 462-8146 TREEDISC SAS macro using SAS/IML and SAS/OR available free from SAS Institute technical support (919) 677-8000. -- Warren S. Sarle SAS Institute Inc. The opinions expressed here saswss@unx.sas.com SAS Campus Drive are mine and not necessarily (919) 677-8000 Cary, NC 27513, USA those of SAS Institute.