---> USER NOTES <--- KMINFLU.EXE K-MEANS CLUSTERING ALGORITHMS WITH INTERNAL INFLUENCE MEASUREMENT PROGRAMMED BY GLENN W. MILLIGAN REVISED BY RICHARD CHENG OHIO STATE UNIVERSITY VERSION 2.0 -- FEB. 1995 List of Files on Diskette: KMINFLU.EXE - Execute File for DOS / 486 & Pentium Class Systems KMINFLU.F - FORTRAN Source File (ASCII Format) TEST1.DAT - Example Data File (ASCII Format) CENT1.DAT - Cluster Centroids for TEST1.DAT (ASCII Format) README.KM - USER NOTES File (ASCII Format) Introduction This diskette contains both source code and execution files for K-Means Clustering with Internal Influence Measurement (see Cheng & Milligan, 1995a, 1995b). An overview article on the software can be found in Cheng and Milligan (1995c). The DOS- based execution file, KMINFLU.EXE, was complied using Microway's NDP FORTRAN (508-746-7341) with the Ergo DOS extender. The execution file must be run from the DOS prompt. It will not run from within WINDOWS. The NDP execute file is optimized to run on a machine using a 486 or Pentium-class Intel processor. It may not execute on a machine using a different processor. The operating system should be DOS 3.3 or a later version. At least 4MB of memory configured as expanded (EMS) memory is required. A mouse is not used. A color graphics monitor is recommended, but not required. The program will execute in two modes, a graphics mode and a text mode. The graphics mode is fairly general and should run on any EGA, VGA, or SVGA class system. The graphics mode is default. The graphics mode is a 650 by 350 screen using a 16 color palette. If the user is using a CGA or LCD monitor, then the text mode can be invoked by added an argument value of 1 to the command line: C:\> kminflu 1 Text mode makes a brief call to a 640 by 200 screen based on a 2 color palette. We understand that this mode is supported on CGA, EGA, MCGA, VGA, and SVGA monitors. After the call, the software should revert to the text mode of operating system. History and Modifications The source code file, KMINFLU.F, represents a revised version of a software system first released by Milligan and Sokol (1980). Users interested in compiling and linking the software on a different system are free to do so. We warn you that this may require source code modification. From experience, the authors have found that the portability of source code can be problematic from one FORTRAN compiler to another. For example, the source code calls several routines that may be specific to the NDP compiler. Comments in the source code indicate that these calls can be disabled if necessary. The user can also attempt to increase the array sizes by changing the specifications in all three (3) PARAMETER statements in the code. This will allow the user to analyze data sets larger than current program limits. Similarly, users can insert their own code for the computation of a new proximity measure in the software. The sections to be modified are indicated in the source code. Finally, the source code is provided for advanced users with FORTRAN programming capabilities who might want to modify or further develop the software. The authors make no representation as to warranty of the software. To the best of our knowledge, the code executes as intended. However, the reader is warned that the use of the software is at the user's own risk. The authors will not be able to support code modification or compiling the software on other computers and operating systems. Program Control The program presents a simple interactive interface. The program will elicit a variety of control parameters before conducting the data analysis. The data file is assumed to exist as a separate ASCII file to be found in the local directory. Output of the analysis is directed to an ASCII file written to the local directory. Alternatively, complete path names can be supplied to write to other user directories. Error handling routines are in place to detect user input values that are out of range or otherwise not acceptable. After an input error has been detected, the user is prompted to enter a new value. This request will be repeated until an acceptable value is entered. The error correction process is fairly thorough. However, it will not test the user's format statement [see input specifications (13 & 14) below]. If the statements are not legal FORTRAN formats, then an execution error will occur when reading the data files. If the format is acceptable, data is read and the analysis continues. In order to provide a check on the user's formats, the first two lines of data as read by the program is printed in the output file for verification purposes. If the user makes a data entry error, program execution can be terminated by entering a -99 at any prompt for numerical information. This is the preferred way to abort program execution. The user's monitor display should be returned to the original settings in effect before the program was started. The user may abort program execution by using the SHIFT- BREAK or CONTROL-C key combination. The break feature has not been disabled in the compiled version of the software. However, this action should only be taken in desperation! The abrupt termination of the program may not result in the return of the system video defaults. That is, the screen colors and video mode may not be reset. Data Input Specifications The input control parameters ask the user for at least 15 specifications. In some cases, default values are listed in the selections. These can be automatically selected by simply pressing the ENTER or RETURN key with no other key strokes. The input sequence: (1) A one line "Title" for the analysis. (2) The sample size of the data set. The current limit is 550. (3) The number of variables to be included in the analysis. The current limit is 30. (4) Specify the number of clusters to be evaluated. The current limit is 40 clusters. (5) Select one of the clustering methods (see Anderberg, 1973): 1 = Jancey's K-Means Method 2 = Forgy's K-Means Method (6) Control the iteration process for the K-means algorithms. Each iteration represents a complete pass through the data set. A limit of 0 allows the algorithm to continue iterating until a stable solution is reached where no data points change cluster membership. Entering an integer value of "n" causes the iteration process to stop as soon as "n" or fewer data points change cluster membership during the most recent iteration. (7) The user is asked whether the data should be standardized by range (see Milligan and Cooper, 1988). (8) Select how the initial seed points are to be determined: 1 = Seed points are specified as selected data elements 2 = Data units are grouped into an initial partition 3 = Centroids are specified as seed points 4 = Randomly selected elements to be used as seed points (9) If option 3 selected in (8), the user is asked is to enter the name of the file containing the cluster centroids. If standardization by range has been selected in (7) and option 3 was selected in (8), the user is asked to confirm that the input centroids have also been standardized. If not, the starting seeds will be particularly inappropriate and likely result in one or more clusters with zero elements assigned to it. Thus, the program will terminate with appropriate instructions to the user if the data is to be standardized and the input centroids have not been similarly treated. If option 1 is selected in (8), the user is asked to specify those data points to be used as starting seeds. If option 2 is selected in (8), the data elements used to compute each starting centroid must be grouped together in the data input file. The user is prompted to specify the first data element of each group of observations. Data element #1 is assumed to be the first element of the first group. (10) If option 3 is selected in (8), the user is asked to specify the FORTRAN format for reading the centroid file. For data files generated by the companion hierarchical software, HCINFLU.EXE, the format is (30G15.5) (see Milligan and Cheng, 1995c). (11) Specify whether the measure of internal influence is to be computed for each data point at the cluster levels specified in (4) above. This can be time consuming for larger size data sets. The default is to by-pass this option. In some cases, the clustering that results from the removal of the data element in question may produce a cluster with zero elements assigned to it. Since the clustering method is not able to complete such an analysis, the data element is flagged in the output and the internal influence is specified as being undefined. (12) Specify the name of the file containing the data values. The program will verify that the file exists before proceeding. (13) Specify the FORTRAN format for reading the data file. For variables data, the format specifies how the data for a single observation is to be read. The format must be in parentheses and it must conform to FORTRAN format conventions. For example, the format (6x,8f4.0) indicates that the first six columns are to be skipped, and then eight number fields exactly four columns wide are to be read, with 0 decimal places assumed present for each number. Skipping selected columns allows the user to omit variables or other information from the analysis. For example, in the TEST1.DAT data set, the first 6 columns are used to encode observation identification information. When specifying the format for reading the data, be sure to use a floating point field (F, G, or E formats). Do not use an integer data field (I format). (14) Specify the name of the file to which the clustering output will be directed. (15) Specify the extent of cluster output. Enter 0 to obtain a normal output listing of cluster memberships, internal influence values, and point-biserial statistics. Enter 1 to obtain extensive output based on a series of n clusterings where each point is eliminated in turn during the influence computation. After determining the input specifications, the program presents a confirmation screen that allows the user to review and change selections if desired. Not all possible selection items will be presented, just those that affect the current analysis. Once the confirmation screen has been accepted, the program reads the data file and conducts the analysis. A series of informational lines are displayed at the user's screen to indicate the progress of the analysis. Example Input Below is a partial listing of user input specification from an example run. The listing shows the control parameters requested by the user: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ KMINFLU K-Means Algorithm With Internal Influence Measurement Developed By Glenn W. Milligan Revised by Richard Cheng Ohio State University Version 2.0 -- Feb. 1995 Test analysis using TEST1.DAT data file. User Input Specifications: Number of elements = 30 Number of variables = 8 Number of clusters = 4 Minimum release = 0 Initial partition = User supplied starting centroids Method used = Jancey's Algorithm Data file name = test1.dat Data input format = (6x,8f4.0) The first two data lines were read as follows: 21. 47. 40. 34. 16. 30. 44. 17. 25. 27. 46. 61. 14. 24. 50. 11. Centroids file name = cent.dat Centroid input format = (30g15.5) The first two lines of centroids file were read as follows: 16.750 43.250 37.500 29.750 21.500 29.250 30.250 9.2500 27.222 31.111 47.444 62.167 13.722 21.889 38.889 10.444 Starting Partition Information: Initial cluster centroids read in as follows: Centroids for cluster 1 16.750 43.250 37.500 29.750 21.500 29.250 30.250 9.2500 Centroids for cluster 2 27.222 31.111 47.444 62.167 13.722 21.889 38.889 10.444 Centroids for cluster 3 31.500 29.500 51.000 82.000 17.500 16.000 14.500 8.5000 Centroids for cluster 4 27.667 22.833 47.000 84.667 8.0000 13.167 41.500 12.667 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Example Output The example output below for the TEST1.DAT includes five segments of the analysis: 1: The clustering iteration history. Here we see that only four iterations were required. This may attest to the quality of the starting seeds. 2: A cluster membership listing shows the cluster assignment of each point in turn. 3: A listing of cluster sizes. 4: Detailed information for each of the four final clusters. Included are membership listings and cluster centroids. 5: The point-biserial statistic for the final solution. This value is based on the clustering of all n objects. Since the magnitude of the user's data may vary from study to study, the FORTRAN output frequently uses the G format. The G format will determine the best way to present the output data depending on the number of output columns allowed in the format. On occiasion, the user may find output values that use E notation. For example: .4587E-1 is the same as .4587 times 10 to the -1 power, or .04587. .4587E+2 is equal to .4587 times 10 to the 2nd power, or 45.87. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Intermediate Results: (history of the clustering iterations) Iteration # of Data Units Moved Summed Deviation About Seed Points ------------------------------------------------------------------------- 1 30 462.33942 2 2 459.79034 3 2 448.62408 4 0 450.95224 Final clustering results are: Raw Membership List: Data Cluster -------------------- 1 1 2 2 3 3 4 4 5 2 6 2 7 4 8 2 9 4 10 4 11 4 12 2 13 1 14 2 15 2 16 4 17 2 18 2 19 2 20 4 21 2 22 1 23 3 24 4 25 4 26 1 27 2 28 4 29 2 30 4 Cluster Sizes: Cluster Count -------------------- 1 4 2 13 3 2 4 11 Detailed information for each cluster: Cluster 1 contains 4 data elements Membership List: 1 13 22 26 Centroid Coordinates: 16.750 43.250 37.500 29.750 21.500 29.250 30.250 9.2500 Cluster 2 contains 13 data elements Membership List: 2 5 6 8 12 14 15 17 18 19 21 27 29 Centroid Coordinates: 25.923 30.615 47.231 56.923 14.077 21.538 36.692 9.9231 Cluster 3 contains 2 data elements Membership List: 3 23 Centroid Coordinates: 31.500 29.500 51.000 82.000 17.500 16.000 14.500 8.5000 Cluster 4 contains 11 data elements Membership List: 4 7 9 10 11 16 20 24 25 28 30 Centroid Coordinates: 29.000 27.182 47.455 80.636 10.182 17.545 42.909 12.273 The point biserial goodness-of-fit index for the complete data set is: 0.558 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Example Output: Cluster Statistics & Internal Influence Presented below are summary statistics from the four cluster solution selected for detailed analysis in the TEST1.DAT data set. (The output correspond to the example input given above.) The internal influence measure for each data point is presented, assuming that this option had been requested by the user in option (11). In the output below, it appears that only data element #23 is an influence point. See the papers by Cheng and Milligan (1995a, 1995b) for further information about the measurement of internal influence. The remaining output relates to the methodology described by Milligan (1981) and Milligan and Sokol (1980). A point-biserial value is presented for each data point as it was deleted in turn. For example, the first value of .555 was obtained by clustering data points #2-#30 with data point #1 removed. Larger values of the point-biserial correlation suggest that a stronger cluster structure is present in the data. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Summary Statistics: Data Point Point Biserial Internal Inf. Removed Statistic Measure -------------------------------------------------- 1 0.555 1.000 2 0.562 1.000 3 0.571 1.000 4 0.552 1.000 5 0.560 1.000 6 0.554 1.000 7 0.561 1.000 8 0.550 1.000 9 0.564 1.000 10 0.569 1.000 11 0.564 1.000 12 0.554 1.000 13 0.556 1.000 14 0.560 1.000 15 0.563 1.000 16 0.552 1.000 17 0.567 1.000 18 0.562 1.000 19 0.549 1.000 20 0.548 1.000 21 0.548 1.000 22 0.568 1.000 23 0.541 0.928 <----- 24 0.567 1.000 25 0.555 1.000 26 0.553 1.000 27 0.552 1.000 28 0.570 1.000 29 0.552 1.000 30 0.555 1.000 End of Method +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ We wish you success in your analyses! References Anderberg, M.R. (1973). Cluster Analysis for Applications. New York: Academic Press. Cheng, R., and Milligan, G.W. (1995a). Measuring the Influence of Individual Data Points in a Cluster Analysis. Jounral of Classification, in press. Cheng, R., and Milligan, G.W. (1995b). Mapping Influence Regions in Hierarchical Clustering. Multiavariate Behavioral Research, in press. Cheng, R., and Milligan, G.W. (1995c). Hierarchical Clustering Algorithms with Influence Detection. Educational and Psychological Measurement, 54, in press. Milligan, G.W. (1981). A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis. Psychometrika, 46, 187-199. Milligan, G.W., and Cooper, M.C. (1988). A Study of Variable Standardization in Cluster Analysis. Journal of Classification, 5, 181-204. Milligan, G.W., and Sokol, L. (1980). A Two Stage Clustering Algorithm with Robust Recovery Characteristics. Educational and Psychological Measurement, 40, 755-759.