---> USER NOTES <--- HCINFLU.EXE GENERALIZED AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHMS WITH INTERNAL INFLUENCE MEASUREMENT PROGRAMMED BY GLENN W. MILLIGAN REVISED BY RICHARD CHENG OHIO STATE UNIVERSITY VERSION 2.0 -- OCT. 1993 List of Files on Diskette: HCINFLU.EXE - Execute File for DOS / 486 Class System HCINFLU.F - FORTRAN Source File (ASCII Format) TEST1.DAT - Example Data File (ASCII Format) README.HC - USER NOTES File (ASCII Format) Introduction This diskette contains both source code and execution files for Hierarchical Clustering with Internal Influence Measurement (see Cheng & Milligan, 1994a, 1994b). An overview article on the software can be found in Cheng and Milligan (1994c). The DOS- based execution file, HCINFLU.EXE, was complied using Microway's NDP FORTRAN (508-746-7341) with the Ergo DOS extender. The execution file must be run from the DOS prompt. It will not run from within WINDOWS. The NDP execute file is optimized to run on a machine using a 486-class Intel processor. It may not execute on a machine using a different processor. The operating system should be DOS 3.3 or a later version. At least 4MB of memory configured as expanded memory is required. A mouse is not used. A color graphics monitor is recommended, but not required. The program will execute in two modes, a graphics mode and a text mode. The graphics mode is fairly general and should run on any EGA, VGA, or SVGA class system. The graphics mode is default. The graphics mode is a 650 by 350 screen using a 16 color palette. If the user is using a CGA or LCD monitor, then the text mode can be invoked by added an argument value of 1 to the command line: C:\> hcinflu 1 Text mode makes a brief call to a 640 by 200 screen based on a 2 color palette. We understand that this mode is supported on CGA, EGA, MCGA, VGA, and SVGA monitors. After the call, the software should revert to the text mode of operating system. History and Modifications The source code file, HCINFLU.F, represents a revised version of a software system first released by Milligan and Sokol (1980). Users interested in compiling and linking the software on a different system are free to do so. We warn you that this may require source code modification. From experience, the authors have found that the portability of source code can be problematic from one FORTRAN compiler to another. For example, the source code calls several routines that may be specific to the NDP compiler. Comments in the source code indicate that these calls can be disabled if necessary. The user can also attempt to increase the array sizes by changing the specifications in all five (5) PARAMETER statements in the code. This will allow the user to analyze data sets larger than current program limits. Similarly, users can insert their own code for the computation of a new proximity measure in the software. Two sections must be modified and are indicated in the source code. Alternatively, the user can specify a different clustering rule. Again, the source code is annotated to show where this is possible. Finally, the source code is provided for advanced users with FORTRAN programming capabilities who might want to modify or further develop the software. The authors make no representation as to warranty of the software. To the best of our knowledge, the code executes as intended. However, the reader is warned that the use of the software is at the user's own risk. The authors will not be able to support code modification or compiling the software on other computers and operating systems. Program Control The program presents a simple interactive interface. The program will elicit a variety of control parameters before conducting the data analysis. The data file is assumed to exist as a separate ASCII file to be found in the local directory. Output of the analysis is directed to an ASCII file written to the local directory. Alternatively, complete path names can be supplied to write to other user directories. Error handling routines are in place to detect user input values that are out of range or otherwise not acceptable. After an input error has been detected, the user is prompted to enter a new value. This request will be repeated until an acceptable value is entered. The error correction process is fairly thorough. However, it will not test the user's format statement [see input specification (13) below]. If the statement is not a legal FORTRAN format, then an execution error will occur when reading the data file. If the format is acceptable, data is read and the analysis continues. In order to provide a check on the user's format, the first two lines of data as read by the program are printed in the output file for verification purposes. If the user makes a data entry error, program execution can be terminated by entering a -99 at any prompt for numerical information. This is the preferred way to abort program execution. The user's monitor display should be returned to the original settings in effect before the program was started. The user may abort program execution by using the SHIFT-BREAK or CONTROL-C key combination. The break feature has not been disabled in the compiled version of the software. However, this action should only be taken in desperation! The abrupt termination of the program may not result in the return of the system video defaults. That is, the screen colors and video mode may not be reset. Data Input Specifications The input control parameters ask the user for at least 15 specifications. In some cases, default values are listed in the selections. These can be automatically selected by simply pressing the ENTER or RETURN key with no other key strokes. The input sequence: (1) A one line "Title" for the analysis. (2) The sample size of the data set. The current limit is 200. (3) The number of variables to be included in the analysis. The current limit is 30. (4) Select the proximity measure. Up to 4 choices are available. Specify 2 to have the program compute Pearson correlations. Specify 1 if Euclidean distances are to be computed from the input variables. These options assume that the data is in the form of a matrix where each row represents an observation and a column is present for each variable. A specification of 0 indicates that a lower-half triangular input matrix is to be provided by the user. This latter option allows the user to input data based on other measures of proximity that are not included in the clustering software. Finally, option 3 is listed as a user-specified proximity measure. This requires the user to insert their own FORTRAN code in the source listing and re-compilation of the software. Unless this action is taken, selection of this option results in program termination. (5) If alternative 1 is selected in #4 above, the user is asked whether the data should be standardized by range (see Milligan and Cooper, 1988, formula Z5). (6) If alternatives 0 or 3 are selected in #4 above, the user is asked to specify 1 for dissimilarities (such as distances), or -1 for similarities (such as correlations). These values are automatically set if alternatives 1 or 2 are selected in #4. (7) Select one or more clustering methods from a set of 10 options: 1 = Single Linkage (Nearest Neighbor) 2 = Complete Linkage (Furthest Neighbor) 3 = Group Average Method (UPGMA) 4 = Weighted Average Method (WPGMA) 5 = Centroid Method (UPGMC) 6 = Median Method (WPGMC) 7 = Ward's Minimum Variance Method 8 = Lance and William's Beta-Flexible Method 9 = Belbin, Faith, and Milligan's Beta-Flexible Method 10 = User Specified Method Information on Methods 8 and 9 can be found in Belbin, Faith, and Milligan (1992). If 8 or 9 is selected, subsequent prompts ask the user to specify the beta value to be used for the selected method. Alternative 10 requires coding in the source program and re- compilation of the software. In the absence of such modification, the program terminates execution if the option is selected. (8) Specify the range over which detailed information regarding the clusters is desired. Such information includes a list of cluster memberships, cluster centroids for variables data, and the measure of internal influence for each data point. The latter is reported only if option (11) is requested. If the user specifies the same level for both the upper and lower bounds, then the cluster centroids are written to a data file called CENT.DAT using the format (20G15.5). This file can be used as input for another analysis program, such as a k-means clustering procedure (see Milligan and Sokol, 1980). (9) Indicate the width, in columns, of the hardware device that you will use to print the output. For screen display only, the default of 80 columns can be used. Minimum is 75 columns. (10) Specify whether the point-biserial clustering tendency test is to be conducted. See Milligan and Sokol (1980), and Milligan (1981) for further information. This simulation-based test can be time consuming for larger size data sets. Default is to by-pass this option. (11) Specify whether the measure of internal influence is to be computed for each data point at the cluster levels specified in #8 above. This can be time consuming for larger size data sets. Again, default is to by-pass this option. (12) Indicate the type of tree output display that you prefer. Current options allow for either a skyline or an icicle plot, or to by-pass tree output. Default is the skyline plot. See the discussion below regarding tree plot types. (13) Specify the name of the file containing the data values. The program will verify that the file exists before proceeding. (14) Specify the FORTRAN format for reading the data file. For variables data, the format specifies how the data for a single observation is to be read. For a lower-half triangular proximity matrix, the format indicates how the longest data record is to be read. The format must be in parentheses and it must conform to FORTRAN format conventions. For example, the format (6x,8f4.0) indicates that the first six spaces are to be skipped, and then eight number fields exactly four columns wide are to be read, with 0 decimal places assumed present for each number. Skipping selected columns allows the user to omit variables or other information from the analysis. For example, in the TEST1.DAT data set, the first 6 columns are used to encode observation identification information. When specifying the format for reading the data, be sure to use a floating point field (F, G, or E formats). Do not use an integer data field (I format). (15) Specify the name of the file to which the clustering output will be directed. After determining the input specifications, the program presents a verification screen to the user. If any selection is found to be incorrect, the user can request to re-enter the specification. After the verification screen has been accepted, the program reads the data file and conducts the analysis. A series of informational lines are displayed at the user's screen to indicate the progress of the analysis. Example Input Below is a partial listing of user input specification from an example run. The listing shows the control parameters requested by the user: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ HCINFLU Generalized Agglomerative Hierarchical Clustering Algorithms with Internal Infleunce Measurement Programmed by Glenn W. Milligan Revised by Richard Cheng Ohio State University Version 2.0 -- Oct. 1993 Example run of test1.dat User Input Specifications: number of Entities = 30 Number of Variables = 8 Proximity Measure Type = 1 Method Codes Requested = 3 Centroids Computed for the 2 to 3 Cluster Solution The Specified Format is (6x,8f4.0) Skyline Plot Output Selected The First Two Data Lines Were Read as Follows: 21. 47. 40. 34. 16. 30. 44. 17. 25. 27. 46. 61. 14. 24. 50. 11. Starting the Group Average Method +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Example Output: Tree Output Two plotting options are available for producing the tree output. The first option is the traditional skyline plot used to represent a hierarchical clustering process. Example output for a 10 cluster data set is presented below. (Note, the brief tree displays do not correspond to the input listing given above.) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Item Number: 0 0 0 0 0 0 0 0 0 1 4 8 1 2 3 7 5 9 6 0 Level 9 . . . . XXX . . . . 8 . . . . XXX XXX . . 7 XXX . . XXX XXX . . 6 XXX . . XXXXXXX . . 5 XXX . . XXXXXXX XXX 4 XXX . XXXXXXXXX XXX 3 XXX XXXXXXXXXXX XXX 2 XXXXXXXXXXXXXXX XXX 1 XXXXXXXXXXXXXXXXXXX +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ In order to interpret the plot, several features should be noted. First, The columns of the plot correspond to a specified data point. The input order of the data elements corresponds to the "Item Number" assignment of each observation. Reading vertically, the first data element is #4, the second is #8, and the last is #10. The "Level" indicates the number of clusters present at the selected stage of the hierarchical solution. The clustering starts with merging data points #3 and #7. This produces a 9 cluster solution from the original 10 data points. Notice that when points are merged together, they are represented by a series of "XXX" values in the display. When data points are not clustered together, the period notation " . . " is used. At Level 8, data points #5 and #9 merge. This process continues until Level 1 is reached, where data points [4,8,1,2,3,7,5,9] are merged with elements [6,10] to produce a one cluster solution. The second plotting option is called an icicle plot. This plotting procedure was developed by Kruskal and Landwehr (1983). We are grateful to Joe Kruskal for providing source code that was used to produce this output option. We urge the reader to reference the original Kruskal and Landwehr article for an in- depth presentation of this graphical procedure. We have not implemented all of the original options. The interested reader with programming skills can attempt to extend the graphics if necessary. An example icicle display is presented below. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Item Number: 0 0 0 0 0 0 0 0 0 1 4 8 1 2 3 7 5 9 6 0 Level Proximity Value 1 47.394 0=0=0=0=0=0=0=0=0=1 2 30.393 4=8=1=2=3=7=5=9 6=0 3 20.707 &=& &=&=&=&=&=& &=& 4 19.795 0=0 0=0=0=0=0 0=1 5 17.117 4=8 3=7=5=9 6=0 6 13.097 &=& &=&=&=& 7 8.4853 0=0 0=0 0=0 8 6.4031 3=7 5=9 9 6.3246 &=& +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ As for interpreting the icicle plot, several features are shared in common with the skyline display. Note that the columns of the plot correspond to a specified data point. The input order of the data elements corresponds to the "Item Number" assignment of each observation. Reading vertically, the first data element is #4, the second is #8, and the last is #10. The "Level" indicates the number of clusters present at the selected stage of the hierarchical solution. However, note that the clustering sequence is the reverse of that found for skyline plots. That is, the top of the tree diagram corresponds to the one cluster solution. For the skyline display, the last row is the single group solution. reading from the bottom of the icicle plot, the clustering starts with merging data points #3 and #7. This produces a 9 cluster solution from the original 10 data points. Notice that when points are merged together, they are connected by an "=" sign. When data points are not clustered together, then nothing is printed between the columns. The absence of a printed character has been called a "wall" that separates distinct clusters. At Level 8, data points #5 and #9 merge. This process continues until Level 1 is reached, where data points [4,8,1,2,3,7,5,9] are merged with elements [6,10] to produce a one cluster solution. Two additional features of the icicle plot should be noted. First, the data point identification is carried down the column of printed characters. For example, item #3 is identified three times down its respective icicle. Each item number is separated vertically by an "&" sign. This makes the plot easier to read for larger size data sets. Secondly, the "Proximity Value" is indicated for each level in the clustering process. The number is the smallest fusion value found in the updated proximity matrix. This value represents the similarity between the two nearest clusters. These clusters are selected to merge at the corresponding hierarchy level. One final comment regarding skyline and icicle plots is in order. The tree information presented in both displays is identical. The icicle plot is an upside-down mirror image of the skyline tree. If you print both plots on paper and hold them together up to a light source, you can obtain a perfect overlay by reversing one plot upside-down over the other display. The only exception occurs when ties occur in the proximity matrix. The skyline plot will print two or more tied mergers at the same level. The icicle plot will produce a separate line for each merger. However, the ordering of the mergers is completely arbitrary. Example Output: Cluster Statistics & Internal Influence Presented below are summary statistics from the hierarchical levels selected for detailed analysis in the TEST1.DAT data set. (The output correspond to the example input given above.) The output below corresponds to the 3 cluster solution as requested in the input specification. (The output for the 2 cluster solution is similar and will not be presented here.) The first part of the output provides a detailed listing of cluster membership on a point-by-point basis. This information is redundant with that presented in the tree output, but it is in a format that is easier to use for some purposes. The second part of the output presents centroids computed for each of the three clusters based on the variables used in the cluster analysis. These values can be used for descriptive purposes, or as starting seeds for another clustering algorithm, such as a K-means method (see Milligan & Sokol, 1980). Finally, the internal influence measure for each data point is presented - assuming that this option had been requested by the user in option (10). In the output below, it appears that a number of influence points are present (#7, #9-#11, #16, #22-#24, #26). See the papers by Cheng and Milligan (1994a, 1994b) for further information about the measurement of internal influence. Since the magnitude of the user's data may vary from study to study, the FORTRAN output frequently uses the G format. The G format will determine the best way to present the output data depending on the number of output columns allowed in the format. On occiasion, the user may find output values that use E notation. For example: .4587E-1 is the same as .4587 times 10 to the -1 power, or .04587. .4587E+2 is equal to .4587 times 10 to the 2nd power, or 45.87. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Cluster Membership of Data Points for the 3 Cluster Solution (Hierarchy Level 3) Point Assigned to Cluster ------------------------------ 1 1 2 2 3 3 4 2 5 2 6 2 7 2 8 2 9 2 10 2 11 2 12 2 13 1 14 2 15 2 16 2 17 2 18 2 19 2 20 2 21 2 22 1 23 3 24 2 25 2 26 1 27 2 28 2 29 2 30 2 Centroids for the 3 Cluster Solution Centroids for Cluster 1 ------------------------- Variable 1 16.750 Variable 2 43.250 Variable 3 37.500 Variable 4 29.750 Variable 5 21.500 Variable 6 29.250 Variable 7 30.250 Variable 8 9.250 Centroids for Cluster 2 ------------------------- Variable 1 27.333 Variable 2 29.042 Variable 3 47.333 Variable 4 67.792 Variable 5 12.292 Variable 6 19.708 Variable 7 39.542 Variable 8 11.000 Centroids for Cluster 3 ------------------------- Variable 1 31.500 Variable 2 29.500 Variable 3 51.000 Variable 4 82.000 Variable 5 17.500 Variable 6 16.000 Variable 7 14.500 Variable 8 8.500 Internal Influence Measure POINT MEASURE ---------------------- 1 1.00000 2 1.00000 3 1.00000 4 1.00000 5 1.00000 6 1.00000 7 0.46294 8 1.00000 9 0.41262 10 0.46294 11 0.41262 12 1.00000 13 1.00000 14 1.00000 15 1.00000 16 0.52258 17 1.00000 18 1.00000 19 1.00000 20 1.00000 21 1.00000 22 0.85665 23 0.46214 24 0.52258 25 1.00000 26 0.55776 27 1.00000 28 1.00000 29 1.00000 30 1.00000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Example Output: Goodness-of-Fit Measure The remaining output relates to the methodology described by Milligan (1981) and Milligan and Sokol (1980). A separate row is reported for each level in the hierarchy. The first three columns present the level (number of clusters), proximity value, and stepsize. Stepsize is the increase in the selected fusion value from one level to the next. The fourth column presents the value for the point-biserial goodness-of-fit measure at the specified cluster level. Larger values of the point-biserial correlation suggest that a stronger cluster structure is present in the data. Finally, if requested by the user in #10 above, a randomization test is conducted. Multivariate uniform random data is generated to simulate data sets lacking any significant cluster structure. These data sets reflect a null hypothesis of no significant cluster structure in the data. The same number of variables and observations as found in the user's data is present in the simulated data. A total of 100 random data sets are created and subjected to the clustering process. The point-biserial measure is computed at each hierarchy level for each random data set. The mean and standard deviation for the point-biserial obtained from the 100 data sets is reported in the last two columns of the output. The "P-Value" is the proportion of simulated data sets with a larger value for the point-biserial measure than that found for the user's data. For example, at cluster level 18, the P-VALUE is .9200, meaning that the observed point-biserial value of .345 was exceeded by 92 of the 100 simulated data sets. This suggests that the clustering in the user's data at this level is not significantly different from that found in random data. However, at the level of 2 clusters, the observed point-biserial of .668 produced a P-VALUE of .0000, indicating that it was larger than all 100 randomly generated data sets. This suggests that a significant cluster structure may be present in the user's data set. More precisely, a structure seems to be present that is different from the structure found in the simulated data sets. We suggest that users do not over-interpret the results. The user also should note the problem relating to the overall Type I error rate when repeated hypothesis tests are conducted. We wish you success in your analyses! ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Output Statistics: Number of Proximity Stepsize Point P-Value Simulated Clusters Value Biserial Mean Std. Dev --------------------------------------------------------------------------- 29 7.746 0.000 0.088 1.0000 0.140 0.0139 28 7.810 0.064 0.124 1.0000 0.192 0.0194 27 9.965 2.155 0.168 1.0000 0.231 0.0210 26 10.630 0.665 0.186 1.0000 0.264 0.0242 25 11.793 1.163 0.214 1.0000 0.291 0.0247 24 12.450 0.657 0.226 1.0000 0.314 0.0268 23 13.491 1.041 0.236 1.0000 0.338 0.0292 22 13.856 0.366 0.246 1.0000 0.358 0.0315 21 14.318 0.461 0.255 0.9900 0.374 0.0342 20 16.368 2.050 0.313 0.9700 0.388 0.0369 19 17.004 0.636 0.335 0.9400 0.396 0.0400 18 17.606 0.603 0.345 0.9200 0.409 0.0463 17 18.138 0.532 0.350 0.9300 0.417 0.0467 16 18.439 0.301 0.354 0.9000 0.424 0.0517 15 19.127 0.688 0.362 0.8800 0.429 0.0530 14 19.461 0.334 0.391 0.7200 0.424 0.0590 13 20.688 1.228 0.394 0.7000 0.423 0.0593 12 21.817 1.129 0.397 0.5700 0.408 0.0663 11 22.166 0.349 0.424 0.4200 0.403 0.0720 10 22.427 0.262 0.453 0.2000 0.392 0.0733 9 22.561 0.134 0.457 0.1300 0.378 0.0694 8 24.111 1.550 0.473 0.0800 0.366 0.0748 7 25.523 1.412 0.483 0.0600 0.342 0.0878 6 26.937 1.414 0.573 0.0100 0.320 0.0896 5 30.387 3.450 0.575 0.0000 0.294 0.0864 4 31.156 0.768 0.581 0.0000 0.270 0.0817 3 33.637 2.481 0.626 0.0000 0.231 0.0850 2 36.226 2.589 0.668 0.0000 0.165 0.0809 1 51.842 15.616 0.000 End of Method ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ References Belbin, L., Faith, D., and Milligan, G.W. (1992). A Comparison of Two Approaches to Beta-Flexible Clustering. Multivariate Behavioral Research, 27, 417-433. Cheng, R., and Milligan, G.W. (1994a). Measuring the Influence of Individual Data Points in a Cluster Analysis. Under Publication Review. Cheng, R., and Milligan, G.W. (1994b). Identifying Influence Regions in Hierarchical Clustering. Under Publication Review. Cheng, R., and Milligan, G.W. (1994c). Hierarchical Clustering Algorithms with Influence Detection. Educational and Psychological Measurement, 54, in press. Kruskal, J.B. and Landwehr, J.M. (1983). Icicle Plots: Better Displays for Hierarchical Clustering. The American Statistician, 37, 162-168. Milligan, G.W. (1981). A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis. Psychometrika, 46, 187-199. Milligan, G.W., and Cooper, M.C. (1988). A Study of Variable Standardization in Cluster Analysis. Journal of Classification, 5, 181-204. Milligan, G.W., and Sokol, L. (1980). A Two Stage Clustering Algorithm with Robust Recovery Characteristics. Educational and Psychological Measurement, 40, 755-759.