Kappa - problems (1997)
Miscellaneous problems: Structural zero. Extreme counts. Paired comparisons? (Also, see "ROC" notes on sensitivity, specificity.)
  • Kappa and structural zero
  • =======================Hume Winzar, 07 Jul 1993==========sms
    Subject: Re: Cohen's kappa & structural zero
    Message-ID: <winzar.139.0@newsman>

    In article <1993Jul6.150449.1@lure.latrobe.edu.au> ortihs@lure.latrobe.edu.au writes:
    >My question concerns the use of Cohen's Kappa in designs with structural zeros
    >(missing really). The study was designed to assess inter-tester reliability
    >in the diagnosis of the presence and/or type of trigger points in back muscles.
    >The 3x3 design inevitably contains a cell for those points missed by both
    >therapists. The number missed is unknown. Is it still possible to estimate
    >reliability for the 3x3?

    Kappa is a test of Reliability, not Accuracy, so the short answer is YES. My longer answer is NO.

    A problem with Kappa, which I found in an entirely different context but one very similar to yours, is that Kappa can be severely understated with skewed data, that is, where there isn't an approximately even frequency distribution across ALL cells. (See: Landis, J.R. and G.G. Koch "The Measurement of Observer Agreement for Categorical Data," *Biometrics* 33 (1977) pp 159-174.)

    If you expected to see some observations categorized into your "Missing" cell, then you can assume equal expected cell sizes and calculate an Adjusted Kappa (Kappa subscript n). (See: Brennan, R.L. and D.J. Prediger "Coefficient Kappa: Some Uses, Misuses, and Alternatives," *Educational and Psychological Measurement* 41 (1981) pp 187-197.)

    If you're prepared to flout convention, and risk the eternal derision of your colleagues, then you could go to a relatively unorthodox source, for medical research anyway: Where Kappa tries to account for the possibility of "Chance Agreement," you might be more concerned with the degree of "Unreliable Disagreement" between your two sets of judgements. With this approach you can construct an "Index of Reliability." (See: Perreault, W.D. Jr. and L.E. Leigh "Reliability of Nominal Data Based on Qualitative Judgments," *Journal of Marketing Research* 26 (1989) pp 135-148.)

    Hope this helps.
    *--------
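As a rough illustration of the indices named in the reply above, the sketch below computes Cohen's kappa, the Brennan-Prediger adjusted kappa (Kappa subscript n, which replaces the marginal-based chance agreement with 1/k for k categories), and the Perreault-Leigh index of reliability for a square agreement table. The function name `agreement_indices` and the example counts are hypothetical and do not come from the posts. The unknown "missed by both" cell is simply entered as 0 here, which is an assumption made for illustration, not a recommendation.

```python
from math import sqrt


def agreement_indices(table):
    """Return agreement indices for a square k x k table of counts.

    table[i][j] = number of cases rater A put in category i and
    rater B put in category j. A structural-zero cell is entered as 0
    (an assumption of this sketch).
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Observed proportion of agreement: the diagonal cells.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Chance agreement for Cohen's kappa, from the two sets of marginals.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_p, col_p))

    # Cohen's kappa is undefined when chance agreement is already 1.
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    # Brennan-Prediger: chance agreement fixed at 1/k (equal cell sizes).
    kappa_n = (p_o - 1 / k) / (1 - 1 / k)

    # Perreault-Leigh index; defined as 0 when p_o falls below 1/k.
    i_r = sqrt((p_o - 1 / k) * k / (k - 1)) if p_o >= 1 / k else 0.0

    return {"p_o": p_o, "kappa": kappa, "kappa_n": kappa_n, "i_r": i_r}


if __name__ == "__main__":
    # Hypothetical skewed 2x2 table with an empty "both negative" cell.
    skewed = [[90, 5],
              [5, 0]]
    for name, value in agreement_indices(skewed).items():
        print(f"{name}: {value:.4f}")
```

With the skewed example table, observed agreement is 0.90, yet Cohen's kappa comes out slightly negative because the marginals alone already predict about 0.905 agreement, while Kappa subscript n is 0.80. In this formulation the Perreault-Leigh index is the square root of Kappa subscript n whenever the latter is non-negative.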
  • Comparing kappas?
  • =======================Rich Ulrich, 15 Apr 1997==========ssc
    From: wpilib+@pitt.edu (Richard F Ulrich)
    Subject: Re: comparing kappas
    Message-ID: <5j0luj$h8t@usenet.srv.cis.pitt.edu>

    Chris Penta (penta@psychiatry.uchc.edu) wrote:
    : I am a researcher at the friendly local medical school (University of
    : Connecticut). We have a data set composed of diagnoses which were made
    : on the same patients by different clinicians, according to different
    : sets of criteria. We have kappas (an accepted test for the reliability
    : of diagnostic tests) for the various criteria sets, as they were applied
    : by the diagnosing clinicians. That's easy: We give different clinicians
    : the same patient and the same test and see if they agree.
    : Now the hard part: We are hoping to compare the differences in the
    : kappas themselves to see if some diagnostic criteria sets yielded
    : significantly more reliable results. SPSS has no function for this. We

    -- The main situation to use kappas is when you have a 2x2 table, since kappas are badly affected by having different marginal frequencies.

    The reasonable way to investigate your own reliability is to look at each Dx alone, that is, to use 2x2 tables. (This is good, standard practice for research; but it might bother some people who like single summary numbers, which overwhelm the rare problem-diagnoses: the 'system' looks complete because it has a lot of categories; the 'reliability' looks good because 90% of the cases fall into the same two or three common categories; and that is why kappa is not a good multi-Dx statistic.)

    Now, I wish that someone would write a good tool for creating multiple kappas from a single table, and McNemar's as a statistic for "difference".

    If you have different diagnoses, it would USUALLY be the case that you are testing a new system that is only barely different from another system; so you should probably take the matched pairs into account. That is, your 'reliability comparison' would depend only on the cases that differed between systems. What you need to say can be rather complicated, at times. But it can be simplified at times, since it does depend on the number and kind of disagreements that you have -- for instance, it is simpler if one System is entirely a subset of another.

    * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
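One possible reading of the "multiple kappas from a single table" idea in the post above is sketched here: each diagnosis is collapsed to its own 2x2 table (this Dx versus everything else), and a kappa plus an uncorrected McNemar chi-square is reported for each. The function names, the example counts, and the choice to omit a continuity correction are all assumptions of this sketch, not something specified in the post. Note that McNemar's statistic here only tests whether the two raters use a diagnosis at different rates; it does not by itself compare two kappas.

```python
def collapse_to_2x2(table, i):
    """Collapse a k x k agreement table to category i versus all others."""
    k = len(table)
    n = sum(sum(row) for row in table)
    a = table[i][i]                                  # both raters chose i
    b = sum(table[i]) - a                            # only rater A chose i
    c = sum(table[r][i] for r in range(k)) - a       # only rater B chose i
    d = n - a - b - c                                # neither chose i
    return [[a, b], [c, d]]


def kappa_2x2(t):
    """Cohen's kappa for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = t
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")


def mcnemar_chi2(t):
    """Uncorrected McNemar chi-square (1 df) from the discordant cells."""
    b, c = t[0][1], t[1][0]
    return (b - c) ** 2 / (b + c) if (b + c) > 0 else float("nan")


def per_category_report(table):
    """Return one (category index, kappa, McNemar chi-square) per Dx."""
    report = []
    for i in range(len(table)):
        t = collapse_to_2x2(table, i)
        report.append((i, kappa_2x2(t), mcnemar_chi2(t)))
    return report


if __name__ == "__main__":
    # Hypothetical 3-diagnosis table: two common categories, one rare one.
    dx_table = [[40, 3, 2],
                [4, 30, 1],
                [1, 6, 3]]
    for i, kappa, chi2 in per_category_report(dx_table):
        print(f"Dx {i}: kappa = {kappa:.3f}, McNemar chi-square = {chi2:.3f}")
```

In the hypothetical table, overall agreement is 73 of 90 cases, which looks respectable, but the rare third diagnosis has a 2x2 kappa of only about 0.32. That is the pattern the post warns about: a single summary number is dominated by the common categories.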
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • FAQ top.
  • Ulrich home page.
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html