Kappa - problems (1997)
Miscellaneous problems: Structural zero. Extreme counts.
Paired comparisons? (Also, see "ROC" notes on sensitivity,
specificity.)
Kappa and structural zero
=======================Hume Winzar, 07 Jul 1993==========sms
Subject: Re: Cohen's kappa & structural zero
Message-ID:
In article <1993Jul6.150449.1@lure.latrobe.edu.au> ortihs@lure.latrobe.edu.au writes:
>My question concerns the use of Cohen's Kappa in designs with structural zeros
>(missing really). The study was designed to assess inter-tester reliability
>in the diagnosis of the presence and/or type of trigger points in back muscles.
>The 3x3 design inevitably contains a cell for those points missed by both
>therapists. The number missed is unknown. Is it still possible to estimate
>reliability for the 3x3?
Kappa is a test of Reliability, not Accuracy, so the short answer is YES.
My longer answer is NO.
A problem with Kappa, which I found in an entirely different context but
one very similar to yours, is that Kappa can be severely understated with
skewed data, that is, where the frequency distribution is not
approximately even across ALL cells.
(See: Landis, J.R. and G.G. Koch "The Measurement of Observer Agreement for
Categorical Data," *Biometrics* 33 (1977) pp 159-174.)
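To make the point about skewed data concrete, here is a minimal Python sketch (not from the original post). The function name cohen_kappa and both 2x2 tables are made up for illustration; the two tables have the same 90% raw agreement but different marginal distributions.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater 1, columns: rater 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: sum over categories of (row marginal * column marginal).
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical data: both tables have 90 of 100 cases on the diagonal.
balanced = [[45, 5], [5, 45]]   # marginals 50/50
skewed = [[85, 5], [5, 5]]      # marginals 90/10

print(round(cohen_kappa(balanced), 3))  # 0.8
print(round(cohen_kappa(skewed), 3))    # 0.444
```

With identical raw agreement, kappa drops from 0.8 to about 0.44 only because the expected chance agreement rises from 0.50 to 0.82 under the skewed marginals.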
If you expected to see some observations categorized into your "Missing"
cell, then you can assume equal expected cell sizes and calculate an
Adjusted Kappa (Kappa subscript n).
(See: Brennan, R.L. and D.J. Prediger "Coefficient Kappa: Some Uses,
Misuses, and Alternatives," *Educational and Psychological Measurement* 41
(1981) pp 687-699.)
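As a rough sketch of the adjusted coefficient (my reading of Brennan and Prediger, not code from the post): chance agreement is fixed at 1/k for k categories instead of being computed from the observed marginals. The function name kappa_n and the example table are hypothetical, and the table is assumed to be complete; with an unknown "missed by both" cell you would first have to decide what count to put there.

```python
def kappa_n(table):
    """Brennan-Prediger style kappa: chance agreement assumed to be 1/k."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Equal expected use of the k categories gives chance agreement 1/k.
    p_chance = 1 / k
    return (p_obs - p_chance) / (1 - p_chance)


# Hypothetical skewed 2x2 table with 90% raw agreement.
skewed = [[85, 5], [5, 5]]
print(round(kappa_n(skewed), 3))  # 0.8
```

Because the marginals no longer enter the chance term, the coefficient depends only on the raw agreement and the number of categories. Whether the equal-use assumption is defensible for a given study is a separate question.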
If you're prepared to flout convention, and risk the eternal derision of
your colleagues, then you could go to a relatively unorthodox source,
for medical research anyway: Where Kappa tries to account for the
possibility of "Chance Agreement," you might be more concerned with the
degree of "Unreliable Disagreement" between your two sets of judgements.
With this approach you can construct an "Index of Reliability."
(See: Perreault, W.D. Jr. and L.E. Leigh "Reliability of Nominal Data
Based on Qualitative Judgments," *Journal of Marketing Research* 26 (1989)
pp 135-148.)
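For orientation only, the index is usually quoted as I_r = sqrt((F_o/N - 1/k) * k/(k - 1)) when F_o/N is at least 1/k, and 0 otherwise, where F_o is the number of agreements, N the number of cases, and k the number of categories. That is my recollection of the formula and should be checked against the paper; the function name and the numbers below are hypothetical.

```python
from math import sqrt


def index_of_reliability(n_agree, n_total, k):
    """Perreault-Leigh style index of reliability, as commonly quoted.

    n_agree: number of cases on which the two judges agree (F_o)
    n_total: total number of cases judged (N)
    k:       number of categories
    """
    p_obs = n_agree / n_total
    if p_obs < 1 / k:
        # Agreement below the 1/k floor is treated as zero reliability.
        return 0.0
    return sqrt((p_obs - 1 / k) * k / (k - 1))


# Hypothetical example: 90 agreements in 100 cases, 2 categories.
print(round(index_of_reliability(90, 100, 2), 3))  # 0.894
```

Like the adjusted kappa, this uses only the raw agreement and the number of categories, so it does not shrink when the marginals are skewed.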
Hope this helps.
*--------
Comparing kappas?
=======================Rich Ulrich, 15 Apr 1997==========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: comparing kappas
Message-ID: <5j0luj$h8t@usenet.srv.cis.pitt.edu>
Chris Penta (penta@psychiatry.uchc.edu) wrote:
: I am a researcher at the friendly local medical school (University of
: Connecticut). We have a data set composed of diagnoses which were made
: on the same patients by different clinicians, according to different
: sets of criteria. We have kappas (an accepted test for the reliability
: of diagnostic tests) for the various criteria sets, as they were applied
: by the diagnosing clinicians. That's easy: We give different clinicians
: the same patient and the same test and see if they agree.
: Now the hard part: We are hoping to compare the differences in the
: kappas themselves to see if some diagnostic criteria sets yielded
: significantly more reliable results. SPSS has no function for this. We
-- The main situation to use kappas is when you have a 2x2 table,
since kappas are badly affected by having different marginal frequencies.
The reasonable way to investigate your own reliability is to look
at each Dx alone, that is, to use 2x2 tables. (This is good, standard
practice for research; but it might bother some people who like single
summary numbers, which overwhelm the rare problem-diagnoses: the
'system' looks complete because it has a lot of categories; the
'reliability' looks good because 90% of the cases fall into the same
two or three common categories; and that is why kappa is not a
good multi-Dx statistic.)
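A small Python sketch of the "each Dx alone" approach, under assumed data: a multi-category table is collapsed to one 2x2 table per diagnosis (this Dx versus everything else), and kappa is computed on each. The function names and the 3x3 counts are invented for illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)


def collapse_to_2x2(table, c):
    """Collapse a k x k table to 'category c' versus 'all other categories'."""
    k = len(table)
    n = sum(sum(row) for row in table)
    both = table[c][c]                                       # both raters chose c
    row_only = sum(table[c]) - both                          # only rater 1 chose c
    col_only = sum(table[i][c] for i in range(k)) - both     # only rater 2 chose c
    neither = n - both - row_only - col_only                 # neither chose c
    return [[both, row_only], [col_only, neither]]


# Hypothetical table: two common diagnoses and one rare one (index 2).
table = [[50, 2, 1],
         [3, 40, 0],
         [2, 1, 1]]

print("overall:", round(cohen_kappa(table), 2))
for c in range(len(table)):
    print("Dx", c, ":", round(cohen_kappa(collapse_to_2x2(table, c)), 2))
```

On these made-up counts the overall kappa is about 0.83, while the kappa for the rare diagnosis alone is about 0.32, which is the pattern described above: the common categories carry the summary number.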
Now, I wish that someone would write a good tool for creating multiple
kappas from a single table, with McNemar's test as the statistic for
"difference".
If you have different diagnoses, it would USUALLY be the case that
you are testing a new system that is only barely different from another
system; so you should probably take the matched pairs into account.
That is, your 'reliability comparison' would depend only on the
cases that differed between systems.
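One possible reading of the matched-pairs idea, sketched in Python under assumptions of my own: for each patient, record whether the two clinicians agreed under System A and whether they agreed under System B, then apply an exact McNemar test to the discordant patients only. The function names and the data are hypothetical, and this tests a difference in raw agreement rates, not a difference in kappas as such.

```python
from math import comb


def discordant_counts(agree_a, agree_b):
    """Count patients where the raters agreed under one system but not the other."""
    b = sum(1 for a, bb in zip(agree_a, agree_b) if a and not bb)  # agreed under A only
    c = sum(1 for a, bb in zip(agree_a, agree_b) if bb and not a)  # agreed under B only
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, each discordant pair falls either way with probability 1/2.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-patient agreement indicators (True = clinicians agreed).
agree_a = [True] * 80 + [True] * 10 + [False] * 2 + [False] * 8
agree_b = [True] * 80 + [False] * 10 + [True] * 2 + [False] * 8

b, c = discordant_counts(agree_a, agree_b)
print(b, c, round(mcnemar_exact(b, c), 4))
```

The 88 patients on whom the two systems behaved the same way (agreement under both, or disagreement under both) drop out of the test, so only the 12 discordant patients contribute.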
What you need to say can be rather complicated, at times. But it can
be simplified at times, since it does depend on the number and kind of
disagreements that you have -- for instance, it is simpler if one
System is entirely a subset of another.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html