Statistical tests for agreement

From Ganfyd

This is a sub-page of Medical statistics.

The kappa (κ) test is a test of agreement, e.g. between experts or between sphygmomanometers. The resulting κ value is also known as Cohen's kappa coefficient, named after Jacob Cohen, the original author.[1]

For example, two radiologists' assessments of the same chest X-rays (CXRs):

Illustration of the kappa coefficient

                              Radiologist B
Radiologist A      Normal  Benign  Suspected cancer  Cancer  Total
Normal                 21      12                 0       0     33
Benign                  4      17                 1       0     22
Suspected cancer        3       9                15       2     29
Cancer                  0       0                 0       1      1
Total                  28      38                16       3     85

The actual agreements were 21, 17, 15, and 1: total 54 out of 85 = 0.64 (64%) of films. As a simple percentage, there is 64% agreement.
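The simple percentage agreement can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the table is typed in by hand with radiologist A as rows and radiologist B as columns.

```python
# Agreement table from the example. Rows are radiologist A, columns are
# radiologist B, both in the order normal, benign, suspected cancer, cancer.
table = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]

n = sum(sum(row) for row in table)  # total number of films
# Agreements lie on the diagonal, where both radiologists chose the same category.
agreements = sum(table[i][i] for i in range(len(table)))
observed = agreements / n

print(agreements, n, round(observed, 2))
```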

However, the kappa calculation also takes into account the agreement expected by chance. This is seen as a weakness of the test as it produces a more conservative measure of agreement. The chance agreement is calculated for each category. By way of explanation, radiologist A classified 33 out of 85 films as normal and radiologist B classified 28 out of 85 as normal. By chance alone, the probability of a normal result is 33/85=0.39 for radiologist A and 28/85=0.33 for radiologist B. The probability that both call the same film normal by chance is the product of these two probabilities, i.e. 0.39*0.33=0.128. This calculation is done for each category and the results are added together at the end.

It is sometimes easier to calculate this figure using [row total] * [column total] / [grand total] for each category, adding the values and dividing by the grand total only at the end. Each value is the expected frequency of agreement by chance, i.e. for the normal category, 10.87 cases are expected to agree by chance.

Normal = 33*28/85 = 10.87
Benign = 22*38/85 = 9.84
Suspected cancer = 29*16/85 = 5.46
Cancer = 1*3/85 = 0.04
Total = 26.20
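The expected frequencies listed above can be reproduced from the row and column totals alone. The snippet below is a sketch with illustrative variable names, using the totals from the example table.

```python
categories = ["Normal", "Benign", "Suspected cancer", "Cancer"]
row_totals = [33, 22, 29, 1]  # radiologist A
col_totals = [28, 38, 16, 3]  # radiologist B
grand_total = 85

# [row total] * [column total] / [grand total] for each category
expected = [r * c / grand_total for r, c in zip(row_totals, col_totals)]
for name, value in zip(categories, expected):
    print(f"{name} = {value:.2f}")

total_expected = sum(expected)
# Divide by the grand total only at the end to get a proportion.
chance_agreement = total_expected / grand_total
print(f"Total = {total_expected:.2f}")
print(f"Chance agreement = {chance_agreement:.2f}")
```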

26.20/85 = 0.31: agreement by chance would be expected in 31% of the films. The maximum possible agreement is 1, so the radiologists' scope for doing better than chance is 1.00 - 0.31 = 0.69.

\kappa = \frac{(actual\ agreement - agreement\ expected\ by\ chance)}{(scope\ for\ doing\ better\ than\ by\ chance)}
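As a worked check, substituting the figures from the radiologist example into this formula gives the following. The first line uses the rounded proportions quoted in the text; the second carries the unrounded counts through, which is why the two results differ slightly in the second decimal place.

```latex
\kappa = \frac{0.64 - 0.31}{1.00 - 0.31} = \frac{0.33}{0.69} \approx 0.48

\kappa = \frac{54/85 - 26.20/85}{1 - 26.20/85} = \frac{54 - 26.20}{85 - 26.20} = \frac{27.80}{58.80} \approx 0.47
```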

For more observers, the process is repeated for each observer and category.


Mathematical notation

In mathematical terms, if there are n observations in g categories, then the observed proportional agreement is given by

p_{o}={\sum_{i=1}^{g}}f_{ii} / n (where f_{ii} = the number of agreements for category i)

The expected proportion of agreements by chance is given by:

p_{e}={\sum_{i=1}^{g}}r_{i}c_{i} / n^{2} (where r_{i} and c_{i} are the row and column totals for the ith category)

The index of agreement, kappa, is given by:

\kappa = \frac{p_{o}-p_{e}}{1-p_{e}}

where p_{o} = the proportion of agreement observed and p_{e} = the proportion of agreement expected by chance.
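The notation translates fairly directly into code. The function below is a minimal sketch, not a reference implementation: the name cohen_kappa is illustrative, and it assumes a square table of counts with the same category order for rows and columns.

```python
def cohen_kappa(table):
    """Return (p_o, p_e, kappa) for a square table of counts f[i][j]."""
    g = len(table)                      # number of categories
    n = sum(sum(row) for row in table)  # number of observations
    row_totals = [sum(table[i]) for i in range(g)]
    col_totals = [sum(table[i][j] for i in range(g)) for j in range(g)]

    p_o = sum(table[i][i] for i in range(g)) / n  # observed agreement
    p_e = sum(row_totals[i] * col_totals[i] for i in range(g)) / n ** 2  # chance
    # Undefined (division by zero) if p_e == 1, i.e. all counts in one category.
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Radiologist example: rows are radiologist A, columns are radiologist B.
table = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]
p_o, p_e, kappa = cohen_kappa(table)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")
# prints: p_o = 0.64, p_e = 0.31, kappa = 0.47
```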

Interpretation of kappa:

Agreement is generally split into categories. Of note, the cut-offs are arbitrary with no particular evidence behind them.[2]

Value of κ     Strength of agreement
≤0.20          Poor
0.21-0.40      Fair
0.41-0.60      Moderate
0.61-0.80      Good/Substantial
0.81-1.00      Very good/Almost perfect
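The table can be expressed as a small lookup. This is a sketch under one assumption not stated in the table: values falling between the printed bands (e.g. 0.205) are assigned by the upper limit of each band. The function name is illustrative.

```python
def strength_of_agreement(kappa):
    """Map a kappa value to the descriptive label in the table above."""
    # Assumption: bands are treated as contiguous, split at their upper limits.
    bands = [
        (0.20, "Poor"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Good/Substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Very good/Almost perfect"


print(strength_of_agreement(0.47))  # radiologist example
```

For the radiologist example (κ of about 0.47) this returns "Moderate". As the cut-offs are arbitrary, the label is a convenience rather than a conclusion.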

Confidence intervals and standard error for kappa

Confidence intervals (C.I.s) and a standard error can be calculated for κ, but their use is limited.

The approximate standard error of κ is:

se(\kappa)=\sqrt{{p_{o}(1-p_{o})} \over {n(1-p_{e})^{2}}}
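Applied to the radiologist example, the formula can be evaluated as below. This is a sketch: the function name is illustrative, and the 95% interval uses the usual normal approximation (κ ± 1.96 × se), which is an assumption added here rather than something stated in the text.

```python
import math


def kappa_standard_error(p_o, p_e, n):
    """Approximate standard error of kappa, as in the formula above."""
    return math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))


n = 85
p_o = 54 / n     # observed agreement
p_e = 26.20 / n  # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)
se = kappa_standard_error(p_o, p_e, n)

# Assumption: approximate 95% confidence interval via the normal approximation.
lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(f"kappa = {kappa:.2f}, se = {se:.3f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With only 85 films the interval is wide, spanning more than one band of the interpretation table.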

Weighted kappa

Weighted kappa is obtained by giving weights to the frequencies in each cell of the table according to their distance from the diagonal that indicates agreement.[3] This approach is helpful where the categories are ordered. In the example above, 'suspected cancer' and 'cancer' are likely to result in similar management, yet unweighted kappa counts a disagreement between them in the same way as a disagreement between 'normal' and 'cancer'. With weighting, the former disagreement is treated as smaller than the latter.

There are two ways of weighting: linear and quadratic. Linear weighting is preferable where moving from any category to the next counts as an equally significant step, e.g. a difference between categories 1 and 2 is as significant as a difference between categories 3 and 4. The quadratic approach is better where the distinction between categories at either end is less significant (e.g. clear fail, borderline fail, borderline pass, pass, good pass, excellent pass, distinction).
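The two weighting schemes can be sketched as follows. The text does not give the weights explicitly, so the commonly used forms are assumed here: 1 - |i-j|/(g-1) for linear and 1 - (i-j)^2/(g-1)^2 for quadratic, where g is the number of categories. The function name and its weighting argument are illustrative.

```python
def weighted_kappa(table, weighting="linear"):
    """Weighted kappa for a square table of counts with ordered categories."""
    g = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(g)) for j in range(g)]

    def weight(i, j):
        # Assumed weights: 1 on the diagonal, falling to 0 at the far corners.
        distance = abs(i - j) / (g - 1)
        if weighting == "linear":
            return 1 - distance
        if weighting == "quadratic":
            return 1 - distance ** 2
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    cells = [(i, j) for i in range(g) for j in range(g)]
    # Weighted observed and chance-expected agreement over every cell.
    p_o = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    p_e = sum(weight(i, j) * row_totals[i] * col_totals[j] for i, j in cells) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Radiologist example: rows are radiologist A, columns are radiologist B.
table = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]
linear = weighted_kappa(table, "linear")
quadratic = weighted_kappa(table, "quadratic")
print(f"linear = {linear:.2f}, quadratic = {quadratic:.2f}")
```

Both weighted values for the radiologist table come out higher than the unweighted κ, because most disagreements in that table are between adjacent categories.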

For details, see a textbook such as Practical Statistics for Medical Research by Douglas G Altman (published by Chapman & Hall).

