I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.
I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gathering NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values. It’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.
So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.
The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you the expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).
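To make the claim about annotator accuracy concrete, here is a minimal sketch of how corpus accuracy could be computed from annotator accuracy alone. It assumes simple majority voting over an odd number of annotators whose errors are independent given the true category; the voting scheme and the name `majority_vote_accuracy` are illustrative assumptions, not something taken from the talk or from Dawid and Skene’s model.

```r
# Probability that a majority of n_annotators picks the correct binary label,
# assuming each annotator is independently correct with probability acc.
# (Majority voting and independence are assumptions made for illustration.)
majority_vote_accuracy <- function(acc, n_annotators) {
  stopifnot(n_annotators %% 2 == 1)          # odd count, so no ties
  min_correct <- (n_annotators + 1) / 2      # votes needed for a majority
  sum(dbinom(min_correct:n_annotators, size = n_annotators, prob = acc))
}

# Corpus accuracy for 80%-accurate annotators as more of them vote.
for (n in c(1, 3, 5, 9)) {
  cat(sprintf("annotators = %d, corpus accuracy = %.3f\n",
              n, majority_vote_accuracy(0.8, n)))
}
```

Under these assumptions, five annotators who are each only 80% accurate already yield entries that are about 94% accurate, and nothing in the calculation involves kappa.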
First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).
What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let accuracy run from 0 to 1 and prevalence from 0 to 1. Here’s the whole plot for comparison.
Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)
How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).
Everything else follows from the definition of kappa. The following R function computes the expected kappa for binary classification data with a given prevalence of category 1 items and a pair of annotators with the same accuracy.
kappa_fun = function(prev,acc) {
    # observed agreement: both annotators right or both wrong
    agr = acc^2 + (1 - acc)^2;
    # proportion of category 1 responses
    cat1 = acc * prev + (1 - acc) * (1 - prev);
    # agreement expected by chance
    e_agr = cat1^2 + (1 - cat1)^2;
    return((agr - e_agr) / (1 - e_agr));
}
Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

|      | Cat1 | Cat2 |
|------|------|------|
| Cat1 | 170  | 90   |
| Cat2 | 90   | 650  |
and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.
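As a check on the numbers above, here is a small sketch that rebuilds the expected table and computes Cohen’s kappa directly from it. The helper names `expected_table` and `cohen_kappa` are mine, and the table assumes the two annotators share one accuracy across both categories and err independently.

```r
# Expected 2x2 table of annotator 1 (rows) by annotator 2 (columns),
# assuming equal accuracy acc on both categories and independent errors.
expected_table <- function(prev, acc, n) {
  both1 <- prev * acc^2 + (1 - prev) * (1 - acc)^2   # both respond Cat1
  both2 <- prev * (1 - acc)^2 + (1 - prev) * acc^2   # both respond Cat2
  disagree <- acc * (1 - acc)                        # each off-diagonal cell
  n * matrix(c(both1, disagree, disagree, both2), nrow = 2,
             dimnames = list(ann1 = c("Cat1", "Cat2"),
                             ann2 = c("Cat1", "Cat2")))
}

# Cohen's kappa computed from a table of counts.
cohen_kappa <- function(tab) {
  p <- tab / sum(tab)
  agr <- sum(diag(p))                      # observed agreement
  e_agr <- sum(rowSums(p) * colSums(p))    # chance agreement from marginals
  (agr - e_agr) / (1 - e_agr)
}

tab <- expected_table(prev = 0.2, acc = 0.9, n = 1000)
print(round(tab))
print(round(cohen_kappa(tab), 2))
```

The raw agreement in this table is 0.82 and the chance agreement is about 0.62, which is where the 0.53 comes from.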
The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus the square of one minus the accuracy (both annotators are wrong and hence agree; two wrongs make a right for kappa, another of its problems).
The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus the error rate, one minus accuracy, times one minus prevalence (true category is negative, wrong response).
Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the response proportions. So that’s just the probability of a category 1 response squared (both annotators respond category 1) plus the probability of a category 2 response (1 minus the probability of a category 1 response) squared.
Finally, I return the kappa value itself, which is defined as usual.
Back to the plot. The white border is set at 0.66, the lower-end threshold established by Krippendorff for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorff was 0.8, and is also indicated on the legend.
In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem, especially if prevalence is high, because as prevalence goes up from 0.5, kappa goes down.
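To put rough numbers on “well more than 90%”, here is a sketch that solves for the accuracy needed to hit a target kappa at a given prevalence. It repeats the expected-kappa formula from this post so that it runs on its own; `required_accuracy` is a name I made up, and the root-finding via `uniroot` is just one convenient way to invert the formula.

```r
# Expected-kappa formula from the post, repeated so the sketch is self-contained.
kappa_fun <- function(prev, acc) {
  agr <- acc^2 + (1 - acc)^2
  cat1 <- acc * prev + (1 - acc) * (1 - prev)
  e_agr <- cat1^2 + (1 - cat1)^2
  (agr - e_agr) / (1 - e_agr)
}

# Accuracy in (0.5, 1) at which expected kappa equals target, for 0 < prev < 1.
# Kappa is 0 at acc = 0.5 and approaches 1 as acc approaches 1, so a root exists.
required_accuracy <- function(prev, target) {
  uniroot(function(acc) kappa_fun(prev, acc) - target,
          lower = 0.5, upper = 1 - 1e-9, tol = 1e-10)$root
}

for (prev in c(0.5, 0.7, 0.9)) {
  cat(sprintf("prevalence = %.1f: kappa 0.66 needs acc %.3f, kappa 0.8 needs acc %.3f\n",
              prev, required_accuracy(prev, 0.66), required_accuracy(prev, 0.8)))
}
```

At prevalence 0.5 the formula reduces to kappa = (2 * acc - 1)^2, so a kappa of 0.8 already requires accuracy of roughly 0.947, and the requirement only rises as prevalence moves toward 1.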
I hope this demonstrates why having a high kappa is not necessary.
I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).
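For what it’s worth, here is a minimal sketch of that kind of inspection in R. The ten labels are made up purely for illustration, and the variable names are hypothetical.

```r
# Hypothetical labels from two annotators on the same ten items.
ann1 <- c("pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg")
ann2 <- c("pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos")

# Fix the category order so rows and columns line up.
cats <- c("pos", "neg")
tab <- table(ann1 = factor(ann1, levels = cats),
             ann2 = factor(ann2, levels = cats))
print(tab)

# Raw agreement is the mass on the diagonal.
agreement <- sum(diag(tab)) / sum(tab)
cat(sprintf("agreement = %.2f\n", agreement))

# Relative bias shows up as unequal marginals (asymmetric off-diagonal cells).
print(rowSums(tab))   # how often annotator 1 used each category
print(colSums(tab))   # how often annotator 2 used each category
```

In this toy data the annotators agree on 7 of 10 items, and annotator 2 says “pos” more often than annotator 1 (5 times versus 4), which is exactly the kind of asymmetry a single kappa number hides.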
In case anyone’s interested, here’s the R code I then used to generate the fancy plot:
# evaluate kappa_fun on a (K + 1) x (K + 1) grid of prevalence and accuracy
pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
    for (n in 1:(K + 1)) {
        prevalence[pos] = (m - 1) / K;
        accuracy[pos] = (n - 1) / K;
        kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
        pos = pos + 1;
    }
}
library("ggplot2");
df = data.frame(prevalence=prevalence,
                accuracy=accuracy,
                kappa=kappa);
# heat map of kappa, zoomed to the upper-right corner of the grid
kappa_plot =
    ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
    labs(title = "Kappas for Binary Classification\n") +
    geom_tile() +
    scale_x_continuous(expand=c(0,0),
                       breaks=c(0,0.25,0.5,0.75,1),
                       limits=c(0.5,1)) +
    scale_y_continuous(expand=c(0,0),
                       breaks=seq(0,1,0.1),
                       limits=c(0.85,1)) +
    # white at the 0.66 threshold, orange below, blue above
    scale_fill_gradient2(name="kappa", limits=c(0,1), midpoint=0.66,
                         low="orange", mid="white", high="blue",
                         breaks=c(1,0.8,0.66,0));
# display the plot (needed when the script is run non-interactively)
print(kappa_plot);