Well-calibrated predictions are not enough
Calibration is an important idea in statistical prediction. I'll let Nate Silver explain:
"One of the most important tests of a forecast - I would argue that it is the single most important one - is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If, over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated. If it wound up raining just 20 percent of the time instead, or 60 percent of the time, they weren’t."
- Nate Silver, The Signal and the Noise
While calibration is a desirable property in a prediction, it's not necessarily the most important one. And it's usually trivial to achieve (although, as we'll see, achieving it alone is somewhat pointless).
Consider the following example: we flip two fair coins and want to predict how many are heads. Alice gives the probabilities [0.25, 0.5, 0.25] for 0, 1, and 2 heads respectively. This is perfectly calibrated as it's the true distribution. But now Bob peeks at the first coin. If it's heads he gives probabilities [0.0, 0.5, 0.5] and if it's tails he gives probabilities [0.5, 0.5, 0.0]. This too is perfectly calibrated as it's the true conditional distribution. With both Alice and Bob being perfectly calibrated, it would seem we have to defer to other means of judging these predictions.
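To make this concrete, here is a minimal simulation sketch in Python (the setup and helper names are my own, just for illustration) that checks calibration for both forecasters: among the rounds where a forecaster assigns probability \(p\) to an outcome, that outcome should occur roughly a fraction \(p\) of the time.

```python
import numpy as np

# A minimal simulation of the two-coin example (helper names are my own).
rng = np.random.default_rng(0)
n = 100_000
coin1 = rng.integers(0, 2, n)
coin2 = rng.integers(0, 2, n)
heads = coin1 + coin2  # 0, 1, or 2 heads

# Alice always forecasts the marginal distribution [0.25, 0.5, 0.25];
# Bob peeks at the first coin and forecasts the conditional distribution.
alice = np.tile([0.25, 0.5, 0.25], (n, 1))
bob = np.where(coin1[:, None] == 1,
               np.array([0.0, 0.5, 0.5]),
               np.array([0.5, 0.5, 0.0]))

def calibration_check(preds, outcomes, name):
    # For each probability a forecaster assigns to "k heads", the outcome
    # "k heads" should occur at roughly that rate among those rounds.
    for k in range(3):
        for p in np.unique(preds[:, k]):
            mask = preds[:, k] == p
            freq = np.mean(outcomes[mask] == k)
            print(f"{name}: forecast P({k} heads) = {p:.2f}, observed {freq:.3f}")

calibration_check(alice, heads, "Alice")  # observed rates match 0.25 / 0.50 / 0.25
calibration_check(bob, heads, "Bob")      # observed rates match 0.00 / 0.50
```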
This is an example of a simple but not immediately obvious fact: there are infinitely many perfectly calibrated predictions, as any conditional distribution will achieve perfect calibration. In our example, Bob is predicting from the conditional distribution \(P(X_{1} + X_{2} \mid X_{1})\). Alice also predicts from a conditional distribution, since we can view the marginal \(P(X_{1} + X_{2})\) as just a conditional distribution where nothing is conditioned upon. We could even have Charlie flip a coin of his own and condition on it (even though it gives no additional information), and that would be perfectly calibrated too.
This shows up in practice as well. I'm currently working on an astrostatistics project focused on comparing several prediction methods. One of the tools being used to compare methods is calibration on a test set. Knowing this, I suggested that they look at the performance of the train-set marginal distribution. And, surprise, it has almost perfect calibration (never mind that it literally predicts the same outcome for every input).
So it's clear that calibration needs to be supplemented with something if we want to select between all of these potential predictions. And it's equally clear that if we're actually interested in doing something with our predictions we'd want to go with Bob over Alice. To bring this back to Nate Silver's example, a prediction which is just the long-run rate of rain every day achieves perfect calibration but doesn't help me decide whether or not to bring my umbrella.
You might say that we should pick the model which conditions on the most data. The only reason Bob beats Alice, and the weather service beats the farmer's almanac, is the availability of more information. This seems true: more information will generally improve your predictions (ideally I'd like a proof of this, but I'll assume it for now). But then this leads to questions of which data to condition upon and how well we can fit models with more and more covariates. It doesn't help me to condition upon everything if that starts to degrade my model's performance. But also, I might want to tolerate some poor calibration if it lets me condition on more covariates.
One solution is to use proper scoring rules: functions which "reward" good predictions. To compare two models we simply look at which one achieves the better score (which is so much better than staring at two slightly-off calibration curves). Examples include the Brier score and the logarithmic score.
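To see how a proper scoring rule separates our two perfectly calibrated forecasters, here is a rough sketch that scores Alice and Bob on the coin example. The conventions used (a multi-class Brier score that sums squared errors over categories, and a logarithmic score that averages the negative log probability of the realized outcome) are common choices, not the only ones.

```python
import numpy as np

# Scoring Alice and Bob on the coin example (same toy setup as above).
rng = np.random.default_rng(0)
n = 100_000
coin1 = rng.integers(0, 2, n)
heads = coin1 + rng.integers(0, 2, n)
onehot = np.eye(3)[heads]  # the realized outcome as a one-hot vector

alice = np.tile([0.25, 0.5, 0.25], (n, 1))
bob = np.where(coin1[:, None] == 1,
               np.array([0.0, 0.5, 0.5]),
               np.array([0.5, 0.5, 0.0]))

def brier(preds):
    # Multi-class Brier score: mean squared error of the probability vector.
    return np.mean(np.sum((preds - onehot) ** 2, axis=1))

def log_score(preds):
    # Logarithmic score: average negative log probability of what happened.
    p_realized = preds[np.arange(n), heads]
    return -np.mean(np.log(p_realized + 1e-12))  # guard against log(0)

for name, preds in [("Alice", alice), ("Bob", bob)]:
    print(f"{name}: Brier = {brier(preds):.3f}, log score = {log_score(preds):.3f}")
# Bob scores better on both, even though both are perfectly calibrated.
```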
Interestingly enough, calibration often shows up explicitly as a term in these scoring rules. We'll examine the Brier score specifically, but other scoring rules decompose in a similar way.
For binary responses the Brier score is defined as
\[ \text{BS}_{\hat{p}} = E[(\hat{p} - X)^{2}] \]
where \(\hat{p}\) is the predicted probability and \(X\) is the binary outcome. This extends straightforwardly to categorical responses (like our coin example), but for simplicity we'll stick to the binary case.
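As a quick worked example: if you forecast \(\hat{p} = 0.7\) and it rains (\(X = 1\)), that forecast contributes \((0.7 - 1)^{2} = 0.09\); if it stays dry (\(X = 0\)), it contributes \((0.7 - 0)^{2} = 0.49\). The Brier score averages these contributions over all forecasts, so lower is better.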
The Brier score can be decomposed as
\[ \text{BS}_{\hat{p}} = E[(\hat{p} - \pi(\hat{p}))^{2}] - E[(\pi(\hat{p}) - \bar{\pi})^{2}] + \bar{\pi}(1 - \bar{\pi}) \]
where \(\pi(\hat{p}) = P(X = 1 \mid \hat{p})\) is the actual probability of the outcome given a predicted probability and \(\bar{\pi} = P(X = 1)\) is the marginal probability of the outcome.
The first term \(E[(\hat{p} - \pi(\hat{p}))^{2}]\) is referred to as reliability: how close are you to perfect calibration? For this Alice and Bob are tied as both have perfect calibration.
The second term \(E[(\pi(\hat{p}) - \bar{\pi})^{2}]\) is referred to as resolution: how far are your predictions from the base rate? This is where Bob beats out Alice: Bob's predictions vary with what he observes, while Alice only ever predicts the base rate.
The final term \(\bar{\pi}(1 - \bar{\pi})\) is referred to as uncertainty: this is the irreducible prediction error. Given that it doesn't depend upon \(\hat{p}\) we can ignore it when comparing models.
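To see the decomposition in action, here is a rough sketch on a toy binary setup of my own invention (a covariate \(Z\) that flips the probability of rain between 0.2 and 0.8, rather than the coin example): both forecasters come out essentially perfectly calibrated, but only the one that conditions on \(Z\) has any resolution, and in both cases reliability minus resolution plus uncertainty recovers the Brier score.

```python
import numpy as np

# A toy binary setup of my own (not the coin example): a covariate Z flips
# the probability of rain between 0.2 and 0.8.
rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(float)  # did it rain?

def brier_decomposition(p_hat, x):
    # Empirical version of BS = reliability - resolution + uncertainty,
    # grouping rounds by the (discrete) predicted probability.
    bs = np.mean((p_hat - x) ** 2)
    pi_bar = np.mean(x)
    reliability = resolution = 0.0
    for p in np.unique(p_hat):
        mask = p_hat == p
        pi_p = np.mean(x[mask])      # observed frequency within this group
        weight = np.mean(mask)       # fraction of rounds in this group
        reliability += weight * (p - pi_p) ** 2
        resolution += weight * (pi_p - pi_bar) ** 2
    uncertainty = pi_bar * (1 - pi_bar)
    return bs, reliability, resolution, uncertainty

informed = np.where(z == 1, 0.8, 0.2)  # conditions on Z
marginal = np.full(n, x.mean())        # always forecasts the base rate

for name, p_hat in [("informed", informed), ("marginal", marginal)]:
    bs, rel, res, unc = brier_decomposition(p_hat, x)
    print(f"{name}: BS = {bs:.3f}, rel - res + unc = {rel - res + unc:.3f}, "
          f"reliability = {rel:.4f}, resolution = {res:.4f}")
# Both forecasters have ~zero reliability (well calibrated); only the
# informed one has nonzero resolution, and its Brier score is lower.
```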
Using proper scoring rules we can distinguish between perfectly calibrated predictions and see which is better. This encompasses calibration but also goes beyond it. We can tolerate some poor calibration if it helps improve our resolution.
This all comes back to a common thought I express (usually in our History of Stats or Sports Stats reading groups): the best way to compare models is to just figure out which one minimizes a loss function. In this case the Brier score is our loss, but we could swap it out for something else like the logarithmic loss or, ideally, a loss expressing the actual problem we're trying to solve. Properties like calibration are appealing, but at the end of the day better predictions are what we're shooting for.
Nate Silver has an interesting observation related to this: the Weather Channel's forecasts actually aren't well calibrated, as a reported 20% chance of rain results in rain only about 5% of the time. Their "loss function" reflects the fact that people get angry if you predict a low chance of rain and then it rains and ruins their plans. Their model is objectively more "wrong", but it proves to be more "useful".