Brier score and prediction accuracy: A practical guide

Joshua Ravington
April 25, 2026

When a weather model tells you there’s a 90% chance of rain and it stays dry, how bad was that forecast, really? Turns out, there’s a precise way to find out. The Brier score has been quietly doing this job since Glenn Brier introduced it in 1950, and it remains one of the most widely used metrics in fields from meteorology to machine learning.

In prediction markets, median Brier scores are often reported well below 0.125, a commonly cited threshold for a "good" probabilistic forecast, which signals strong calibration and information aggregation. For traders, this means the score is not just an academic metric but a practical way to judge how well the crowd prices information.

In this article, we’ll cover:

  • What the Brier score is and how its formula works
  • How it stacks up against other common metrics
  • Its real-world applications, from weather forecasting to prediction markets
  • The most common misconceptions that trip people up

Definition of the Brier score

The Brier score measures the mean squared difference between predicted probabilities and actual outcomes. For a binary outcome, it looks like this:

BS = (1/n) × Σ(fₜ − oₜ)²

Where fₜ is the predicted probability and oₜ is the actual outcome (1 if the event happened, 0 if it didn’t). The score ranges from 0 to 1 with 0 being a perfect score and 1 a terrible one. A completely uninformative model that always predicts 0.5 scores 0.25, which acts as a baseline to beat.
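
As a rough illustration of the formula in code, here is a minimal Python sketch (the function name brier_score is just a placeholder, not taken from any particular library):

    def brier_score(forecasts, outcomes):
        """Mean squared difference between predicted probabilities and 0/1 outcomes."""
        n = len(forecasts)
        return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

    # A model that always predicts 0.5 lands exactly on the 0.25 baseline.
    print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25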

Prediction accuracy of the Brier score

Now that the math is out of the way, here’s what makes the Brier score genuinely useful. Because it squares the error, it punishes overconfident wrong predictions far more harshly than cautious ones. If a model says there’s a 95% chance of an event and it doesn’t happen, that prediction racks up a penalty of (0.95 − 0)² = 0.9025. Compare that to a hedged 60% prediction on the same miss: (0.60 − 0)² = 0.36, less than half the damage.
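
The arithmetic above fits in a couple of lines of Python (same hypothetical numbers as in the text):

    # Both forecasters miss the same event (outcome = 0), but the overconfident
    # one pays more than twice the squared-error penalty of the hedged one.
    overconfident = (0.95 - 0) ** 2   # 0.9025
    hedged = (0.60 - 0) ** 2          # 0.36
    print(overconfident, hedged)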

This property pushes forecasters to be honest about uncertainty rather than doubling down on confident guesses. It rewards the kind of calibration that actually helps decision-makers act on forecasts.

Brier score vs. other accuracy metrics

Now it’s worth sizing the Brier score up against the alternatives. Below is a brief comparison of popular metrics, followed by a small numerical sketch:

  • Simple accuracy rate only cares whether the prediction was above or below 50%; it throws away all the probability information.
  • Log loss is similar in spirit to the Brier score but penalizes extreme wrong predictions much more aggressively; the penalty grows without bound as a wrong forecast approaches certainty, which can destabilize model training.
  • AUC-ROC evaluates how well a model ranks outcomes but says nothing about whether the probabilities themselves are well-calibrated.
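
To make the contrast concrete, here is a small pure-Python sketch scoring the same two wrong, hypothetical forecasts under accuracy, the Brier score, and log loss (AUC-ROC is omitted because it needs a ranking over many examples):

    import math

    forecasts = [0.60, 0.99]   # two "yes" forecasts, one hedged, one very confident
    outcomes  = [0, 0]         # ...that both turn out to be wrong

    # Simple accuracy: both count as the same kind of miss.
    accuracy = sum((f > 0.5) == bool(o) for f, o in zip(forecasts, outcomes)) / 2

    # Brier score: the 0.99 miss costs 0.9801, but the penalty is bounded by 1.
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / 2

    # Log loss: the 0.99 miss alone contributes -log(0.01) ≈ 4.6 and grows
    # without bound as the forecast approaches 1.
    log_loss = -sum(o * math.log(f) + (1 - o) * math.log(1 - f)
                    for f, o in zip(forecasts, outcomes)) / 2

    print(accuracy, brier, log_loss)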

The Brier score hits a practical sweet spot: it’s sensitive to calibration, easy to interpret, and doesn’t blow up numerically. That said, it’s not always the right tool; more on that below.

Real-world applications

The Brier score shows up across a surprisingly wide range of fields, and each brings out different strengths of the metric.

In weather forecasting, meteorological agencies have used it for decades to track down systematic errors in their models and hold forecasters accountable to probabilistic standards.

In clinical medicine, risk models for sepsis, cardiac events, and surgical complications are routinely evaluated with the Brier score. A well-calibrated model can mean the difference between a patient receiving the right level of care and a patient being overlooked.

In machine learning, it’s a go-to evaluation metric whenever the output is a probability, particularly in binary classification tasks where you care about more than just which class wins.

In prediction markets, platforms may use the Brier score to rank forecasters over time. It levels the playing field by rewarding participants who assign well-reasoned probabilities rather than those who simply guess the direction. The score also helps aggregate crowd-sourced forecasts, filtering out overconfident contributors who might otherwise skew the consensus.
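
As a hedged sketch of how such a ranking could work, here is a minimal example; the forecaster names and records are invented, and no particular platform’s scoring method is implied:

    # Each forecaster's record: list of (predicted probability, actual outcome) pairs.
    records = {
        "calibrated": [(0.7, 1), (0.3, 0), (0.8, 1), (0.2, 0)],
        "lucky_guesser": [(1.0, 1), (1.0, 0), (1.0, 1), (1.0, 1)],
    }

    def mean_brier(pairs):
        return sum((f - o) ** 2 for f, o in pairs) / len(pairs)

    # Lower mean Brier score ranks higher; the calibrated forecaster wins
    # despite the lucky guesser getting more outcomes "right".
    for name, pairs in sorted(records.items(), key=lambda kv: mean_brier(kv[1])):
        print(name, round(mean_brier(pairs), 3))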

Common misconceptions

Even experienced practitioners get tripped up by a few misunderstandings. First, a low Brier score doesn’t always mean a model is useful. If an event is very rare (say, a 1% base rate), a model that always predicts 0.01 can score impressively without conveying any information. This is the base rate problem, and it catches people off guard.
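
A quick sketch of the base-rate trap (the 1% rate here is purely illustrative):

    # 1,000 events, only 10 of which occur (a 1% base rate).
    outcomes = [1] * 10 + [0] * 990

    # A "model" that always predicts 0.01 and never distinguishes anything.
    constant_forecasts = [0.01] * len(outcomes)

    brier = sum((f - o) ** 2 for f, o in zip(constant_forecasts, outcomes)) / len(outcomes)
    print(brier)  # ≈ 0.0099 — looks excellent, yet conveys no information at all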

Second, many assume the Brier score works equally well for multi-class problems. It can be extended, but the interpretation gets more complex and isn’t always a drop-in replacement.
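
For reference, one common multi-class extension sums the squared errors across all classes of each prediction; a minimal sketch:

    def multiclass_brier(forecasts, outcomes):
        """forecasts: list of probability vectors; outcomes: list of one-hot vectors."""
        n = len(forecasts)
        return sum(sum((f - o) ** 2 for f, o in zip(fv, ov))
                   for fv, ov in zip(forecasts, outcomes)) / n

    # Three-class example where the true class is the second one.
    print(multiclass_brier([[0.2, 0.7, 0.1]], [[0, 1, 0]]))  # 0.04 + 0.09 + 0.01 = 0.14

Note that under this summation convention a binary problem scores exactly twice what the two-term formula above gives, which is part of why the multi-class version isn’t a drop-in replacement.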

Third, it’s tempting to compare Brier scores across datasets as if they’re on a universal scale. They’re not: a score of 0.10 in one domain can mean something very different from 0.10 in another, since the difficulty of the task varies.

Conclusion

The Brier score won’t solve every forecasting challenge, but it brings something genuinely valuable to the table: a single, honest number that reflects not just whether a model got it right, but how confidently it got it right or wrong.

If you participate in prediction markets, the Brier score is worth adding to your toolkit, along with a clear sense of exactly what it is, and isn’t, telling you.

Frequently asked questions

What is a good Brier score? 

As a general rule of thumb, anything below 0.25 beats a completely uninformative model. Scores closer to 0 indicate strong predictive accuracy.

Is a lower Brier score always better?

Yes. The Brier score ranges from 0 to 1, where 0 is a perfect forecast and 1 is the worst possible.

How is the Brier score different from accuracy? 

Standard accuracy only cares whether your prediction landed above or below 50%. The Brier score takes the full probability into account, so a confident wrong prediction is penalized far more than a cautious one.

Can the Brier score be used for multi-class problems?

Yes, but it needs to be extended to handle multiple outcome categories. The multi-class version sums the squared errors across all classes for each prediction.

Why does the Brier score matter in prediction markets?

The Brier score holds traders accountable to their confidence levels, not just their outcomes. It filters out lucky guessers and consistently rewards forecasters who are genuinely well-calibrated over time.

Does a good Brier score mean my model is useful? 

Not necessarily. If the event you’re predicting is very rare, a model that always predicts a low probability can score well without adding any real value. Always benchmark your Brier score against the base rate of the outcome you’re forecasting.
