Ch 15: Evaluation of performance

Brier score: difficult interpretation?

The Brier score is a composite measure of discrimination and calibration that summarizes the performance of a prediction model. It is defined as the squared distance between the observed outcome (0/1) and the prediction (a probability between 0 and 1). A perfect model would have distances of zero, and hence a Brier score of zero; most models are far from perfect. A problem is that the Brier score depends on the event rate (number of events / number of subjects). With an event rate of 0.5, the maximum Brier score, obtained with a non-informative model that predicts 0.5 for everyone, is 0.5^2 = 0.25. Gary Weissman has an excellent post on the interpretation of the Brier score.

Brier = 1/N × Σ (yi − ŷi)²

where N is the number of observations, yi is the observed outcome, either 0 or 1, and ŷi is the prediction, a probability, for the ith observation.
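To illustrate the dependence on the event rate, a small sketch in R (with hypothetical outcome vectors, not data from the chapter) computes the Brier score of a non-informative model that predicts the event rate for everyone:

```r
# Brier score: mean squared distance between outcome and prediction
brier_score <- function(obs, pred) { mean((obs - pred)^2) }

# Hypothetical outcome vectors with event rates 0.5 and 0.1
obs_50 <- rep(c(0, 1), each = 50)   # event rate 0.5
obs_10 <- rep(c(0, 1), c(90, 10))   # event rate 0.1

# Non-informative model: predict the event rate for everyone
brier_score(obs_50, mean(obs_50))   # 0.25 = 0.5 * 0.5
brier_score(obs_10, mean(obs_10))   # ~0.09 = 0.1 * 0.9
```

The same "useless" model thus yields very different Brier scores at different event rates, which motivates the scaled version below.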

The scaled Brier score

As a reference, we need the maximum Brier score, which we can then use to compute a scaled Brier score. Three formulations of this maximum turn out to be equivalent: for a non-informative model that predicts the mean observed event rate ȳ for everyone, the maximum score equals ȳ(1 − ȳ). The intuition may be clearest for the first formulation, as discussed at Twitter.

R code for (scaled) Brier score calculation

# The Brier score
# obs: 0/1 outcome y; pred: prediction, a probability p̂
brier_score <- function(obs, pred) {
  mean((obs - pred)^2)
}

# Three variants of the scaled Brier score;
# mean(obs) is the observed event rate ȳ
scaled_brier_score_1 <- function(obs, pred) {
  1 - brier_score(obs, pred) / brier_score(obs, mean(obs))
}

scaled_brier_score_2 <- function(obs, pred) {
  1 - brier_score(obs, pred) / (mean(obs) * (1 - mean(obs)))
}

scaled_brier_score_3 <- function(obs, pred) {
  1 - brier_score(obs, pred) /
    (mean(obs) * (1 - mean(obs))^2 + (1 - mean(obs)) * mean(obs)^2)
}

# These variants give identical results
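As a quick numerical check, the three variants can be verified to agree on simulated data (a self-contained sketch; the data and variable names are illustrative only):

```r
# Simulate hypothetical predictions and outcomes drawn from them
set.seed(42)
n    <- 1000
pred <- plogis(rnorm(n))        # hypothetical predicted probabilities
obs  <- rbinom(n, 1, pred)      # outcomes generated from those probabilities

brier_score <- function(obs, pred) mean((obs - pred)^2)

# The three scalings of the Brier score
scaled_1 <- 1 - brier_score(obs, pred) / brier_score(obs, mean(obs))
scaled_2 <- 1 - brier_score(obs, pred) / (mean(obs) * (1 - mean(obs)))
scaled_3 <- 1 - brier_score(obs, pred) /
  (mean(obs) * (1 - mean(obs))^2 + (1 - mean(obs)) * mean(obs)^2)

all.equal(scaled_1, scaled_2)   # TRUE
all.equal(scaled_2, scaled_3)   # TRUE
```

The agreement is no accident: for a 0/1 outcome, mean((obs − ȳ)²) = ȳ(1 − ȳ) = ȳ(1 − ȳ)² + (1 − ȳ)ȳ², so all three denominators are the same quantity.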

Traditional and modern assessment of performance

Traditional performance assessment includes calibration and discrimination. Modern assessment includes Net Benefit, which expresses whether using a model to guide decisions does more good than harm.
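Net Benefit at a decision threshold pt weighs true positives against false positives, with the threshold odds pt/(1 − pt) as the exchange rate. A minimal sketch in R, using the standard formula (the function and variable names are illustrative, not code from this chapter):

```r
# Net Benefit of treating patients with predicted risk >= pt:
# NB = TP/n - FP/n * pt / (1 - pt)
net_benefit <- function(obs, pred, pt) {
  treat <- pred >= pt
  tp <- sum(treat & obs == 1)   # true positives: treated, with event
  fp <- sum(treat & obs == 0)   # false positives: treated, no event
  n  <- length(obs)
  tp / n - fp / n * pt / (1 - pt)
}

# Reference strategy: treat everyone, regardless of the model
net_benefit_all <- function(obs, pt) {
  mean(obs) - (1 - mean(obs)) * pt / (1 - pt)
}
```

A model is useful at threshold pt only if its Net Benefit exceeds both net_benefit_all (treat all) and zero (treat none).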

Binary outcomes

A key paper (pdf) was published in 2010 by a wide group of experts.

Survival outcomes

In the context of STRATOS, two related papers were published: one on the evaluation of competing risks models, and another on the evaluation of survival models.

Epidemiology 2010: https://pubmed.ncbi.nlm.nih.gov/20010215/

Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010 Jan;21(1):128-38. doi: 10.1097/EDE.0b013e3181c30fb2

Calibration plots

The package can be installed in R as:

require(devtools)
install_git("https://github.com/BavoDC/CalibrationCurves")

It produces polished calibration plots.
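For intuition about what such a plot shows, a bare-bones calibration plot can also be sketched in base R (simulated data; this is not the package's own code): group the predictions into deciles and plot the observed event rate against the mean predicted risk per decile.

```r
# Simulate hypothetical predictions and outcomes consistent with them
set.seed(1)
n    <- 500
pred <- runif(n)                 # hypothetical predicted probabilities
obs  <- rbinom(n, 1, pred)       # outcomes drawn from those probabilities

# Group predictions into deciles
decile    <- cut(pred, quantile(pred, 0:10 / 10), include.lowest = TRUE)
mean_pred <- tapply(pred, decile, mean)   # mean prediction per decile
mean_obs  <- tapply(obs,  decile, mean)   # observed event rate per decile

# Observed vs predicted, with the ideal 45-degree line
plot(mean_pred, mean_obs, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)
```

Points close to the dashed 45-degree line indicate good calibration; the CalibrationCurves package adds smoothed curves and confidence intervals on top of this basic idea.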