Ch 15: Evaluation of performance
Brier score: difficult interpretation?
The Brier score is a composite measure of discrimination and calibration that summarizes the performance of a prediction model. It is defined as the squared distance between the observed outcome (0/1) and the prediction (a probability between 0 and 1). A perfect model would have distances of zero, so a Brier score of zero; most models are far from perfect. A problem is that the Brier score depends on the event rate (number of events / number of subjects). With an event rate of 0.5, the maximum Brier score for a non-informative model is 0.5^2 = 0.25. Gary Weissman has an excellent post on interpretation of the Brier score.
Brier = 1/N × Σᵢ (yᵢ − ŷᵢ)²
where N is the number of observations, yᵢ is the observed outcome (either 0 or 1), and ŷᵢ is the prediction, a probability, for the ith observation.
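The dependence on the event rate can be made concrete with a small simulation: a non-informative model that predicts the event rate p for everyone attains a Brier score of p × (1 − p). The simulation setup below (sample size, seed, event rates) is illustrative.

```r
# Non-informative model: predict the event rate p for every subject.
# Its Brier score approaches p * (1 - p), so the "worst case" reference
# value depends on the event rate (0.25 at p = 0.5, smaller otherwise).
brier_score <- function(obs, pred) mean((obs - pred)^2)

set.seed(42)
for (p in c(0.1, 0.3, 0.5)) {
  y <- rbinom(1e5, size = 1, prob = p)          # simulated 0/1 outcomes
  b <- brier_score(y, rep(mean(y), length(y)))  # predict the observed rate
  cat(sprintf("event rate %.1f: Brier = %.3f (p*(1-p) = %.3f)\n",
              p, b, p * (1 - p)))
}
```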
The scaled Brier score
As a reference, we need to calculate the maximum Brier score, which we can then use to compute the scaled Brier score. Three approaches turn out to be equivalent; the intuition for the maximum score may be clearest for the first formulation, as discussed on Twitter.
R code for (scaled) Brier score calculation
# the Brier score
brier_score <- function(obs, pred) { mean((obs - pred)^2) }
# obs: 0/1 outcome y; pred: prediction, a probability p̂
# three variants of the scaled Brier score
scaled_brier_score_1 <- function(obs, pred) {
  1 - (brier_score(obs, pred) / brier_score(obs, mean(obs)))  # mean(obs): ȳ
}
scaled_brier_score_2 <- function(obs, pred) {
  1 - (brier_score(obs, pred) / (mean(obs) * (1 - mean(obs))))
}
scaled_brier_score_3 <- function(obs, pred) {
  1 - (brier_score(obs, pred) /
         (mean(obs) * (1 - mean(obs))^2 + (1 - mean(obs)) * mean(obs)^2))
}
# these variants give identical results
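The equivalence can be checked on simulated data; the logistic data-generating model below is an illustrative assumption. For binary outcomes, brier_score(obs, mean(obs)) equals mean(obs) × (1 − mean(obs)) exactly, which is why all three denominators agree.

```r
# Check that the three scaled Brier score variants coincide.
brier_score <- function(obs, pred) mean((obs - pred)^2)

set.seed(1)
n <- 1000
x <- rnorm(n)
p <- plogis(-1 + x)        # predicted probabilities from a logistic model
y <- rbinom(n, 1, p)       # simulated 0/1 outcomes

s1 <- 1 - brier_score(y, p) / brier_score(y, mean(y))
s2 <- 1 - brier_score(y, p) / (mean(y) * (1 - mean(y)))
s3 <- 1 - brier_score(y, p) /
  (mean(y) * (1 - mean(y))^2 + (1 - mean(y)) * mean(y)^2)
c(s1 = s1, s2 = s2, s3 = s3)   # identical up to floating point
```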
Traditional and modern assessment of performance
Traditional performance assessment includes calibration and discrimination. Modern assessment includes Net Benefit, which expresses whether using a model leads to better clinical decisions.
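A minimal sketch of the standard Net Benefit calculation (following the usual decision-curve definition, not code from this chapter): true positives are credited and false positives are penalized by the odds of the decision threshold pt. The simulated data and threshold are illustrative assumptions.

```r
# Net benefit of a model at decision threshold pt:
#   NB = TP/n - FP/n * pt / (1 - pt)
net_benefit <- function(obs, pred, pt) {
  treat <- pred >= pt              # classify as "treat" above the threshold
  tp <- sum(treat & obs == 1)      # true positives
  fp <- sum(treat & obs == 0)      # false positives
  n  <- length(obs)
  tp / n - fp / n * pt / (1 - pt)  # weight FPs by the threshold odds
}

set.seed(2)
n <- 2000
x <- rnorm(n)
p <- plogis(-1 + x)                # model predictions
y <- rbinom(n, 1, p)               # simulated 0/1 outcomes

net_benefit(y, p, pt = 0.2)        # net benefit of the model
mean(y) - (1 - mean(y)) * 0.2 / 0.8  # treat-all strategy at the same threshold
```

A model is clinically useful at threshold pt when its net benefit exceeds both the treat-all and treat-none (net benefit 0) strategies.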
Binary outcomes
A key paper (pdf) was published in 2010 by a wide group of experts.
Survival outcomes
In the context of the STRATOS initiative, two related papers were published: one on the evaluation of competing risk models and another on the evaluation of survival models.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW.
Assessing the performance of prediction models: a framework for traditional and novel measures.
Epidemiology. 2010 Jan;21(1):128-38. doi: 10.1097/EDE.0b013e3181c30fb2
Calibration plots
The CalibrationCurves package can be installed in R as:
require(devtools)
install_git("https://github.com/BavoDC/CalibrationCurves")
It produces nice calibration graphics.
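If the package is unavailable, the core idea of a calibration plot can be sketched in base R: group subjects by predicted risk and plot the observed event rate against the mean prediction per group. The simulated data and the choice of deciles are illustrative assumptions, not the package's method.

```r
# Base-R calibration plot: observed vs predicted risk by decile of prediction.
set.seed(3)
n <- 1000
x <- rnorm(n)
p <- plogis(-1 + x)                  # predicted probabilities
y <- rbinom(n, 1, p)                 # observed 0/1 outcomes

g <- cut(p, quantile(p, 0:10 / 10), include.lowest = TRUE)  # risk deciles
obs_rate  <- tapply(y, g, mean)      # observed event rate per decile
pred_mean <- tapply(p, g, mean)      # mean prediction per decile

plot(pred_mean, obs_rate, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)                # line of perfect calibration
```

Points close to the dashed 45-degree line indicate good calibration; a smoothed curve (as produced by the package) refines this grouped view.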