Chapter 17: Internal validation
Split-sample, cross-validation, and the bootstrap
Several techniques are at the analyst's disposal for internal validation. In a classic JCE paper (Steyerberg, Harrell, 2001, pdf), we compared variants of split-sample validation, cross-validation, and bootstrap validation.
We found that:
Split-sample validation analyses gave overly pessimistic estimates of performance, with large variability;
Cross-validation on 10% of the sample had low bias and low variability;
Bootstrapping provided stable estimates with low bias.
We concluded that split-sample validation is inefficient, and recommended bootstrapping for estimation of the internal validity of a predictive logistic regression model. This conclusion was largely confirmed in a later study, with a minor note of caution: "with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than .. cross-validation". In a recent JCE Commentary (Steyerberg & Harrell, 2015, pdf), we stated that 'split-sample validation only works when not needed', i.e. only in the situation of very large sample size. A stable prediction model can then be developed on the training part of the data, and a reliable assessment of performance can be obtained in the large test part. But in such a situation, a better solution is to develop the model on the total data set, with the apparent performance as the best estimate of future performance.
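To make the recommended bootstrap procedure concrete, a minimal sketch in Python is given below. It uses scikit-learn on simulated data; the settings (500 patients, 10 predictors, 200 bootstrap samples, the C-statistic as performance measure) are illustrative assumptions, not those of the original analyses.

```python
# Sketch of bootstrap internal validation (optimism correction) for a
# logistic regression model; data and settings are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

def fit_auc(X_fit, y_fit, X_eval, y_eval):
    """Fit a logistic model on (X_fit, y_fit) and return the C-statistic on (X_eval, y_eval)."""
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

# Apparent performance: model developed and evaluated on the full sample
apparent = fit_auc(X, y, X, y)

# Bootstrap optimism: repeat model development in bootstrap samples and compare
# performance in the bootstrap sample with performance in the original sample
optimism = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))            # sample with replacement
    boot_apparent = fit_auc(X[idx], y[idx], X[idx], y[idx])
    boot_test = fit_auc(X[idx], y[idx], X, y)
    optimism.append(boot_apparent - boot_test)

corrected = apparent - np.mean(optimism)
print(f"Apparent C-statistic:           {apparent:.3f}")
print(f"Optimism-corrected C-statistic: {corrected:.3f}")
```

The key point of the procedure is that the full model development process is repeated in every bootstrap sample, so the whole data set remains available for model development while optimism is still quantified.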
Machine learning and high-dimensional model validation
Machine learning techniques are increasingly used for prediction in high-dimensional data, e.g. all kinds of 'omics' data. We should realize that machine learning techniques are relatively data-hungry, and that high-dimensional data pose a far greater risk of overfitting than classical situations with, say, 5 to 20 predictors. Hence, validation is very important.
Researchers in this field often resort to old-fashioned split-sample techniques. An extreme example was published in Nature Medicine in 2018. Here, researchers split a data set of 54 patients with leukemia into a training part of 44 patients and a validation part of 10 patients. The endpoint of interest was relapse, which occurred in only 3 of the 10 patients in the validation part. One does not have to be a statistician to understand that such a small sample size for validation leads to a highly unreliable assessment of performance (JCE 2018).
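A small simulation sketch illustrates how unstable a C-statistic is when estimated on 10 patients with only 3 events. The predictor strength below (true C-statistic around 0.76) is our own assumption, not a reconstruction of the published model.

```python
# Spread of the C-statistic when validated on 10 patients with 3 events.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
aucs = []
for _ in range(2000):
    # 3 events and 7 non-events, with a predictor of moderate discrimination
    score_events = rng.normal(loc=1.0, scale=1.0, size=3)
    score_nonevents = rng.normal(loc=0.0, scale=1.0, size=7)
    y = np.r_[np.ones(3), np.zeros(7)]
    aucs.append(roc_auc_score(y, np.r_[score_events, score_nonevents]))

aucs = np.array(aucs)
print(f"Median C-statistic: {np.median(aucs):.2f}")
print(f"2.5th-97.5th percentile: {np.percentile(aucs, 2.5):.2f} to {np.percentile(aucs, 97.5):.2f}")
```

The percentile range typically spans from near chance level to near perfect discrimination, which is exactly the unreliability criticized above.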
Sample size for validation studies
Sample size recommendations suggest at least 100 (Vergouwe 2005) or 200 (Collins 2016) events in validation samples. Others have suggested that lower numbers might suffice, depending on the specific requirements for validation. Additional simulations confirm that 100 events is an absolute minimum for a reliable assessment of performance (Steyerberg 2018).
More refined guidance on sample size at validation has recently been proposed, based on the desired precision of the C-statistic for discrimination, the calibration slope, and calibration-in-the-large (Pavlou et al. SiM 2021; Riley et al. SiM 2021, SiM 2022).
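The idea behind such precision-based guidance can be illustrated with a simple simulation of how the spread of the estimated C-statistic shrinks with the number of events. The event rate (20%) and predictor strength below are illustrative assumptions; for formal sample size formulas, see the cited papers.

```python
# Precision of the validated C-statistic as a function of the number of events.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)

def auc_sd(n_events, n_nonevents, n_sim=1000):
    """Standard deviation of the estimated C-statistic across simulated validation samples."""
    aucs = []
    for _ in range(n_sim):
        score_events = rng.normal(1.0, 1.0, n_events)       # true C-statistic ~ 0.76
        score_nonevents = rng.normal(0.0, 1.0, n_nonevents)
        y = np.r_[np.ones(n_events), np.zeros(n_nonevents)]
        aucs.append(roc_auc_score(y, np.r_[score_events, score_nonevents]))
    return np.std(aucs)

for n_events in (50, 100, 200):
    sd = auc_sd(n_events, n_nonevents=4 * n_events)          # assume a 20% event rate
    print(f"{n_events:>3} events: SD of C-statistic ~ {sd:.3f}")
```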
Internal vs external validation: what is the purpose?
Internal validation aims to quantify optimism in model performance; we consider performance for a single underlying population.
External validation aims to assess generalizability to 'similar, related populations' (Justice 1999, pdf; Debray 2015, pdf).
If random splits are made, we assess internal validity; this practice should be abolished, since cross-validation and bootstrapping are superior (dominant) approaches. If non-random splits are made, e.g. splits in time or by place (centers, countries), we assess generalizability. Here our aim should be to quantify heterogeneity in performance rather than to obtain a single estimate of 'performance in new data' (Austin 2016, 2017; Riley 2016; Steyerberg 2019).
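A minimal sketch of such an internal-external (leave-one-center-out) cross-validation is given below. The five simulated centers, their differing baseline risks, and the predictor effects are our own assumptions, chosen only to show how per-center performance and its heterogeneity can be reported.

```python
# Leave-one-center-out cross-validation to quantify heterogeneity in performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
beta = np.array([0.8, 0.5, 0.3, 0.0])

# Simulate 5 centers that differ in baseline risk
X_list, y_list, center_list = [], [], []
for center in range(5):
    Xc = rng.normal(size=(300, 4))
    lp = Xc @ beta + (-1.0 + 0.3 * center)        # center-specific intercept
    yc = rng.binomial(1, 1 / (1 + np.exp(-lp)))
    X_list.append(Xc); y_list.append(yc); center_list.append(np.full(300, center))

X = np.vstack(X_list)
y = np.concatenate(y_list)
centers = np.concatenate(center_list)

# Develop the model on all other centers, validate on the held-out center
aucs = {}
for c in np.unique(centers):
    train, test = centers != c, centers == c
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    aucs[c] = roc_auc_score(y[test], model.predict_proba(X[test])[:, 1])

for c, auc in aucs.items():
    print(f"Center {c}: C-statistic = {auc:.3f}")
print(f"Range across centers: {min(aucs.values()):.3f} to {max(aucs.values()):.3f}")
```

Reporting the full set of center-specific estimates, rather than only their average, is what makes heterogeneity in performance visible.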
Problems with 'internal validation'
A modeling study published in Ann Int Med in 2005 claimed to correct for statistical optimism by bootstrapping. The results make no sense (pdf), and neither did a published 'correction'.