Chapter 17: Internal validation

Split-sample, cross-validation, and the bootstrap

Several techniques are at the analyst's disposal for internal validation. In a classic JCE paper (Steyerberg, Harrell, 2001, pdf), we compared variants of split-sample validation, cross-validation, and bootstrap validation.

We found that:

  1. Split-sample validation analyses gave overly pessimistic estimates of performance, with large variability;

  2. Cross-validation on 10% of the sample had low bias and low variability;

  3. Bootstrapping provided stable estimates with low bias.

We concluded that split-sample validation is inefficient, and recommended bootstrapping for estimating the internal validity of a predictive logistic regression model. This conclusion was largely confirmed in a later study, with a minor note of caution: “with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than ... cross-validation”. In a recent JCE Commentary (Steyerberg & Harrell, 2015, pdf), we stated that 'split-sample validation only works when not needed', i.e. when the sample size is very large. Only then can a stable prediction model be developed on the training part of the data, and performance be assessed reliably in a large test set. But when both the training and the test part are that large, a better solution is to develop the model on the total data set, with the apparent performance as the best estimate of future performance.

Figure 1 Schematic representation of apparent, split-sample, and bootstrap validation. Suppose we have a development sample of 1,000 subjects (numbered 1, 2, 3, ..., 1000). Apparent validation assesses the performance of a model estimated in these 1,000 subjects on that same sample. Split-sample validation may use 50% for model development and 50% for validation. Bootstrapping involves sampling with replacement (e.g., subject number 1 is drawn twice, number 2 is left out, etc.), with validation of the model developed in the bootstrap sample (Sample*) in the original sample.

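The bootstrap scheme of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the JCE paper: all function names (`c_statistic`, `fit_mean_difference`, `bootstrap_validate`, `simulate_rows`) are hypothetical, the data are simulated with arbitrary settings, and a simple difference-in-means score stands in for a logistic regression model so that the example needs only the standard library. Optimism is estimated as the average difference between performance in the bootstrap sample and performance in the original sample, and is then subtracted from the apparent performance.

```python
import math
import random


def c_statistic(scores, outcomes):
    """Share of event/non-event pairs where the event scores higher (ties count 0.5)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))


def fit_mean_difference(rows):
    """Stand-in model (assumption): weight = mean in events minus mean in non-events."""
    events = [x for x, y in rows if y == 1]
    nonevents = [x for x, y in rows if y == 0]
    return [
        sum(x[j] for x in events) / len(events)
        - sum(x[j] for x in nonevents) / len(nonevents)
        for j in range(len(rows[0][0]))
    ]


def performance(weights, rows):
    """C-statistic of the weighted score in the given rows."""
    scores = [sum(w * v for w, v in zip(weights, x)) for x, _ in rows]
    return c_statistic(scores, [y for _, y in rows])


def bootstrap_validate(rows, n_boot=200, seed=1):
    """Return (apparent C, optimism-corrected C)."""
    rng = random.Random(seed)
    apparent = performance(fit_mean_difference(rows), rows)
    optimism = []
    while len(optimism) < n_boot:
        boot = rng.choices(rows, k=len(rows))  # sample with replacement
        if len({y for _, y in boot}) < 2:
            continue  # skip samples lacking events or non-events
        weights = fit_mean_difference(boot)    # model developed in Sample*
        # performance in the bootstrap sample minus performance in the original sample
        optimism.append(performance(weights, boot) - performance(weights, rows))
    return apparent, apparent - sum(optimism) / n_boot


def simulate_rows(n=100, n_noise=8, seed=0):
    """Hypothetical data: one true predictor plus pure-noise predictors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(1 + n_noise)]
        risk = 1 / (1 + math.exp(-(x[0] - 1.0)))  # only x[0] affects the outcome
        rows.append((x, 1 if rng.random() < risk else 0))
    return rows


if __name__ == "__main__":
    apparent, corrected = bootstrap_validate(simulate_rows())
    print(f"apparent C = {apparent:.3f}, optimism-corrected C = {corrected:.3f}")
```

Every modeling step (here only the estimation of weights) is repeated inside each bootstrap sample; in a real analysis this would include any variable selection as well.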

Figure 2 Schematic representation of internal-external cross-validation and external validation. Suppose we have 4 centers (a – d) in our development sample. We may leave out 1 center at a time to cross-validate a model developed in the other centers. One such validation is illustrated: a model based on 750 subjects from centers b, c, and d is validated on 250 subjects from center a. Since the split is not at random, this qualifies as external validation. The final model is based on all data, and can subsequently be validated externally when new data become available for analysis after publication of the model. This approach is best when there is a large number of small centers.
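The leave-one-center-out loop of Figure 2 might look as follows. Again this is only a sketch: the function names, the simulated four-center data, and the difference-in-means stand-in model are assumptions made for illustration. The point is the structure: each center is held out once, the model is refitted on the remaining centers, performance is reported per held-out center, and the final model uses all data.

```python
import math
import random


def c_statistic(scores, outcomes):
    """Share of event/non-event pairs where the event scores higher (ties count 0.5)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))


def fit_mean_difference(rows):
    """Stand-in model (assumption): weight = mean in events minus mean in non-events."""
    events = [x for x, y in rows if y == 1]
    nonevents = [x for x, y in rows if y == 0]
    return [
        sum(x[j] for x in events) / len(events)
        - sum(x[j] for x in nonevents) / len(nonevents)
        for j in range(len(rows[0][0]))
    ]


def internal_external_cv(centers):
    """centers maps a center name to a list of (predictors, outcome) rows."""
    per_center = {}
    for held_out, test_rows in centers.items():
        # develop the model in all centers except the held-out one
        train_rows = [row for name, rows in centers.items()
                      if name != held_out for row in rows]
        weights = fit_mean_difference(train_rows)
        scores = [sum(w * v for w, v in zip(weights, x)) for x, _ in test_rows]
        per_center[held_out] = c_statistic(scores, [y for _, y in test_rows])
    # the final model is based on all data
    final_weights = fit_mean_difference([row for rows in centers.values() for row in rows])
    return per_center, final_weights


def simulate_centers(names=("a", "b", "c", "d"), n_per_center=250, seed=0):
    """Hypothetical data: same true model in every center, two predictors."""
    rng = random.Random(seed)
    centers = {}
    for name in names:
        rows = []
        for _ in range(n_per_center):
            x = [rng.gauss(0, 1), rng.gauss(0, 1)]
            risk = 1 / (1 + math.exp(-(-1.0 + x[0] + 0.5 * x[1])))
            rows.append((x, 1 if rng.random() < risk else 0))
        centers[name] = rows
    return centers


if __name__ == "__main__":
    per_center, _ = internal_external_cv(simulate_centers())
    for name, c in per_center.items():
        print(f"held-out center {name}: C = {c:.3f}")
```

Reporting the per-center results side by side, rather than pooling them into one number, shows how much performance varies between centers.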

Machine learning and high-dimensional model validation

Machine learning techniques are increasingly used for prediction in high-dimensional data, such as all kinds of 'omics' data. We should realize that machine learning techniques are relatively data hungry, and that high-dimensional data carry a far greater risk of overfitting than classical settings with, say, 5 to 20 predictors. Hence, validation is very important.

Researchers in this field often resort to old-fashioned split-sample techniques. An extreme example was published in Nature Medicine in 2018. Here, researchers split a data set of 54 patients with leukemia into a training part with 44 patients and a validation part with 10 patients. The endpoint of interest was relapse, which occurred in only 3 of the 10 patients in the validation set. One does not have to be a statistician to understand that such a small sample size for validation leads to a highly unreliable assessment of performance (JCE 2018).

Estimates of C-statistics in 100,000 simulations of validation of a prediction model with a true C-statistic (indicating discriminative ability) of either 0.7, 0.8, or 0.9, in a situation with 500 events (1167 non-events), 100 events (233 non-events), or 3 events (7 non-events). We note an extremely wide distribution of estimates with 3 events in 10, with a spike at 1.0.
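A scaled-down version of such a simulation can be sketched as below. The setup is an assumption, not the code behind the figure: scores are drawn from two unit-variance normal distributions (a binormal model), for which the true C-statistic equals Φ(d/√2) when the means differ by d, and the function names are hypothetical. Far fewer than 100,000 replications are used to keep the run time short.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def c_statistic(event_scores, nonevent_scores):
    """Rank-based (Mann-Whitney) C-statistic; assumes no tied scores."""
    labelled = sorted([(s, 1) for s in event_scores] + [(s, 0) for s in nonevent_scores])
    rank_sum = sum(rank for rank, (_, is_event) in enumerate(labelled, start=1) if is_event)
    n1, n0 = len(event_scores), len(nonevent_scores)
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * n0)


def simulate_c_estimates(true_c, n_events, n_nonevents, n_sim=1000, seed=1):
    """Estimated C-statistics over repeated validation samples of a fixed size."""
    # mean shift that yields the requested true C under the binormal assumption
    shift = math.sqrt(2) * NormalDist().inv_cdf(true_c)
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sim):
        events = [rng.gauss(shift, 1) for _ in range(n_events)]
        nonevents = [rng.gauss(0, 1) for _ in range(n_nonevents)]
        estimates.append(c_statistic(events, nonevents))
    return estimates


if __name__ == "__main__":
    for n_events, n_nonevents in [(500, 1167), (100, 233), (3, 7)]:
        est = simulate_c_estimates(0.8, n_events, n_nonevents, n_sim=200)
        share_perfect = sum(e == 1.0 for e in est) / len(est)
        print(f"{n_events:>3} events: mean {mean(est):.3f}, SD {stdev(est):.3f}, "
              f"share with C = 1.0: {share_perfect:.2f}")
```

With 3 events and 7 non-events the estimate can take only 22 distinct values (multiples of 1/21), which is why a visible share of replications lands exactly on 1.0.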

Sample size for validation studies

Sample size recommendations suggest at least 100 (Vergouwe 2005) or 200 (Collins 2016) events in validation samples. Others suggested that lower numbers might suffice, depending on the specific requirements for validation. Some additional simulations confirm that 100 events is an absolute minimum for reliable assessment of performance (Steyerberg 2018).

More refined guidance on sample size at validation has recently been proposed, based on the precision of the C-statistic for discrimination, the calibration slope, and calibration in the large (Pavlou SiM 2021; Riley et al. SiM 2021, SiM 2022).
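To get a rough feel for how precision of the C-statistic depends on the number of events, one can use the classical Hanley and McNeil (1982) approximation to its standard error. Note that this is a swapped-in textbook approximation, not the method of the papers cited above; the function names are hypothetical, and the approximation is only indicative in very small samples.

```python
import math


def hanley_mcneil_se(c, n_events, n_nonevents):
    """Approximate standard error of a C-statistic (Hanley & McNeil 1982)."""
    q1 = c / (2 - c)            # two events both outrank one non-event
    q2 = 2 * c ** 2 / (1 + c)   # one event outranks two non-events
    variance = (
        c * (1 - c)
        + (n_events - 1) * (q1 - c ** 2)
        + (n_nonevents - 1) * (q2 - c ** 2)
    ) / (n_events * n_nonevents)
    return math.sqrt(variance)


def approx_ci(c, n_events, n_nonevents, z=1.96):
    """Wald-type interval, truncated to [0, 1]; a crude summary of precision."""
    se = hanley_mcneil_se(c, n_events, n_nonevents)
    return max(0.0, c - z * se), min(1.0, c + z * se)


if __name__ == "__main__":
    for n_events, n_nonevents in [(3, 7), (100, 233), (200, 467), (500, 1167)]:
        low, high = approx_ci(0.8, n_events, n_nonevents)
        print(f"{n_events:>3} events: 95% CI {low:.2f} to {high:.2f}")
```

For a C-statistic of 0.8, the approximate standard error is about 0.03 with 100 events and 233 non-events, against about 0.17 with 3 events and 7 non-events.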

Internal vs external validation: what is the purpose?

  1. Internal validation aims to quantify optimism in model performance; we consider performance for a single underlying population.

  2. External validation aims to assess generalizability to 'similar, related populations' (Justice 1999, pdf; Debray 2015, pdf).

If random splits are made, we assess internal validity; this practice should be abolished, since cross-validation and bootstrapping are superior. If non-random splits are made, e.g. in time or by place (centers, countries), we assess generalizability. Here our aim should be to quantify heterogeneity in performance rather than to report a single estimate of 'performance in new data' (Austin 2016, 2017; Riley 2016; Steyerberg 2019).

Types of prediction model studies covered by the TRIPOD statement

Problems with 'internal validation'

A modeling study was published in Ann Int Med in 2005 that claimed to correct for statistical optimism by bootstrapping. The results make no sense (pdf). Neither did a published 'correction'.