Chapter 19: Generalizability

We consider some additional scenarios in which model predictions are invalid, expanding on Section 19.9. Table 19.4 summarizes the results with respect to calibration, discrimination, and clinical usefulness.

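More details are provided below for scenario 1 (change of setting: more or less severe case-mix in z, slightly different effects of X) and scenario 2 (RCT vs survey: difference in heterogeneity of X, more or less severe case-mix in z, slightly different effects of X). Here X denotes the predictors in the model and z denotes predictors that the model misses.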

Impact of change in setting

The graphs for this scenario can be interpreted as follows. A prediction model is applied in a different context, and the difference is not fully represented in the predictors X. For example, we may want to determine the validity of a model that predicts indolent prostate cancer in a screening setting, while the model was developed in a clinical setting (see e.g. a prostate cancer case study). The case-mix will be different, and the difference may not be fully captured by the predictors. The combination of a more severe case-mix in z and slightly different coefficients makes the prediction model poorly calibrated (a|b=1 = 0.67, slope = 0.87), lowers the c statistic (0.78), and makes decision making worse than a ‘treat all’ strategy (left panel). Remarkably, a less severe case-mix in z combined with different coefficients is still associated with substantial clinical usefulness (Net Benefit +0.098) despite the miscalibration (right panel). A more systematic evaluation of the impact of miscalibration is provided by Van Calster & Vickers 2015.
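
As an illustration of the comparison with a ‘treat all’ strategy, the sketch below computes Net Benefit using the usual decision-analytic definition (true positives minus weighted false positives, per subject). The decision threshold of 0.5 and the toy data are assumptions for illustration only; the text above does not state which threshold or reference strategy underlies the reported Net Benefit values.

```python
def net_benefit(y, p, threshold):
    """Net Benefit of treating subjects whose predicted risk is >= threshold.

    y: observed outcomes (0/1); p: predicted risks; threshold: risk cut-off.
    """
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    weight = threshold / (1.0 - threshold)  # odds of the threshold
    return tp / n - weight * fp / n


def net_benefit_treat_all(y, threshold):
    """Net Benefit when every subject is treated, regardless of predicted risk."""
    prevalence = sum(y) / len(y)
    weight = threshold / (1.0 - threshold)
    return prevalence - weight * (1.0 - prevalence)


# Hypothetical validation data (not from the chapter): 10 subjects, 3 events.
y_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p_toy = [0.9, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]
threshold_toy = 0.5  # assumed decision threshold

nb_model = net_benefit(y_toy, p_toy, threshold_toy)
nb_all = net_benefit_treat_all(y_toy, threshold_toy)
print("model:", nb_model, "treat all:", nb_all)
```

A model is clinically useful at a given threshold only if its Net Benefit exceeds that of both default strategies, ‘treat all’ and ‘treat none’ (the latter has a Net Benefit of 0 by definition).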

RCT vs survey

The second scenario concerns a model that is developed on data from an RCT and applied in a less selected population, such as a survey or registry. Again, we may think of RCTs and surveys in traumatic brain injury, as considered in SiM 2019. Differences in X are obvious, with a broader case-mix in surveys.

a) Some missed predictors (z) may also have a more severe distribution in surveys, since physicians may tend to exclude patients with a very poor prognosis from RCTs, using clinical judgment that is hard to capture in formal criteria. We find that such a combination leads to a systematically miscalibrated model (a|b=1 = 0.64) with a high c statistic (0.88) but poor clinical usefulness (Net Benefit 0.027, upper left panel).

b) If a more severe case-mix in z coincides with a less heterogeneous distribution of X, model performance is poor: poor calibration, only modest discrimination, and harmful decision making (Net Benefit -0.036, upper right panel).

c) A more favorable scenario is a less severe case-mix in z combined with a more heterogeneous case-mix in X. This causes a calibration-in-the-large problem, but discrimination is adequate and clinical usefulness is surprisingly high (lower left panel).

d) Applying a model from a survey in an RCT might lead to a less severe distribution of z and less heterogeneity in X. This would imply poor calibration and only modest discrimination, but the model would still have some clinical usefulness (Net Benefit 0.037, lower right panel).
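
The calibration and discrimination measures quoted in these scenarios can be computed from the linear predictor of the model and the observed outcomes in the validation data. The sketch below is a minimal illustration, not the code used for the chapter: it assumes that a|b=1 is the intercept of a logistic recalibration model with the linear predictor as an offset, that the slope comes from a logistic regression of the outcome on the linear predictor, and that the c statistic is the proportion of concordant event/non-event pairs. The function names and the toy data are hypothetical.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def calibration_in_the_large(y, lp, tol=1e-10, max_iter=50):
    """Intercept a with slope fixed at 1 (a|b=1): logit P(y=1) = a + lp."""
    a = 0.0
    for _ in range(max_iter):
        p = [expit(a + l) for l in lp]
        score = sum(yi - pi for yi, pi in zip(y, p))
        info = sum(pi * (1.0 - pi) for pi in p)
        step = score / info  # Newton-Raphson update for a single parameter
        a += step
        if abs(step) < tol:
            break
    return a


def calibration_slope(y, lp, tol=1e-10, max_iter=50):
    """Intercept a and slope b of logit P(y=1) = a + b * lp (Newton-Raphson)."""
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(max_iter):
        p = [expit(a + b * l) for l in lp]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * l for yi, pi, l in zip(y, p, lp))
        i00 = sum(w)
        i01 = sum(wi * l for wi, l in zip(w, lp))
        i11 = sum(wi * l * l for wi, l in zip(w, lp))
        det = i00 * i11 - i01 * i01
        da = (i11 * g0 - i01 * g1) / det  # solve the 2x2 system by hand
        db = (i00 * g1 - i01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b


def c_statistic(y, lp):
    """Proportion of event/non-event pairs ranked correctly (ties count 0.5)."""
    events = [l for yi, l in zip(y, lp) if yi == 1]
    non_events = [l for yi, l in zip(y, lp) if yi == 0]
    concordant = 0.0
    for e in events:
        for ne in non_events:
            if e > ne:
                concordant += 1.0
            elif e == ne:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical validation data: linear predictor from the model, observed outcomes.
lp_val = [-2, -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2, 2]
y_val = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]

a_large = calibration_in_the_large(y_val, lp_val)
a_fit, b_fit = calibration_slope(y_val, lp_val)
c_val = c_statistic(y_val, lp_val)
print("a|b=1:", a_large, "slope:", b_fit, "c:", c_val)
```

In this toy example more events are observed than the model predicts, so a|b=1 is positive, which is the pattern expected when the case-mix in z is more severe at validation than at development.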

We note that, in addition to causing systematically poorer performance with respect to calibration and discrimination, applying a model in a different case-mix makes predictions for individual subjects a type of extrapolation. We can see such application as extrapolation in the multivariate space defined by the predictor values. The model is applied to patterns of X that were relatively sparse at model development, which causes epistemic uncertainty (some references: Wikipedia, 1, 2).
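
One possible way to make the idea of sparse patterns of X concrete is to measure how far a new subject lies from the development data in the multivariate predictor space. The sketch below uses the squared Mahalanobis distance for two predictors. This choice of measure, the function names, and the toy data are all assumptions for illustration; the text above does not prescribe a specific method.

```python
def mean_and_cov(points):
    """Mean vector and sample covariance (s11, s12, s22) of 2-D points."""
    n = len(points)
    m1 = sum(pt[0] for pt in points) / n
    m2 = sum(pt[1] for pt in points) / n
    s11 = sum((pt[0] - m1) ** 2 for pt in points) / (n - 1)
    s22 = sum((pt[1] - m2) ** 2 for pt in points) / (n - 1)
    s12 = sum((pt[0] - m1) * (pt[1] - m2) for pt in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point, using the 2x2 inverse."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    d1 = point[0] - mean[0]
    d2 = point[1] - mean[1]
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det


# Hypothetical development data: two positively correlated predictors.
development = [(0, 0), (1, 1), (2, 1), (1, 2), (3, 3), (2, 3), (3, 2), (4, 4)]
dev_mean, dev_cov = mean_and_cov(development)
dev_max = max(mahalanobis_sq(pt, dev_mean, dev_cov) for pt in development)

# Each value of this new subject lies within the observed range of its predictor,
# but the combination (high x1, low x2) was never seen at development.
new_subject = (4, 0)
d2_new = mahalanobis_sq(new_subject, dev_mean, dev_cov)
print("largest distance at development:", dev_max, "new subject:", d2_new)
```

A distance well beyond the largest distance seen at development suggests that the prediction for that subject relies on extrapolation, even when each predictor value on its own looks unremarkable.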

Case study in TBI

An exercise is available related to the Traumatic Brain Injury (TBI) example in Table 19.5:
Validate prognostic models in TBI (from US to International) with background document and data set.