Radiotherapy & Oncology
Article in Press

Multivariate modeling of complications with data driven variable selection: Guarding against overfitting and effects of data set size

  • Arjen van der Schaaf
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
    • Corresponding author. Address: Department of Radiation Oncology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands.
  • Cheng-Jian Xu
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  • Peter van Luijk
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  • Aart A. van’t Veld
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  • Johannes A. Langendijk
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
  • Cornelis Schilstra
    • Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
    • Radiotherapy Institute Friesland, Leeuwarden, The Netherlands

Received 9 June 2011; received in revised form 3 November 2011; accepted 12 December 2011; published online 24 January 2012.
Corrected Proof

Abstract 

Purpose

Multivariate modeling of complications after radiotherapy is frequently used in conjunction with data driven variable selection. This study quantifies the risk of overfitting in a data driven modeling method using bootstrapping for data with typical clinical characteristics, and estimates the minimum amount of data needed to obtain models with relatively high predictive power.

Materials and methods

To facilitate repeated modeling and cross-validation with independent data sets for the assessment of true predictive power, a method was developed to generate simulated data with statistical properties similar to real clinical data sets. Characteristics of three clinical data sets from radiotherapy treatment of head and neck cancer patients were used to simulate data with set sizes between 50 and 1000 patients. A logistic regression method using bootstrapping and forward variable selection was used for complication modeling, which yielded a selected number of variables and an estimated predictive power for each simulated data set. The true optimal number of variables and true predictive power were calculated using cross-validation with very large independent data sets.
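The general idea of combining bootstrapping with forward variable selection can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how the bootstrap drives the selection, so scoring each model size by its out-of-bag log-likelihood, fitting by plain gradient ascent, the toy data generator, and all function names (`fit_logistic`, `forward_path`, `bootstrap_num_vars`, `simulate`) are assumptions made here for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, cols, iters=80, lr=1.0):
    # Plain gradient ascent on the mean log-likelihood; w[0] is the intercept.
    w = [0.0] * (len(cols) + 1)
    n = len(y)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = target - sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    return w

def mean_loglik(w, X, y, cols):
    # Mean log-likelihood per patient; ln(0.5) is the level of always predicting 50%.
    total = 0.0
    for row, target in zip(X, y):
        p = sigmoid(w[0] + sum(wi * row[c] for wi, c in zip(w[1:], cols)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += math.log(p) if target == 1 else math.log(1.0 - p)
    return total / len(y)

def forward_path(X, y, max_vars):
    # Greedy forward selection; returns [(cols, weights)] for 0..max_vars variables.
    selected = []
    path = [([], fit_logistic(X, y, []))]
    candidates = list(range(len(X[0])))
    for _ in range(max_vars):
        fits = {c: fit_logistic(X, y, selected + [c]) for c in candidates}
        best = max(candidates, key=lambda c: mean_loglik(fits[c], X, y, selected + [c]))
        selected = selected + [best]
        candidates.remove(best)
        path.append((selected, fits[best]))
    return path

def bootstrap_num_vars(X, y, max_vars=2, n_boot=8, seed=0):
    # Assumed criterion: pick the model size with the best summed out-of-bag log-likelihood.
    rng = random.Random(seed)
    n = len(y)
    scores = [0.0] * (max_vars + 1)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(idx)
        oob = [i for i in range(n) if i not in drawn]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        Xo, yo = [X[i] for i in oob], [y[i] for i in oob]
        for k, (cols, w) in enumerate(forward_path(Xb, yb, max_vars)):
            scores[k] += mean_loglik(w, Xo, yo, cols)
    best_k = max(range(max_vars + 1), key=lambda k: scores[k])
    return best_k, scores

def simulate(n, seed):
    # Hypothetical toy data: three standard-normal variables, only column 0 drives the outcome.
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(2.0 * row[0]) else 0 for row in X]
    return X, y

X, y = simulate(100, seed=1)
best_k, scores = bootstrap_num_vars(X, y)
print("selected number of variables:", best_k)
```

Evaluating each candidate model on patients left out of the bootstrap sample is one common way to penalize variables that only fit noise; a real analysis would use a proper optimizer and far more bootstrap repetitions than this sketch does.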

Results

For all simulated data set sizes, the number of variables selected by the bootstrapping method was on average close to the true optimal number of variables, but showed considerable spread. Bootstrapping was more accurate in selecting the optimal number of variables than the AIC and BIC alternatives, but this did not translate into a significant difference in true predictive power. The true predictive power converged asymptotically toward a maximum for large data sets, and the estimated predictive power converged toward the true predictive power. More than half of the potential predictive power was gained after approximately 200 samples. Our simulations demonstrated severe overfitting (a predictive power lower than that of predicting 50% probability) in a number of small data sets, in particular in data sets with a low number of events (median: 7, 95th percentile: 32). Recognizing overfitting from an inverted sign of the estimated model coefficients had limited discriminative value.
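The notion of severe overfitting used above, a model that predicts worse than always stating a 50% probability, can be illustrated with a small check. The abstract does not state which measure of predictive power was used, so the mean log-likelihood on independent validation data is an assumption here, and the names `mean_log_likelihood` and `is_severely_overfitted` and the example numbers are hypothetical.

```python
import math

# Mean log-likelihood of a model that always predicts a 50% complication probability.
COIN_FLIP = math.log(0.5)

def mean_log_likelihood(probs, outcomes):
    # Average log-likelihood of binary outcomes under the predicted probabilities.
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(outcomes)

def is_severely_overfitted(probs, outcomes):
    # True when the model scores below the constant 50% prediction on validation data.
    return mean_log_likelihood(probs, outcomes) < COIN_FLIP

# Hypothetical validation set: 10 patients, 5 events.
outcomes = [1, 0] * 5
overconfident = [0.99] * 10   # extreme predictions that are wrong half the time
modest = [0.8, 0.2] * 5       # moderate predictions that point the right way

print(is_severely_overfitted(overconfident, outcomes))
print(is_severely_overfitted(modest, outcomes))
```

Under this assumed measure, overconfident predictions are penalized heavily for each wrong call, which is how a model fitted to very few events can end up below the 50% baseline on new patients.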

Conclusions

Despite considerable spread around the optimal number of selected variables, the bootstrapping method is efficient and accurate for sufficiently large data sets, and guards against overfitting for all simulated cases with the exception of some data sets with a particularly low number of events. An appropriate minimum data set size to obtain a model with high predictive power is approximately 200 patients and more than 32 events. With fewer data samples the true predictive power decreases rapidly, and for larger data set sizes the benefit levels off toward an asymptotic maximum predictive power.

Keywords: Complication, Risk, NTCP modeling, Multivariate, Variable selection, Predictive power


PII: S0167-8140(11)00741-9

doi:10.1016/j.radonc.2011.12.006
