Vulnerabilities of radiomic signature development: The need for safeguards

  • Mattea L. Welch
    Affiliations
    Department of Medical Biophysics, University of Toronto, Canada

    The Techna Institute for the Advancement of Technology for Health, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Chris McIntosh
    Affiliations
    Radiation Medicine Program, Princess Margaret Cancer Centre, Toronto, Canada

    The Techna Institute for the Advancement of Technology for Health, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Benjamin Haibe-Kains
    Affiliations
    Department of Medical Biophysics, University of Toronto, Canada

    Ontario Institute of Cancer Research, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada

    Vector Institute, Toronto, Canada
    Search for articles by this author
  • Michael F. Milosevic
    Affiliations
    Department of Radiation Oncology, University of Toronto, Canada

    Radiation Medicine Program, Princess Margaret Cancer Centre, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Leonard Wee
    Affiliations
    Department of Radiation Oncology (MAASTRO), GROW Research Institute, Maastricht University, The Netherlands
    Search for articles by this author
  • Andre Dekker
    Affiliations
    Department of Radiation Oncology (MAASTRO), GROW Research Institute, Maastricht University, The Netherlands
    Search for articles by this author
  • Shao Hui Huang
    Affiliations
    Department of Radiation Oncology, University of Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Thomas G. Purdie
    Affiliations
    Department of Radiation Oncology, University of Toronto, Canada

    Radiation Medicine Program, Princess Margaret Cancer Centre, Toronto, Canada

    The Techna Institute for the Advancement of Technology for Health, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Brian O'Sullivan
    Affiliations
    Department of Radiation Oncology, University of Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
  • Hugo J.W.L. Aerts
    Affiliations
    Dana-Farber Cancer Institute, Brigham and Women’s Hospital, Harvard Medical School, Boston, USA
    Search for articles by this author
  • David A. Jaffray
    Correspondence
    Corresponding author at: 190 Elizabeth St. RFE, 1S-411, University Health Network, Toronto, Ontario M5G 2M9, Canada.
    Affiliations
    Department of Medical Biophysics, University of Toronto, Canada

    Department of Radiation Oncology, University of Toronto, Canada

    IBBME, University of Toronto, Canada

    Radiation Medicine Program, Princess Margaret Cancer Centre, Toronto, Canada

    The Techna Institute for the Advancement of Technology for Health, Toronto, Canada

    Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
    Search for articles by this author
Open AccessPublished:November 08, 2018DOI:https://doi.org/10.1016/j.radonc.2018.10.027

      Highlights

      • Presented Safeguards ensure productive progress of the radiomic field.
      • Radiomic models and features should be tested to determine added prognostic and predictive accuracy compared to accepted clinical factors.
      • Radiomic features are susceptible to underlying dependencies and multi-collinearity within models.
      • Open-source software should be used in radiomic developments to increase development accountability and facilitate inter-institutional research.

      Abstract

      Purpose

      Refinement of radiomic results and methodologies is required to ensure progression of the field. In this work, we establish a set of safeguards designed to improve and support current radiomic methodologies through detailed analysis of a radiomic signature.

      Methods

      A radiomic model (MW2018) was fitted and externally validated using features extracted from previously reported lung and head and neck (H&N) cancer datasets using gross-tumour-volume contours, as well as from images with randomly permuted voxel index values; i.e. images without meaningful texture. To determine MW2018’s added benefit, the prognostic accuracy of tumour volume alone was calculated as a baseline.

      Results

      MW2018 had an external validation concordance index (c-index) of 0.64. However, a similar performance was achieved using features extracted from images with randomized signal intensities (c-index = 0.64 and 0.60 for H&N and lung, respectively). Tumour volume had a c-index = 0.64 and correlated strongly with three of the four model features. It was determined that the signature was a surrogate for tumour volume and that intensity and texture values were not pertinent for prognostication.

      Conclusion

      Our experiments reveal vulnerabilities in radiomic signature development processes and suggest safeguards that can be used to refine methodologies, and ensure productive radiomic development using objective and independent features.

      Keywords

      Cancer imaging features used to inform medical decisions are generated by experienced radiologists, often involving qualitative and experiential interpretation [
      • O’Connor J.P.B.
      • Rose C.J.
      • Waterton J.C.
      • Carano R.A.D.
      • Parker G.J.M.
      • Jackson A.
      Imaging intratumor heterogeneity: role in therapy response, resistance, and clinical outcome.
      ,
      • Orel S.G.
      • Kay N.
      • Reynolds C.
      • Sullivan D.C.
      BI-RADS categorization as a predictor of malignancy.
      ]. However, utilization of quantified patient imaging data for pattern recognition has recently increased. Radiomics involves the automated extraction of imaging features for use in multivariate predictions models and has demonstrated promise in defining predictive and prognostic factors [
      • Chong Y.
      • et al.
      Quantitative CT variables enabling response prediction in neoadjuvant therapy with EGFR-TKIs: Are they different from those in neoadjuvant concurrent chemoradiotherapy?.
      ] for disease relapse and mortality after treatment [
      • Fried D.V.
      • et al.
      Prognostic Value and reproducibility of pretreatment CT texture features in stage III non-small cell lung cancer.
      ,

      Coroller, TP, et al., Radiomic phenotype features predict pathological response in non-small cell lung cancer, vol. 119, pp. 480–486, 2016.

      ,

      Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

      ,
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ], and biological correlates [
      • Grossmann P.
      • et al.
      Defining the biological basis of radiomic phenotypes in lung cancer.
      ,

      Panth KM, et al., Is there a causal relationship between genetic changes and radiomics-based image features ? An in vivo preclinical experiment with doxycycline inducible GADD34 tumor cells, vol. 116, pp. 462–466, 2015.

      ]. The largest study to date was published in 2014 by Aerts et al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ], and showed the prognostic power of a radiomic signature for overall survival in lung, and head and neck (H&N) cancer. Radiomic features corresponding to shape, texture and image intensity were extracted from within the gross-tumour-volume (GTV) of 1019 pre-treatment CT images. A four-feature signature (First order-Energy, Shape-Compactness, Texture-Gray level non-uniformity (GLNU), and Wavelet HLH-GLNU) was defined by focusing on the most stable features for prognostication in a lung dataset, and validated using independent lung and H&N patient cohorts.
      Aerts et al.’s study [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ] provided detailed descriptions of their methods and data, permitting usage of their signature by multiple groups. Leijenaar et al. [
      • Leijenaar R.T.H.
      • et al.
      External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
      ] externally validated the model and signature using a H&N dataset from The Princess Margaret Cancer Centre. Grossman et al. [
      • Grossmann P.
      • et al.
      Defining the biological basis of radiomic phenotypes in lung cancer.
      ] and Vallières et al. [

      Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

      ] also successfully used the model and signature for lung and H&N prognostication and outcome predictions. However, the original model was developed using in-house Matlab code [

      “MATLAB.” The MathWorks Inc., Natick, USA; 2013.

      ], which was not openly released and thus limited reproducibility studies to groups with access to this software.
      Accessibility of software is a substantial concern since radiomic platforms are not interchangeable [
      • Bogowicz M.
      • et al.
      Post-radiochemotherapy PET radiomics in head and neck cancer – The influence of radiomics implementation on the reproducibility of local control tumor models.
      ]; therefore, reducing generalization and potential impact of published studies and models. Thus to provide the community with an open and modular platform, van Griethuysen et al. recently released PyRadiomics [

      van Griethuysen, JJM, Al, E, Computational radiomics system to decode the radiographic phenotype,” Submitted.

      ], an open-source implementation of the methods contained in the original radiomic Matlab code [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ]. Open-source software that complies with accepted standards will improve generalization of published studies and models, while promoting further independent development and evaluation of published models. The availability of open-source software, like PyRadiomics, and the original lung training dataset [

      Aerts HJWL, et al., Data From NSCLC-Radiomics. The Cancer Imaging Archive.”.

      ] used by Aerts [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ] permits further investigation of the prognostic and predictive value of these radiomic features.
      Through a detailed analysis of a radomic model this manuscript describes broad vulnerabilities and critical learnings in the application of established radiomic methodologies for the development and testing of radiomic signatures. These were observed using the original lung dataset [

      Aerts HJWL, et al., Data From NSCLC-Radiomics. The Cancer Imaging Archive.”.

      ], and validated using the H&N dataset used by Leijenaar et al. [
      • Leijenaar R.T.H.
      • et al.
      External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
      ]. Safeguards to defend against underlying dependences revealed through exhaustive testing of model and feature multicollinearity are reported here. These are designed for adoption by the community to improve the interpretability and accuracy of future radiomic signatures and models.

      Materials and methods

      Our analysis of a refitted model and its corresponding features deconstructs the prognostic accuracy by working in a “backwards” manner. By beginning our analysis at the final model and moving towards the univariate features we can suggest safeguards that will protect future radiomic developments from underlying dependencies and feature multicollinearity. This section details the datasets and methods used for our analysis.

       Datasets

      This study used two datasets: (1) The lung cancer dataset [

      Aerts HJWL, et al., Data From NSCLC-Radiomics. The Cancer Imaging Archive.”.

      ] (Lung1) used by Aerts et al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ] was obtained by the MAASTRO Clinic. Pre-treatment CT images and GTV contours were available for all original 422 patients; however, one patient had undergone surgical resection of their lung and was not used for this study. Disease status details for the 421 patients can be found in (Supplementary Table 1). The 421 patients were treated with induction concurrent chemoradiotherapy, radiotherapy alone, or chemotherapy followed by radiotherapy. Lung1 was used for model training; (2) The H&N cancer dataset (H&N1) reported by Leijenaar [
      • Leijenaar R.T.H.
      • et al.
      External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
      ] (previously named PMH1) was obtained from The Princess Margaret Cancer Centre with institutional research board approval to validate the fitted model. H&N1 comprised radiotherapy planning CT images and primary GTVs for 527 oropharyngeal squamous cell carcinoma patients treated with concurrent chemoradiotherapy or radiotherapy alone between 2005 and 2010 at The Princess Margaret Cancer Centre (Supplementary Table 2). GTVs were contoured by attending radiation oncologists based on available clinical-radiological evidence of disease extent, and used for IMRT treatment. Diagnostic magnetic resonance imaging (MRI) was often fused with planning CT to aid target delineation. Radiotherapy target volumes, including GTV were peer-reviewed (slice-by-slice comparison of planning CT and MRI) during weekly H&N Radiation Oncology Quality Assurance Rounds as part of routine clinical practice. Patients with multiple GTVs were excluded from both datasets, this resulted in a reduction of patients in the H&N1 dataset from 542 to 527.
      All methods and experimental protocols were in accordance with relevant guidelines and regulations. The study was approved by the Institutional Research Ethic board (IRB). Informed consent of the subjects was waived by our IRB since this is a retrospective review; a waiver would not affect the rights and welfare of the study subjects.

       Radiomic feature extraction

      PyRadiomics [

      van Griethuysen, JJM, Al, E, Computational radiomics system to decode the radiographic phenotype,” Submitted.

      ] was used to extract features from Lung1 and H&N1 GTVs. We did not select new features, and instead used the four features with the same name as those described previously by Aerts et al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ] (Supplementary Table 3): (1) Energy, from the first order feature class of PyRadiomics, calculates the overall density of the tumour volume. This feature is also commonly referred to as an intensity histogram-based uniformity feature [

      Zwanenburg, A, Leger, S, Vallieres, M, Lock, S, Image biomarker standardisation initiative, arXiv161207003; 2016.

      ]; (2) Shape compactness, compactness of tumour volume relative to sphere. We chose to use Compactness 1 in this work which has been recently used in other reproducibility studies [

      Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

      ,
      • Leijenaar R.T.H.
      • et al.
      External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
      ]; (3) Gray level non-uniformity (GLNU, defined as GLN in PyRadiomics documentation), intra-tumour heterogeneity from gray level run length matrices. A gray level run length matrix is calculated for all directions in the 3D imaging volume, and a corresponding GLNU feature is calculated for each direction. The resulting GLNU values for the 13 different directions are then averaged together to obtain the final GLNU features; and (4) HLH-GLNU, the same as (3), but after wavelet decomposition of the original image along the x, y and z-axes using a coif wavelet (Fig. 1). For consistency with Aerts et al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ] we did not interpolate patient images to uniform voxel sizes in all three dimensions or between patients, and used a fixed bin width of 25 for discretization of images and wavelets with a starting bin edge of 0, the default for PyRadiomics. More detailed explanations of feature calculations can be found in Pyradiomics documentation [

      van Griethuysen, JJM, Al, E, Computational radiomics system to decode the radiographic phenotype,” Submitted.

      ] and the Image Biomarker Standardization Initiative [

      Zwanenburg, A, Leger, S, Vallieres, M, Lock, S, Image biomarker standardisation initiative, arXiv161207003; 2016.

      ].
      Figure thumbnail gr1
      Fig. 1Workflow for training and usage of MW2018 Model: (1) CT images and GTVs are selected for Lung1 and H&N1 datasets. The selected DICOM data are converted to Python compatible file types; (2) Signature features are extracted using Pyradiomics; (3) Lung1 signature features are used to train a Cox model; (4) MW2018 model coefficients are used with extracted H&N1 features to calculate patient specific prognostic indices (PI). The resulting concordance index (c-index) is calculated.
      Additional tumour volume (in mm3, the default of PyRadiomics) and tumour volume normalized features were also extracted. Volume normalized feature calculation methods were added to PyRadiomics, and defined by adding a term to the original feature equation that normalized by the total number of voxels in the GTV [
      • Fave X.
      • et al.
      Impact of image preprocessing on the volume dependence and prognostic potential of radiomics features in non-small cell lung cancer.
      ] (Supplementary Table 3). These features were used later to test for feature volume dependencies.

       MW2018 model fitting and validation with and without voxel randomization

      Due to possible implementation differences [
      • Bogowicz M.
      • et al.
      Post-radiochemotherapy PET radiomics in head and neck cancer – The influence of radiomics implementation on the reproducibility of local control tumor models.
      ] between the original feature definitions in Aerts et al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ], and the open-source toolbox, PyRadiomics, we refit the Cox model-weights using Lung1 imaging features and the Survival R package (version 2.41–3). Our model was fitted to predict overall survival and is henceforth referred to as the MW2018 model to represent the first author’s initials and year of publication.
      The MW2018 model was externally validated using H&N1 features. Concordance index (c-index) values were calculated using prognostic index (PI) values (Fig. 1), where a high PI corresponded to low expected survival. PI is a linear predictor of the Cox model that expresses the risk for an event, and is output by the model as the sum of the products of the model coefficients multiplied by their corresponding features. We used the survcomp R package (version 1.28.4) [
      • Schröder M.S.
      • Culhane A.C.
      • Quackenbush J.
      • Haibe-Kains B.
      survcomp: An R/Bioconductor package for performance assessment and comparison of survival models.
      ] to compute the point estimate of c-index, its 95% confidence interval (CI) and p-value (two-sided, α = 0.05) using the noether approach [
      • Penciana M.J.
      • D’Agostino R.B.
      Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation.
      ]. Additionally, Kaplan–Meier curves were generated for MW2018 PI values, volume and, univariate features to describe overall survival since date of diagnosis. Log-rank tests compared survival distributions of high and low risk cohorts.
      To examine MW2018’s reliance on intensity features we tested it on simulated images devoid of meaningful intensity signals. Image intensity values of each patient’s CT were shuffled by randomly permuting voxel index values across the entire 3D dataset (Fig. 2). The newly voxel-randomized image, along with the unaltered clinical GTV contour, were then used for radiomic feature extraction. Ten sets of signature features were extracted from 10 separate Lung1 and H&N1 voxel-randomized datasets (Fig. 2). The c-index was calculated for all voxel-randomized univariate signature features for all ten sets of features. Additionally, MW2018 was applied to H&N1 extracted voxel-randomized features, and the c-index was calculated for the resulting PI values.
      Figure thumbnail gr2
      Fig. 2Workflow for analysing MW2018’s reliance on intensity features: (1) CT image voxel locations are randomized in all Lung1 and H&N1 datasets. (A) 2D axial image slice with contour before voxel-randomization; (B) 3D image volume with a rendered contour before voxel-randomization; (C) 2D voxel-randomized axial image slice with contour; (D) 3D voxel-randomized image volume with rendered contour; (2) The unaltered contours and their corresponding voxel-randomized images are used for feature extraction with PyRadiomics; (3) Univariate c-index values are calculated for signature features in both datasets. Our MW2018 model is applied to the signature features extracted from H&N1. The c-index values calculated with and without voxel-randomized images are compared to determine if the features and MW2018 are as prognostic when using images without meaningful intensity values.

       Analysis of signature tumour volume dependence and multicollinearity

      The c-index of MW2018 was statistically (two-sided) compared to tumour volume using the compareC R package (version 1.3.1) [
      • Kang L.
      • Chen W.
      • Petrick N.A.
      • Gallas B.D.
      Comparing two correlated C indices with right-censored survival outcome: a one-shot nonparametric approach.
      ] in the H&N1 dataset using unaltered and voxel-randomized images; statistically significant differences in c-index values were indicated by the calculated p-value. Correlation of tumour volume to signature features and MW2018 PIs was tested using Spearman rank correlation values (ρ), which was selected to ensure analysis of all potential monotonic correlations. Highly correlated features were normalized for volume and ρ was recalculated. Univariate c-index and log-rank test p-values for features before and after normalization were compared to determine dependence on volume.
      Independent prognostic information and interactive effects of features weakly correlated with volume were determined by fitting two Cox models and analysing covariant coefficient p-values. Model 1 was built using volume and a weakly correlated feature. Model 2 was built using volume, a weakly correlated feature, and the interactive term between the two as covariates. The feature vectors were normalized to unit vectors prior to fitting the Cox models by dividing each element (i.e. feature) by the vector magnitude. This was achieved using Eq. (1) where k is a feature vector and k is the norm, or magnitude, of the vector. The vector norm is calculated by taking the square root of the sum of the squared vector values. By normalizing the feature vectors prior to Cox model fitting, the calculated hazard ratios better reflected relative importance of the features.
      k^=kk
      (1)


      Extent of coefficient standard error inflation due to signature feature multicollinearity was explored in the training dataset using Variance Inflation Factors (VIF) [
      • Kutner M.H.
      • Machtsheim C.J.
      10.5 multicollinearity diagnostics – variance inflation factor.
      ,
      • Farrar D.E.
      • Glauber R.R.
      Multicollinearity in regression analysis: the problem revisited.
      ]. VIFs provide a measurement of how estimated regression coefficient variance is increased by collinearity, and were calculated for each of the previously selected signature features in Lung1 using Eq. (2). For a given feature, k, a linear regression model was fit to predict k using the other three features. Rk2 is the coefficient of determination for the linear regression model fit without feature k.
      VIFk=11-Rk2
      (2)


      The square root of a VIF indicates how much larger the standard error is in comparison to the case in which the feature was uncorrelated with other features in the model.

      Results

       Model validation with and without Voxel-Randomized images

      The MW2018 model had a c-index of 0.59 (CI = 0.55–0.63 p-value < 0.05) (Table 1) in the Lung1 data when predicting overall survival. MW2018 yielded a c-index of 0.64 (CI = 0.60–0.68, p-value < 0.05) in the H&N1 dataset (Table 1 and Fig. 3d). Univariate feature data for energy, compactness 1, GLNU, and HLH-GLNU had c-index of 0.57, 0.45, 0.61 and 0.60, respectively in Lung1, and 0.61, 0.42, 0.64, and 0.63 (CI and p-values in Table 1), respectively in H&N1. Kaplan–Meier curves for the four univariate features are in Supplementary Figs. 3 and 4.
      Table 1c-Index and p-values from original and voxel-randomized Lung1 and H&N1 datasets. 95% CI are shown for point-estimated c-index values. Standard deviations are reported for c-index values calculated across multiple runs, namely the application of MW2018 to the 10 randomized datasets, and the univariate calculations on the 10 randomized datasets.
      Lung1H&N1
      FeatureImagec-Index (95% CI)p-Valuec-Index (95% CI)p-Value
      UnivariateEnergyOriginal0.57 (0.53–0.61)<0.050.61 (0.57–0.65)<0.05
      Compactness 10.45 (0.41–0.49)<0.050.42 (0.38–0.47)<0.05
      GLNU0.61 (0.57–0.64)<0.050.64 (0.59–0.68)<0.05
      HLH GLNU0.60 (0.56–0.64)<0.050.63 (0.59–0.68)<0.05
      Volume0.60 (0.56–0.64)<0.050.64 (0.59–0.68)<0.05
      EnergyRandomized0.60 ± 0.006<0.0010.63 ± 0.0004<0.05
      Compactness 10.45 ± 0.00.0270.42 ± 0.0<0.05
      GLNU0.60 ± 0.006<0.0010.64 ± 0.0003<0.05
      HLH GLNU0.60 ± 0.006<0.0010.64 ± 0.0003<0.05
      Normalized EnergyOriginal0.43 (0.39–0.47)<0.050.46 (0.42–0.50)0.058
      Normalized GLNU0.58 (0.54–0.62)<0.050.56 (0.52–0.61)<0.05
      Normalized HLH GLNU0.57 (0.53–0.61)<0.050.57 (0.52–0.61)<0.05
      MW2018 ApplicationOriginal0.59 (0.55–0.63)<0.050.64 (0.60–0.68)<0.05
      Randomized0.60 ± 0.006N/A0.64 ± 0.0003N/A
      Figure thumbnail gr3
      Fig. 3Kaplan–Meier curves displaying radiomic signature performance during stratification of patients into high and low risk groups: (A) good performance in Lung1 with training concordance index (c-index) of 0.59 (CI = 0.55–0.63). The log-rank test of the Kaplan–Meier curve had a p-value = 6.19e−06; (B) good performance in randomized Lung1 image volumes with a c-index across 10 runs of 0.60 ± 0.006 and an example log-rank test p-value = 9.62e−03; (C) good performance of tumour volume in Lung1 with c-index = 0.60 (CI = 0.56–0.64) and a log-rank test p-value = 6.41e−07; (D) good performance MW2018 in H&N1 with c-index = 0.64 (CI = 0.60–0.68) and a log-rank test p-value = 2.83e−10; E) good performance in 10 randomized iterations with H&N1 image volumes with c-index = 0.64 ± 0.0003 and an example log-rank test p-value = 4.73e−08; (F) good performance of tumour volume in H&N1 with c-index = 0.64 (CI = 0.59–0.68) and a log-rank test p-value = 2.25e−7.
      The same MW2018 model had c-index values of 0.60 and 0.64 when applied to voxel-randomized Lung1 and H&N1 features, respectively (Fig. 3b and e); the 10 different voxel- randomizations resulted in c-index standard deviations less than 0.006 (Supplementary Fig. 2). Individual features of energy, compactness 1, GLNU, and HLH-GLNU had c-index = 0.60, 0.45, 0.60, 0.60, respectively in Lung1 and c-index = 0.63, 0.42, 0.64, 0.64, respectively in H&N1; standard deviation was less than 0.006 for all univariate feature CIs (Supplementary Fig. 2). Kaplan–Meier curves for the four univariate features extracted from voxel-randomized images are in Supplementary Figs. 5 and 6.

       Signature tumour volume dependence analysis and multicollinearity

      Tumour volume alone had a c-index of 0.60 (CI = 0.56–0.64, p-value < 0.05) and 0.64 (CI = 0.59–0.68, p-value < 0.05) (Fig. 3c and f) in Lung1 and H&N1, respectively. When the c-index of univariate volume is compared to the c-index of MW2018 with unaltered H&N1 images we cannot conclude that MW2018 is more prognostic than univariate volume (p-value = 0.90). Additionally, based on our data and experiments, we cannot conclude that the model applied to voxel-randomized images is significantly different from univariate volume (average p-value = 0.95).
      We calculated spearman rank values for tumour volume versus multivariate MW2018 PI, energy, GLNU, and HLH-GLNU (Table 2). The spearman rank values of energy, GLNU and HLH-GLNU all changed after volume normalization (Table 2). c-index values of volume-normalized features were also reduced in both Lung1 and H&N1 (c-index values and p-values in Table 1). Kaplan–Meier curves for univariate features before and after volume normalization are in Supplementary Figs. 3 and 4. The VIF values for energy, compactness 1, GLNU, and HLH-GLNU in Lung1 are 1.94, 1.41, 56.56, and 62.68, respectively (Supplementary Table 4), indicating high multicollinearity between GLNU and HLH-GLNU.
      Table 2Spearman rank correlation values comparing tumour volume to patient MW2018 PIs and individual radiomic signature features before and after tumour volume normalization in Lung1 and H&N1.
      Feature nameSpearman rank value
      Pre volume normalizationPost volume normalization
      H&N1First Order: Energy0.830.28
      Shape: Compactness 1−0.38N/A
      Texture: GLNU0.95−0.51
      Wavelet: HLH GLNU0.99−0.28
      MW2018 PI0.79N/A
      Lung1First Order: Energy0.76−0.49
      Shape: Compactness 1−0.50N/A
      Texture: GLNU0.970.58
      Wavelet: HLH GLNU0.990.49
      MW2018 PI0.87N/A
      Compactness 1 and tumour volume had spearman rank values of ρ = −0.50 and −0.38, in Lung1 and H&N1, respectively. Tumour volume, compactness and an interaction term between the two were tested using multivariate Cox models, as summarized in Table 3. Interaction terms were not significant in either Lung1 or H&N1 (p-value = 0.91 and 0.83 respectively), indicating that the prognostic effect of compactness does not depend on volume significantly, and vice versa. When tumour volume and compactness without the interaction term were tested in a Cox model, both provided independent prognostic information in H&N1 (Volume: HR = 2.21e + 05, p-value < 0.001; Compactness: HR = 1.08e−09, p-value = 0.02) but only volume was significant in Lung1 (Volume: HR = 9.30e+02, p-value < 0.001; Compactness: HR = 5.47e−04, p-value = 0.21).
      Table 3Coefficients for Cox model with weakly correlated covariates. Interactive effects are tested for by multiplying two covariates of interest (represented by *). Significant p-values (α = 0.05) are highlighted. All feature vectors have been scaled to unit vectors.
      ModelFeatureCoefficient (95% CI)Hazard Ratio (95% CI):p-Valuec-Index
      H&N11Volume1.23e+01

      (9.38e+00–1.52e+01)
      2.21e+05

      (1.19e+04–4.12e+06)
      <0.0010.64
      Compactness 1-2.06e+01 (−3.52e+01 to −6.10e+00)1.08e−09

      (5.24e−16–2.23e−03)

      0.02
      2Volume1.03e+01

      (−5.28e+00–2.58e+01)
      2.88e+04

      (5.06e−03–1.63e+11)
      0.280.64
      Compactness 1−2.26e+01

      (−4.32e+01–−1.97e+00)
      1.52e−10

      (1.67e−19–1.39e−01)
      0.07
      Volume* Compactness 14.93e+01

      (−3.19e−02 to 4.18e+02)
      2.67e+21

      (2.29e−139–3.11e+181)
      0.83
      Volume6.84e+00

      (4.29e+00–9.38e+00)
      9.30e+02

      (7.29e+01–1.19e+04)
      <0.0010.60
      Lung11Compactness 1−7.51e+00

      (−1.73e+01–2.27e+00)
      5.47e−04

      (3.10e−08–9.65e+00)
      0.21
      Volume6.02e+00

      (−6.30e+00–1.83e+01)
      4.13e+02

      (1.85e−03–9.26e+07)
      0.42
      Compactness 1−8.01e+00

      (−2.03e+01–4.25e+00)
      3.33e−04

      (1.58e−09–7.02e+01)
      0.280.60
      2Volume*

      Compactness 1
      1.69e+01

      (−2.33e−02–2.67e−02)
      2.10e+07

      (4.24e−102–1.04e+116)
      0.91

      Discussion

      Refinement of methodology and increased transparency are integral to the future of radiomics. In this work, a Cox model was fitted that had a prognostic accuracy of 0.64 when externally validated, similar to previous publications utilizing the same signature [

      Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

      ,
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ,
      • Grossmann P.
      • et al.
      Defining the biological basis of radiomic phenotypes in lung cancer.
      ],
      • Leijenaar R.T.H.
      • et al.
      External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
      ,
      • Leger S.
      • et al.
      A comparative study of machine learning methods for time-to-event survival data for radiomics risk modelling.
      . However, it was determined that tumour volume alone was as prognostic as the model in H&N1 (c-index = 0.64 (CI = 0.59–0.68) vs. c-index = 0.64 (CI = 0.60–0.68), p-value = 0.90), and was embedded in three of the four signature features, suggesting a lack of feature independence (Table 1, Table 2). In the course of the work described here, several “safeguard” tests were formulated to guide the development of future radiomic signatures and models to ensure productive model development.
      Radiomic models and signatures embedded or confounded by underlying features are a concern when drawing conclusions between features and disease relationships. Underlying dependencies may also impede prognostic and predictive accuracy since resulting models will have been fit using multiple surrogates of the same feature. With the above concerns in mind, we recommend the following safeguards be followed in future radiomic studies:
      • (1)
        Open-source software, such as PyRadiomics, should be used to increase development accountability and facilitate inter-institutional research, transparency in methodological detail, and further refinement and testing of published work and methodologies. The software should comply with accepted software development and imaging biomarker standards.
      • (2)
        Models and features should be tested to determine added prognostic and predictive accuracy compared to accepted clinical factors (e.g. tumour volume [
        • Chao K.S.C.
        • et al.
        Intensity-modulated radiation therapy for oropharyngeal carcinoma: Impact of tumor volume.
        ]). This safeguard is best done with clinical engagement.
      • (3)
        Multicollinearity of features should be investigated during model development with a training dataset. This will increase data variance description and improve precision of estimated regression coefficients, depending on modelling technique.
      • (4)
        Features should be tested for underlying dependencies using statistical analysis or by perturbing the data in controlled ways (as was demonstrated in this work). Perturbing data is a straightforward approach, but caution is warranted since it is difficult to control for confounding factors.
      • (5)
        Pre-processing of data should be done to ensure understanding of image signal quality (e.g. presence of dental artifacts). This will reduce the possibility of calculating erroneous features as demonstrated in the voxel-randomization portions of this work; and
      • (6)
        Manual contouring protocols should be included to describe any prevalent imaging signals (e.g. diagnostic MR images used in the determination of GTVs) utilized during delineation; particularly when volume remains highly influential in radiomic features.
      Possible differences in radiomic feature implementation for the analysis of the four-feature prognostic model prevented exact reproduction of previous results, as the model itself is sensitive to small variations. However, signature analysis was permitted using PyRadiomics extracted Lung1 features to fit an updated model, MW2018, with an external-validation accuracy of 0.64 (0.60–0.68). Volume dominance in MW2018 was tested with data perturbation. Permuting image voxels decimated intensity feature components, while preserving volume and shape information; therefore, if MW2018 remained prognostic it was not because of patient specific intensity values and texture present in the GTV, but because of the GTV. In this scenario, MW2018 remained prognostic with c-index = 0.60 and 0.64 for Lung1 and H&N1, respectively, with standard deviations less than 0.006 across different voxel-randomization iterations. This implies that image intensity dependent features (energy, GLNU and HLH-GLNU), in conjunction with the cox model, did not capture relevant tumour phenotype information associated with texture, but were capturing the clinician delineated tumour volume, an established prognostic factor [
      • Chao K.S.C.
      • et al.
      Intensity-modulated radiation therapy for oropharyngeal carcinoma: Impact of tumor volume.
      ]. Additionally, the volume feature alone was as prognostic as MW2018 in Lung1 and H&N1 (c-index = 0.60 (0.56–0.64) and 0.64 (0.59–0.68)). These findings demonstrate the importance of careful and controlled experimentation, and a deeper exploration of proposed features. It should be noted that this does not mean that these features cannot be prognostic under a different signature or method of measurement.
      It was determined that volume was driving the prognostic accuracy of univariate energy, GLNU and HLH-GLNU when the features were normalized for volume and had decreased univariate prognostic accuracies (Table 1 and Supplementary Figs. 3 and 4). This is consistent with recent published work demonstrating strong correlation between tumour volume and radiomic features in H&N [

      Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

      ] and lung [
      • Fave X.
      • et al.
      Impact of image preprocessing on the volume dependence and prognostic potential of radiomics features in non-small cell lung cancer.
      ] cancer. However, it should be noted that only linear dependencies were explored and higher order dependencies may affect other features. This dependency may be the result of the initial feature selection process which selected features that remained stable between two scans [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ]; therefore, biasing selection towards volume dependent features that would otherwise be unstable, such as wavelet features. Furthermore, high VIF values for GLNU and HLH-GLNU in the Lung1 dataset indicate that they describe the same data variance. Future model development should investigate structural multicollinearity amongst features to yield more stable models, while also potentially increasing prognostic accuracy above univariate tumour volume by including features that provide additional information. However, caution is required to ensure that unintended analysis of validation data does not occur, particularly when performing internal validation in the form of hold-out sets and out-of-bag bootstrapping.
      The volume dependence of the signature’s intensity based features may also be attributed to a variety of experimental processes, namely discretization of images and the original feature selection method. For this work we used a fixed bin width of 25 for discretization, the same methods employed by Aerts et. al. [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ]. Discretization of images is used for texture analysis and image noise reduction [

      Aerts HJWL, et al., Data From NSCLC-Radiomics. The Cancer Imaging Archive.”.

      ], and requires appropriate bin widths to avoid over or under binning an image. A bin width of 25 is appropriate for the original images, but dynamic binning is the preferred method for wavelet images since the HU scale of the image is lost after filtering. Using narrow bin widths the images are subjected to over binning, resulting in the HLH-GLNU feature essentially counting the number of voxels in the ROI. Furthermore, interpolation to uniform voxel sizes are important in wavelet filtered images to ensure spatial frequencies are the same in every direction (although care must be taken that new artifacts are not introduced [
      • Aganj I.
      • Yeo B.T.T.
      • Sabuncu M.R.
      • Fischl B.
      On removing interpolation and resampling artifacts in rigid image registration.
      ,
      • Tsao J.
      Interpolation artifacts in multimodal image registration based on maximization of mutual information.
      ]), a method that was not used in our, or Aerts et. al’s [
      • Aerts H.J.W.L.
      • et al.
      Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
      ], work. Methods for voxel size interpolation are readily available in platforms like PyRadiomics, but dynamic binning methods are still needed; until these methods are standardized and implemented in radiomic platforms, utilization of our proposed methods will aid in safeguarding signatures and models from features with underlying volume dependencies.
      Furthermore, manual tumour contours need to be considered since they are often determined from more than signal delineation on a single modality. Best practices for clinical contouring involve embedded and confounding parameters that influence delineations in ways that are hard to quantify, including additional patient characteristics, patterns of spread, and disease atlases; this is particularly true for H&N cancer [

      Grégoire, V, et al., Delineation of the neck node levels for head and neck tumors: a 2013 update, vol. 110, pp. 172–181, 2013.

      ,
      • Grégoire V.
      • Eisbruch A.
      • Hamoir M.
      • Levendag P.
      Proposal for the delineation of the nodal CTV in the node-positive and the post-operative neck.
      ]. Utilization of semi and fully-automated contouring software [
      • Velazquez E.R.
      • et al.
      Volumetric CT-based segmentation of NSCLC using 3D-Slicer.
      ] could provide a more objective disease delineation, but could not be combined with clinical contours since they would not contain the nuanced information that is often important in clinical contours. Therefore, involvement of clinicians and detailed descriptions of the contouring protocols employed in radiomic analyses would ensure the importance of various features is appropriately attributed.
      This work demonstrates the importance of being cautiously optimistic about radiomic signatures. Our findings do not imply that texture is not prognostic or predictive, only that enforced usage of standardization protocols and the adherence to the above safeguards is essential in avoiding incorrect feature interpretation and overly optimistic claims of robustness and generalizability. Radiomics is a relatively young field of study and requires transparency through high fidelity sharing of data and software to ensure progressive science. Radiomics dramatically expands our capacity to integrate imaging information into personalized cancer medicine, but it also represents a methodological shift within cancer research towards employment of highly automated approaches. The establishment of safeguards will serve to accelerate true progress in radiomics by educating researchers on the pitfalls and subtleties of these approaches and by encouraging broader engagement in the development and evaluation of the results.

      Acknowledgements

      The authors thank Dr. John Waldron and Mr. Biu Chan for their assistance in obtaining and curating the head and neck dataset used in this study. This work was supported by the Natural Sciences and Engineering Research Council to MLW; the Strategic Training in Transdisciplinary Radiation Science for the 21st Century program to MLW; the Canadian Institutes for Health Research ; the Ontario Institute for Cancer Research ; and the Terry Fox Research Institute . DAJ is the Mary and Orey Fidani Family Chair in Radiation Physics.

      Conflict of interest statement

      Dr. Dekker has a patent pending for systems, method and devices for analysing quantitative information obtained from radiological images. All other authors have no disclosures related to this work.

      Appendix A. Supplementary data

      The following are the Supplementary data to this article:

      References

        • O’Connor J.P.B.
        • Rose C.J.
        • Waterton J.C.
        • Carano R.A.D.
        • Parker G.J.M.
        • Jackson A.
        Imaging intratumor heterogeneity: role in therapy response, resistance, and clinical outcome.
        Clin Cancer Res. 2015; 21: 249-257
        • Orel S.G.
        • Kay N.
        • Reynolds C.
        • Sullivan D.C.
        BI-RADS categorization as a predictor of malignancy.
        Radiology. 1999; 211: 845-850
        • Chong Y.
        • et al.
        Quantitative CT variables enabling response prediction in neoadjuvant therapy with EGFR-TKIs: Are they different from those in neoadjuvant concurrent chemoradiotherapy?.
        PLoS One. 2014; 9: 1-8
        • Fried D.V.
        • et al.
        Prognostic Value and reproducibility of pretreatment CT texture features in stage III non-small cell lung cancer.
        Radiat Oncol Biol. 2014; 90: 834-842
      1. Coroller, TP, et al., Radiomic phenotype features predict pathological response in non-small cell lung cancer, vol. 119, pp. 480–486, 2016.

      2. Vallières, M, et al., Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, pp. 1–33, 2017.

        • Aerts H.J.W.L.
        • et al.
        Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach.
        Nat Commun. 2014; 5: 4006
        • Grossmann P.
        • et al.
        Defining the biological basis of radiomic phenotypes in lung cancer.
        Elife. 2017; : 1-22
      3. Panth KM, et al., Is there a causal relationship between genetic changes and radiomics-based image features ? An in vivo preclinical experiment with doxycycline inducible GADD34 tumor cells, vol. 116, pp. 462–466, 2015.

        • Leijenaar R.T.H.
        • et al.
        External validation of a prognostic CT-based radiomic signature in oropharyngeal squamous cell carcinoma.
        Acta Oncol (Madr). 2015; 54: 2016
      4. “MATLAB.” The MathWorks Inc., Natick, USA; 2013.

        • Bogowicz M.
        • et al.
        Post-radiochemotherapy PET radiomics in head and neck cancer – The influence of radiomics implementation on the reproducibility of local control tumor models.
        Radiother Oncol. 2017;
      5. van Griethuysen, JJM, Al, E, Computational radiomics system to decode the radiographic phenotype,” Submitted.

      6. Aerts HJWL, et al., Data From NSCLC-Radiomics. The Cancer Imaging Archive.”.

      7. Zwanenburg, A, Leger, S, Vallieres, M, Lock, S, Image biomarker standardisation initiative, arXiv161207003; 2016.

        • Fave X.
        • et al.
        Impact of image preprocessing on the volume dependence and prognostic potential of radiomics features in non-small cell lung cancer.
        Transl Cancer Res. 2016; 5: 349-363
        • Schröder M.S.
        • Culhane A.C.
        • Quackenbush J.
        • Haibe-Kains B.
        survcomp: An R/Bioconductor package for performance assessment and comparison of survival models.
        Bioinformatics. 2011; 27: 3206-3208
        • Penciana M.J.
        • D’Agostino R.B.
        Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation.
        Stat Med. 2004; 23: 2109-2123
        • Kang L.
        • Chen W.
        • Petrick N.A.
        • Gallas B.D.
        Comparing two correlated C indices with right-censored survival outcome: a one-shot nonparametric approach.
        Stat Med. 2015; 34: 685-703
        • Kutner M.H.
        • Machtsheim C.J.
        10.5 multicollinearity diagnostics – variance inflation factor.
        Applied linear regression models. 4th ed. McGraw-Hill Irwin, 2004
        • Farrar D.E.
        • Glauber R.R.
        Multicollinearity in regression analysis: the problem revisited.
        Rev Econom Stat. 1967; 49: 92-107
        • Leger S.
        • et al.
        A comparative study of machine learning methods for time-to-event survival data for radiomics risk modelling.
        Sci Rep. 2017; 7: 13206
        • Chao K.S.C.
        • et al.
        Intensity-modulated radiation therapy for oropharyngeal carcinoma: Impact of tumor volume.
        Int J Radiat Oncol Biol Phys. 2004; 59: 43-50
        • Aganj I.
        • Yeo B.T.T.
        • Sabuncu M.R.
        • Fischl B.
        On removing interpolation and resampling artifacts in rigid image registration.
        IEEE Trans Image Process. 2013; 22: 816-827
        • Tsao J.
        Interpolation artifacts in multimodal image registration based on maximization of mutual information.
        IEEE Trans Med Imaging. 2003; 22: 854-864
      8. Grégoire, V, et al., Delineation of the neck node levels for head and neck tumors: a 2013 update, vol. 110, pp. 172–181, 2013.

        • Grégoire V.
        • Eisbruch A.
        • Hamoir M.
        • Levendag P.
        Proposal for the delineation of the nodal CTV in the node-positive and the post-operative neck.
        Radiother Oncol. 2006; 79: 15-20
        • Velazquez E.R.
        • et al.
        Volumetric CT-based segmentation of NSCLC using 3D-Slicer.
        Sci Rep. 2013; 3: 3529