A quality assurance framework for routine monitoring of deep learning cardiac substructure computed tomography segmentation models in radiotherapy

Xiyao Jin; Yao Hao; Jessica Hilliard; Zhehao Zhang; Maria A Thomas; Hua Li; Abhinav K Jha; Geoffrey D Hugo

doi:10.1002/mp.16846

Med Phys. 2024 Apr;51(4):2741-2758. doi: 10.1002/mp.16846. Epub 2023 Nov 28.

Authors

Xiyao Jin¹, Yao Hao¹, Jessica Hilliard¹, Zhehao Zhang¹, Maria A Thomas¹, Hua Li¹, Abhinav K Jha²,³, Geoffrey D Hugo¹

Affiliations

¹ Department of Radiation Oncology, Washington University in St. Louis School of Medicine, St. Louis, Missouri, USA.
² Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, Missouri, USA.
³ Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA.

PMID: 38015793

Abstract

Background: For autosegmentation models, the data used to train the model (e.g., public datasets and/or vendor-collected data) and the data on which the model is deployed in the clinic typically differ, potentially degrading model performance through a phenomenon called domain shift. Tools to routinely monitor and predict segmentation performance are needed for quality assurance. Here, we develop an approach to perform such monitoring and performance prediction for cardiac substructure segmentation.

Purpose: To develop a quality assurance (QA) framework for routine or continuous monitoring of domain shift and the performance of cardiac substructure autosegmentation algorithms.

Methods: A benchmark dataset consisting of computed tomography (CT) images along with manual cardiac substructure delineations of 241 breast cancer radiotherapy patients was collected, including one "normal" image domain of clean images and five "abnormal" domains containing images with artifacts (metal, contrast), pathology, or quality variations due to scanner protocol differences (field of view, noise, reconstruction kernel, and slice thickness). The QA framework consisted of three components: an image domain shift detector that operated on the input CT images, a shape quality detector that operated on the output of an autosegmentation model, and a regression model for predicting autosegmentation model performance. The image domain shift detector was composed of a trained denoising autoencoder (DAE) and two hand-engineered image quality features to detect normal versus abnormal domains in the input CT images. The shape quality detector was a variational autoencoder (VAE) trained to estimate the shape quality of the autosegmentation results. The outputs from the image domain shift and shape quality detectors were used to train a regression model to predict the per-patient segmentation accuracy, measured by the Dice similarity coefficient (DSC) relative to physician contours. Different regression techniques were investigated, including linear regression, bagging, Gaussian process regression, random forest, and gradient boosting regression. Of the 241 patients, 60 were used to train the autosegmentation models, 120 for training the QA framework, and the remaining 61 for testing the QA framework. A total of 19 autosegmentation models were used to evaluate QA framework performance, including 18 convolutional neural network (CNN)-based models and one transformer-based model.
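As an illustration of the accuracy metric named in the Methods, the following is a minimal sketch of a Dice similarity coefficient computation between an autosegmentation mask and a physician mask. It is not the authors' code: the function name `dice_similarity`, the flattened-list mask representation, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for this example.

```python
from typing import Sequence


def dice_similarity(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks.

    DSC = 2 * |A intersect B| / (|A| + |B|), where A and B are the sets of
    foreground voxels in the predicted and reference masks.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    pred_count = sum(1 for p in pred if p)
    ref_count = sum(1 for r in ref if r)
    if pred_count + ref_count == 0:
        # Assumed convention for this sketch: two empty masks agree perfectly.
        return 1.0
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    return 2.0 * overlap / (pred_count + ref_count)


# Toy, made-up masks (1 = voxel inside the substructure).
auto_mask = [0, 1, 1, 1, 0, 0]
physician_mask = [0, 0, 1, 1, 1, 0]

dsc = dice_similarity(auto_mask, physician_mask)
print(f"DSC = {dsc:.3f}")  # 2 * 2 / (3 + 3) = 0.667
```

In practice the masks would be 3D label volumes, one per cardiac substructure; the flattened form is used here only to keep the example dependency-free.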

Results: When tested on the benchmark dataset, all abnormal domains resulted in a significant DSC decrease relative to the normal domain for CNN models ($p < 0.001$), but only some domains did so for the transformer model. No significant relationship was found between the performance of an autosegmentation model and scanner protocol parameters ($p = 0.42$) except noise ($p = 0.01$). CNN-based autosegmentation models demonstrated a DSC decrease ranging from 0.07 to 0.41 with added noise, while the transformer-based model was not significantly affected (ANOVA, $p = 0.99$). For the QA framework, linear regression models with bootstrap aggregation resulted in the highest mean absolute error (MAE) of $0.041 \pm 0.002$ in predicted DSC (relative to true DSC between autosegmentation and physician). MAE was lowest when combining both input (image) detectors and output (shape) detectors compared to output detectors alone.
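To make the regression and error metric above concrete, here is a minimal sketch of linear regression with bootstrap aggregation (bagging), scored by mean absolute error. It is a simplified stand-in, not the authors' implementation: it uses a single hypothetical detector score as the only feature, whereas the framework described above combines several image and shape detector outputs. All data values, function names, and the choice of 50 bootstrap models are invented for illustration.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x for a single feature."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bagged_fit(xs, ys, n_models=50, seed=0):
    """Fit one line per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all bootstrap models."""
    return fmean(a + b * x for a, b in models)


def mean_absolute_error(y_true, y_pred):
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


# Synthetic, made-up data: a hypothetical detector score versus true DSC.
train_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
train_dsc = [0.88, 0.86, 0.79, 0.71, 0.66, 0.58, 0.55, 0.47]
test_score = [0.25, 0.65]
test_dsc = [0.82, 0.57]

models = bagged_fit(train_score, train_dsc)
predicted_dsc = [bagged_predict(models, s) for s in test_score]
mae = mean_absolute_error(test_dsc, predicted_dsc)
print(f"MAE in predicted DSC = {mae:.3f}")
```

Averaging over bootstrap resamples reduces the variance of the fitted model, which is the usual motivation for bagging when the training set is small. The MAE printed here reflects only the toy data and has no relation to the values reported above.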

Conclusions: A QA framework was able to predict cardiac substructure autosegmentation model performance for clinically anticipated "abnormal" domain shifts.

Keywords: auto‐segmentation; domain shift; quality assurance.

MeSH terms

Breast
Deep Learning*
Heart / diagnostic imaging
Humans
Image Processing, Computer-Assisted / methods
Neural Networks, Computer
Tomography, X-Ray Computed / methods