Practice Points
- Clinical trials typically use semiautomated computer software to identify new lesions on MRI, while clinical practice typically relies on a human radiologist who usually does not have access to lesion identification software.
- There is discordance between semiautomated methods and human readers in detecting new lesions; semiautomated software detects many more lesions than a human reader.
- Clinicians should consider this lesion counting discordance when reviewing the new lesion rates reported in clinical trials.
MRI has been a mainstay of multiple sclerosis (MS) clinical practice and research for several decades. New or enlarging hyperintense lesions on T2-weighted images are a key marker of inflammatory injury and disease activity in MS.1-4 In clinical practice, new lesions are identified by a radiologist or neurologist, while in clinical trials evaluating MS therapies, new lesions are usually identified by an automated or semiautomated evaluation.5 Little is known about how these 2 methods compare, which is important when translating clinical trial results into routine clinical practice. This study aimed to use a combination of qualitative and quantitative methods to compare physician and computer-based identification of new/enlarging brain T2 lesions obtained during a clinical trial.
Methods
Imaging Acquisition
Ethical approval was obtained from the Cleveland Clinic’s institutional review board. From November 2013 until May 2017, patients with progressive MS enrolled in the phase 2 SPRINT-MS trial (NCT01982942) underwent standardized MRI scans at baseline and every 24 weeks up to 96 weeks for a total of 5 MRIs each.5 MRIs were obtained using Siemens (MAGNETOM Trio or Skyra) or GE (version 12.x or higher) 3T scanners. Image acquisition included T1-weighted 3D-spoiled gradient-recalled echo, proton density weighted and T2-weighted 2D turbo/fast spin-echo, and 2D T2-weighted fluid-attenuated inversion recovery (FLAIR). Gadolinium was not used. Details of the MRI acquisition and overall results of the SPRINT-MS trial have been previously published.6 This analysis was conducted without accounting for treatment assignment.
Semiautomated Imaging Review
The semiautomated method (SAM) was performed by a central neuroimaging analysis center, which first conducted quality control on images received from clinical trial sites. Incorrect MRI acquisition parameters, incorrect head angle, and excessive motion were common reasons for scan rejection. T1-weighted, T2-weighted, T2-weighted FLAIR, and proton density weighted images at reference and follow-up time points were all rigidly coregistered to a common patient space at the native resolution of the T2-weighted images (1 × 1 × 3 mm). New/enlarging T2 lesions were automatically segmented by simultaneously considering the coregistered images from all 4 contrasts at both reference and follow-up time points, using voxelwise Bayesian-driven probabilistic classification followed by random-forest–based classification of individual new/enlarging T2 lesions.7 Automatic segmentation was followed by visual confirmation by an expert rater and manual correction where necessary.7 Counts and volumes of new/enlarging T2 lesions were recorded from the lesion boundaries determined in the common patient space.
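The voxelwise Bayesian step can be illustrated with a deliberately simplified, single-channel sketch; the trial's actual classifier combined 4 coregistered contrasts at 2 time points and is not reproduced here, and the intensity distributions and spatial prior below are hypothetical:

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """Gaussian probability density, used as a simple intensity likelihood."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def lesion_posterior(intensity, prior,
                     lesion_mu, lesion_sigma, tissue_mu, tissue_sigma):
    """Posterior probability that a voxel is a new lesion, by Bayes' rule,
    from a single-channel Gaussian likelihood and a spatial prior.
    All distribution parameters here are made-up illustrations."""
    p_lesion = gaussian(intensity, lesion_mu, lesion_sigma) * prior
    p_tissue = gaussian(intensity, tissue_mu, tissue_sigma) * (1 - prior)
    return p_lesion / (p_lesion + p_tissue)
```

A bright voxel whose intensity matches the hypothetical lesion distribution (mean 100) far better than the tissue distribution (mean 50) receives a posterior near 1; in the real pipeline, such voxel posteriors feed the subsequent random-forest lesion-level classifier.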
Human Imaging Review
All MRI scans were reviewed by a single, blinded, board-certified neuroradiologist with more than 10 years of experience, who identified new/enlarging T2 lesions relative to previous images using the native, original Digital Imaging and Communications in Medicine (DICOM) images.
Statistical Analysis
The percentage of paired scans with nonzero new/enlarging lesion counts by either method was calculated and compared using a 2-proportion z test. Each paired scan was treated as an independent observation because each follow-up scan was compared only with its own reference scan. To exclude the potential confounding effect of small lesions, additional sensitivity analyses excluded lesions smaller than 50 mm³ or 14 mm³, as measured by the SAM. Mean lesion counts per paired scan were compared using the paired t test for each subgroup; κ was calculated using the irr package. All analyses were conducted using R software.
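As a minimal illustration of the two comparisons above, the 2-proportion z test and Cohen's κ for a 2 × 2 agreement table can be computed with the standard library alone. This is a sketch, not the trial's R/irr code, and the example counts in the test are reconstructed only approximately from the reported percentages:

```python
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z test with a pooled standard error;
    returns the z statistic and the 2-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # 2-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    a = positive by both methods, b = method 1 only,
    c = method 2 only, d = negative by both."""
    n = a + b + c + d
    p_obs = (a + d) / n                                      # raw agreement
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

With binary detection counts on the order reported in this study (roughly 175 of 887 scans positive by one method vs 46 by the other), the z statistic exceeds 9 and the 2-sided P value falls well below .001. Note that κ computed on such a binary table need not match the κ reported in the Results, which was derived with the irr package and may use a different construction.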
Qualitative Comparison
To understand the reasons for any significant discordance between the SAM and neuroradiologist assessments, 7 sample paired scans with notable discordance were reviewed by a second board-certified neuroradiologist (15 years of experience) who was unblinded to the previous human and semiautomated identification results. Additional advanced imaging tools, including 3D rigid coregistration and an image subtraction module within the DICOM viewer, were used to assist the second neuroradiologist.
Results
Of the 255 participants enrolled in the SPRINT-MS trial, 244 patients provided 887 MRI paired scans for analysis. Their average age was 55.6 years, and they were evenly split between diagnoses of primary progressive MS and secondary progressive MS (Table 1). New/enlarging lesions were identified by SAM on 19.7% of paired scans and by the neuroradiologist on 5.2% (P < .001; Figure 1A). Of the paired scans, 37 (20% of the 185 paired scans with at least 1 lesion by either method) were identified as having new/enlarging lesions by both the human evaluation and the SAM. In 9 paired scans, the human detected 1 or 2 new/enlarging lesions that were not detected by the SAM. Overall, κ was 0.18 (poor agreement), with 81% raw agreement across the 887 paired scans. Of the 887 MRI paired scans, 702 had no new/enlarging lesions by either method.
Among the 185 paired scans with 1 or more lesions identified by either method, the mean number of new/enlarging T2 lesions per scan was 3.4 (median = 2) when identified by SAM and 0.4 (median = 0) on human evaluation (P < .001). Within this subset, κ was –0.06, indicating poor agreement, with a raw agreement of 7%. Analysis was also conducted by categorizing paired scans as active (2 or more lesions identified by either method) or inactive (fewer than 2 lesions). Of the paired scans, 791 (89.1%) were identified as inactive by both methods, but only 11 (1.2%) were identified as active by both methods. κ was 0.18, indicating poor agreement, with a raw agreement of 90.4%.
To explore the effect of small lesions, we excluded lesions identified by the SAM with volumes of less than 50 mm³ (approximately 17 voxels on 3-mm–thick slices). After excluding these small lesions, SAM identified new/enlarging lesions in 13.9% of paired scans, compared with 5.2% by the human reader (P < .001, Figure 1B). When all scan intervals were considered, κ increased to 0.27 (raw agreement, 87%), but when only paired scans with 1 or more lesions were included, it remained –0.06 (raw agreement, 13%). Much of the discordance was driven by scans with very high numbers of SAM lesions (Figure 2). The heat map analysis also shows that substantially fewer lesions were identified by the human method. When the lesion volume cutoff was 14 mm³, SAM identified new/enlarging lesions in 17.5% of paired scans.
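The sensitivity analysis above amounts to a simple volume filter over each scan's lesion list. A sketch follows; the 1 × 1 × 3 mm voxel size comes from the acquisition described in the Methods, and the example lesion volumes are made up:

```python
VOXEL_VOLUME_MM3 = 1 * 1 * 3  # native T2 resolution: 1 x 1 x 3 mm

def detection_rate(lesion_volumes_per_scan, min_volume_mm3=0):
    """Fraction of paired scans with at least 1 lesion at or above the
    volume cutoff; each inner list holds one scan's lesion volumes (mm^3)."""
    positive = sum(
        any(v >= min_volume_mm3 for v in volumes)
        for volumes in lesion_volumes_per_scan
    )
    return positive / len(lesion_volumes_per_scan)

# Hypothetical volumes for 4 paired scans; 50 mm^3 is ~17 voxels here.
scans = [[12.0, 20.0], [60.0, 9.0], [], [45.0]]
```

With no cutoff, `detection_rate(scans)` counts 3 of the 4 scans as positive; raising the cutoff to 50 mm³ keeps only the scan containing a 60 mm³ lesion, mirroring how the cutoff lowered the SAM detection rate from 19.7% to 13.9% in this study.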
Unblinded neuroradiologist review of select discordant paired scans found that more lesions were visible with advanced radiology tools and that the number of lesions was similar to that identified by SAM, including in 3 paired scans where the initial neuroradiologist review detected zero new/enlarging lesions (Table 2). This qualitative assessment found that lesions were more likely to be missed on scans with very high total lesion load (data not shown).
Discussion
We compared new/enlarging T2 lesion identification by a human reader and by a semiautomated method and found poor agreement (κ = 0.18). Overall, SAM identified 3.7-fold more paired scans with new/enlarging T2 lesions than the human reader, and 8.5-fold more lesions per scan among scans with new/enlarging lesions. This discordance became less pronounced when small lesions were excluded from the SAM analysis, suggesting that the human reviewer overlooked smaller lesions that the automated software was able to detect. Nonetheless, κ remained poor (–0.06) when the analysis was restricted to paired scans with a new lesion detected by either method, and it was also poor (κ = 0.18) when paired scans were categorized as active (2 or more new/enlarging lesions) or inactive (fewer than 2 new/enlarging lesions).
These observations have clinical relevance because drug efficacy is typically assessed through clinical trials, which usually use semiautomated lesion identification, whereas clinical practice usually relies on a radiologist. Clinicians should therefore recognize that the frequency of new/enlarging lesions derived from clinically read MRIs in routine practice may be much lower than that reported in clinical trials.
Our study used MRI readings conducted by a board-certified neuroradiologist in a busy clinical practice setting, thus reducing variability that may have arisen from multiple human raters.8,9 Using multiple human reviewers in future studies may better replicate the reality of clinical practice. Our qualitative assessment of 7 discordant paired scans by a second neuroradiologist who was aware of the discordance found more lesions when using advanced radiology tools available on the DICOM viewer. The discordance was greatest in patients with advanced disease, demonstrating the increased difficulty of using the unaided eye to find new lesions against a background of numerous preexisting lesions. This challenge is compounded because 2 scans separated in time are typically acquired in dissimilar slice planes, leaving only a portion of a single image slice in common between the paired scans.
Advanced imaging preprocessing tools can aid and augment manual review. Advanced imaging modules can perform a rigid coregistration between 2 studies by interpolating a volumetric scan so common imaging planes are presented for side-by-side review. These advanced tools also often enable the viewer to rapidly alternate (or flicker) between coaligned scans, which improves the human eye’s ability to detect differences. Similarly, there are tools that can subtract aligned images, further enabling the visual appreciation of differences.
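The subtraction step can be sketched in a few lines. This assumes the 2 slices have already been rigidly coregistered and intensity-normalized, which real DICOM-viewer modules handle; the arrays and threshold here are purely illustrative:

```python
def subtract_slices(follow_up, reference, threshold=0):
    """Voxelwise difference of 2 coaligned image slices (2D lists of
    intensities); differences above the threshold flag candidate new
    lesions, and everything else is zeroed out."""
    return [
        [(f - r) if (f - r) > threshold else 0
         for f, r in zip(f_row, r_row)]
        for f_row, r_row in zip(follow_up, reference)
    ]

# A bright new voxel in the follow-up scan stands out after subtraction.
reference = [[10, 10], [10, 10]]
follow_up = [[10, 42], [10, 10]]
```

Here `subtract_slices(follow_up, reference)` returns `[[0, 32], [0, 0]]`, isolating the changed voxel; unchanged anatomy cancels out, which is why subtraction (like flicker review) makes new lesions easier for the eye to find.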
These findings are consistent with previous research that found a discrepancy between a neuroradiologist and a different semiautomated detection software in classifying treatment response on MRI.8 The efficacy of such software and tools should also be rigorously and systematically studied to assess their utility. Further, the advent of artificial intelligence has led to the development of even more sophisticated software for the detection of MS lesions and the analysis of MS progression on MRI, some of which has been demonstrated to be superior to human review.10-12 Our study examined the sum of new and enlarging lesions, as it was a study end point in the trial that provided the dataset, but a further study examining discordance in new lesion detection separately from enlarging lesion detection may yield further insights.
Our study is limited by the fact that our participants all had progressive MS, which has lower event rates of new/enlarging T2 lesions compared with relapsing MS: Only 19.7% of paired scans had 1 or more new/enlarging lesions by semiautomated detection, and only 5.2% had 1 or more by human detection. The detection agreement in people with relapsing MS may be higher due to more numerous and larger new/enlarging T2 lesions. However, the degree of discordance observed even in this population was substantial, underscoring its clinical importance. We only used 1 SAM approach and 1 neuroradiologist, so the generalizability of these observations is unknown. We did not use a consensus of experts for lesion detection, and the human review did not use preprocessing techniques (eg, signal intensity standardization, coregistration). Although some of the discordance could be driven by the SAM detecting artifacts, expert reviewers verified SAM lesion identification, which makes this explanation for the discordance less likely.
Conclusions
We observed a large discordance between new/enlarging lesions identified by a human neuroradiologist and by a semiautomated lesion identification software. Although small lesions contributed to the discordance, the discordance remained even after the exclusion of small lesions. These findings have clinical relevance when applying new/enlarging lesion results from clinical trials to routine clinical practice and in the use of MRI for monitoring anti-inflammatory therapies.