Deep learning computer-aided detection system for pneumonia in febrile neutropenia patients: a diagnostic cohort study

Background Diagnosis of pneumonia is critical in managing patients with febrile neutropenia (FN), however, chest X-ray (CXR) has limited performance in the detection of pneumonia. We aimed to evaluate the performance of a deep learning-based computer-aided detection (CAD) system in pneumonia detection in the CXRs of consecutive FN patients and investigated whether CAD could improve radiologists’ diagnostic performance when used as a second reader. Methods CXRs of patients with FN (a body temperature ≥ 38.3 °C, or a sustained body temperature ≥ 38.0 °C for an hour; absolute neutrophil count < 500/mm3) obtained between January and December 2017 were consecutively included, from a single tertiary referral hospital. Reference standards for the diagnosis of pneumonia were defined by consensus of two thoracic radiologists after reviewing medical records and CXRs. A commercialized, deep learning-based CAD system was retrospectively applied to detect pulmonary infiltrates on CXRs. For comparing performance, five radiologists independently interpreted CXRs initially without the CAD results (radiologist-alone interpretation), followed by the interpretation with CAD. The sensitivities and specificities for detection of pneumonia were compared between radiologist-alone interpretation and interpretation with CAD. The standalone performance of the CAD was also evaluated, using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Moreover, sensitivity and specificity of standalone CAD were compared with those of radiologist-alone interpretation. Results Among 525 CXRs from 413 patients (52.3% men; median age 59 years), pneumonia was diagnosed in 128 (24.4%) CXRs. In the interpretation with CAD, average sensitivity of radiologists was significantly improved (75.4% to 79.4%, P = 0.003) while their specificity remained similar (75.4% to 76.8%, P = 0.101), compared to radiologist-alone interpretation. The CAD exhibited AUC, sensitivity, and specificity of 0.895, 88.3%, and 68.3%, respectively. The standalone CAD exhibited higher sensitivity (86.6% vs. 75.2%, P < 0.001) and lower specificity (64.8% vs. 75.4%, P < 0.001) compared to radiologist-alone interpretation. Conclusions In patients with FN, the deep learning-based CAD system exhibited radiologist-level performance in detecting pneumonia on CXRs and enhanced radiologists’ performance. Supplementary Information The online version contains supplementary material available at 10.1186/s12890-021-01768-0.


Background
Febrile neutropenia (FN) is observed in approximately 1% of patients receiving chemotherapy and has a mortality rate of approximately 10% [1][2][3][4][5]. Therefore, it is considered as a medical emergency that requires timely diagnosis and management [1,5]. Investigation of the infection focus is the key component of diagnostic workups in these patients, and pneumonia is one of the most common and important causes of FN [6][7][8][9]. Indeed, pneumonia is associated with a higher rate of treatment failure, longer hospitalization, and increased mortality in FN patients [1][2][3]10].
Chest X-ray (CXR) is the initial radiological examination for the evaluation of pneumonia at low cost and wide availability and thus considered a routine investigation in FN patients [1,[11][12][13]. However, a poor immune response may delay the development of radiographic infiltrates, causing difficulty in the sensitive detection of early pneumonia on CXR [14][15][16]. Furthermore, CXRs have high inter-observer variability among less-experienced readers [17,18], and timely interpretation of CXRs by expert radiologists is not always possible. In this regard, a computer-aided detection (CAD) system that can identify pulmonary infiltrates suggestive of pneumonia on CXRs may help managing FN patients.
Recently, deep learning technologies showed promising performance regarding the detection of pneumonia on CXRs [19][20][21][22]. However, those studies tested the performance of CAD using conveniently collected datasets, with enriched prevalence of pneumonia [19][20][21][22] and limited radiologic diversity (i.e., composed of CXRs with pneumonia and clearly normal CXRs) [19][20][21], which is definitely far from the situation in the real-world setting. Thus, it remains difficult to believe that deep learningbased CAD truly proved their clinical usefulness in pneumonia detection [22]. In this context, external validation in consecutive patients in a real clinical setting is necessary to confirm the real-world accuracy and clinical usefulness of deep learning-based CAD.
Therefore, we evaluated the performance of deep learning-based CAD in pneumonia detection in the CXRs of consecutive FN patients and investigated whether CAD could improve radiologists' diagnostic performance when used as a second reader.

Methods
This single-center, retrospective study was approved by the institutional review board. The requirement for patients' informed consents was waived.

Study population
We retrospectively included patients from a single tertiary referral hospital if they met the following inclusion criteria: (a) patients with a body temperature ≥ 38.3 °C or a sustained body temperature ≥ 38.0 °C for an hour between January and December 2017 [13]; (b) patients with absolute neutrophil count (ANC) < 500/mm 3 in peripheral blood on the day of fever [13]; and (c) patients who underwent CXRs on the day of fever.
In cases of multiple CXRs from a single patient, only the first CXR was included to represent the episode of FN. However, if there was an afebrile period of 30 days or longer between the days of fever, they were regarded as different episodes of FN and the first CXR in each episode was included.

CXR acquisition and CAD system
All CXRs were obtained using a fixed radiography system (Digital Diagnost, Philips Healthcare, Eindhoven, The Netherlands) or a portable radiography scanner (DRX-revolution, Carestream Health, Rochester, NY), depending on patients' condition. CXRs from a fixed radiography system were obtained in the erect position with posteroanterior projection, while CXRs from a portable radiography scanner were obtained in the supine or sitting position with anteroposterior projection.
We used a commercialized, regulatory-approved deep learning-based CAD (Lunit INSIGHT CXR 2, version 2.0.0.0, Lunit Inc., Seoul, Korea) to evaluate CXRs. The CAD was designed to detect radiographic abnormalities, including pulmonary nodules or masses, pulmonary infiltrates, and pneumothorax on a single frontal CXR. The CAD was initially trained using 54,221 normal CXRs and 35,613 abnormal CXRs including 6,903 CXRs from patients with pneumonia (see supplementary material for further information) [21].
The CAD provided probability scores for the presence of each target abnormality between 0 and 100% for a CXR. When the probability score was 15% or greater, the CAD also provided a heat map overlaid on the CXR for the localization of the abnormality (Fig. 1). In the present study, we used only probability scores for pulmonary infiltrate (see supplementary material for further information).
A CXR of a 68-year-old woman with febrile neutropenia (absolute neutrophil count, 366/mm 3 ; body temperature, 39.1 °C) showing increased opacity in the right upper lung (A, arrow). A chest computed tomographic scan obtained on the same day showing the corresponding consolidation and ground-glass opacity lesions in the upper lobe of the right lung, suggestive of pneumonia (B). The CAD system correctly identified the abnormality with a probability score of 84% (C). In the reader test, three of five radiologists correctly identified the pulmonary infiltrate on the CXR in the radiologist-alone interpretation, and all five radiologists identified the lesion in the interpretation with CAD.
A CXR of a 69-year-old man with febrile neutropenia (absolute neutrophil count, 489/mm 3 ; body temperature, 38.9 °C) showing subtle increased opacity in the right lower lung (D). A chest computed tomographic scan showing the corresponding consolidation and groundglass opacity lesions in the lower lobe of the right lung, suggestive of pneumonia (E). The CAD system correctly identified the abnormality with a probability score of 51% (F). In the reader test, two of five radiologists correctly identified the pulmonary infiltrate on the CXR in the radiologist-alone interpretation, and four radiologists including two radiologists who initially could not recognize the lesion correctly identified the lesion in the interpretation with CAD.

Reference standard
Two thoracic radiologists (C.M.P. and E.J.H., 21 and 9 years of experience in interpreting CXR and chest CT) who were blinded to the CAD results reviewed patients' medical records and radiological, microbial, and laboratory examination results to define, in consensus, the clinical diagnosis of pneumonia at the time of CXR. The presence of clinical features suggestive of respiratory infection with demonstrable pulmonary infiltrate by CXR or CT was regarded as a clinical diagnosis of pneumonia, regardless of the microbiological identification of the causative pathogen [23]. When a pulmonary infiltrate was identified on CT but invisible on CXR, such case was defined as positive for pneumonia based on CT findings.

Reader test
To compare the performance of CAD with that of radiologists and to evaluate whether CAD can enhance the performance of radiologists, we conducted a retrospective reader test. Five board-certified radiologists (general radiologists without subspecialty training for thoracic radiology, 1-3 years of experience after finishing residency) participated in the reader test, and 50% of CXRs among the entire cohort were randomly sampled for the test.
Each radiologist independently interpreted CXRs to classify them into those with or without pulmonary infiltrates suggestive of pneumonia. First, radiologists read the CXRs without CAD results (radiologist-alone interpretation). After finishing the first reading session for all CXRs, they re-interpreted the CXRs with the corresponding CAD results and were allowed to change their initial decision as needed (interpretation with CAD). The radiologists were informed that all CXRs were obtained Fig. 1 Identification of pneumonia on chest X-ray (CXR) using the computer-aided detection (CAD) system from FN patients; however, they were blinded to other clinical or laboratory information.

Subgroup analyses
For a more solid reference standard for the presence of pulmonary infiltrates, we separately evaluated the performance of CAD in patients with available chest CT obtained within 3 days from the CXRs with reference to CTs for the presence of pulmonary infiltrates.
To investigate the performance of CAD in patients with different clinical characteristics, we evaluated the performance of CAD in the following subgroup populations: (a) male vs. female patients; (b) patients aged < 60 years vs. ≥ 60 years; (c) CXRs from a fixed radiography system vs. CXRs from a portable radiography scanner.

Statistical analyses
The discriminative performance of CAD (ability to separate CXRs with and without pneumonia) was evaluated using the area under the receiver operating characteristic curve (AUC). The sensitivity and specificity of CAD were determined at the pre-defined threshold (probability score of 15%) recommended by the manufacturer. A comparison of AUCs of CAD between subgroups was performed according to the method by DeLong [24], while the chi-square test was used to compare the sensitivities and specificities of CAD between subgroups.
To compare the performance between radiologists and CAD, the sensitivity and specificity of CAD at the pre-defined threshold were compared with those of individual radiologists using generalized estimating equations, to consider the clustering effects caused by multiple CXRs obtained from a single patient and multiple interpretations by radiologists for a single CXR [25]. The sensitivities and specificities of radiologists in radiologist-alone interpretation and interpretation with CAD were also compared to investigate whether or not CAD can enhance radiologists' detection performance.
Calibration (degree of agreement between the predicted probabilities by CAD versus the observed probability of pneumonia) of CAD was evaluated by constructing a calibration plot. Inter-reader agreement among radiologists was evaluated using the Fleiss' kappa coefficient.
All statistical analyses were performed using R (version 3.6.3, R project for statistical computing, Vienna, Austria). A P value of less than 0.05 was considered to indicate a statistically significant difference.

Demographic and clinical information
A total of 525 CXRs obtained from 413 patients (216 men and 197 women; median age, 59 years [interquartile range (IQR), 48-67]) were analyzed in the present study. Based on the information in medical records, pneumonia was diagnosed in 128 (24.4%) FN episodes. Detailed demographic and clinical information are described in Table 1.
Regarding calibration, CAD tended to overestimate the probability of pneumonia (Fig. 2). Data of the sensitivities and specificities of CAD at different thresholds are For all patients, the CAD system showed an area under the receiver operating characteristic curve of 0.892 for the identification of pneumonia in the chest X-ray. At a pre-defined threshold probability score of 15%, the sensitivity and specificity of the CAD system were 88.3% and 68.3%, respectively (A). The calibration plot (B) shows that the CAD system tended to overestimate the probability of pneumonia. The density plot of the probability score from the CAD system according to the presence of pneumonia (C) shows that the CAD system can relatively clearly separate CXRs with pneumonia from those without pneumonia.

Reader test
A total of 263 (50.1%) CXRs were randomly sampled for the reader test. Among them, 67 (25.5%) were obtained from patients with pneumonia. Data of the performances of CAD and each radiologist are listed in Table 2 and Fig. 3.
In the radiologist-alone interpretation, radiologists exhibited average sensitivity and specificity of 75.2% and 75.4%, respectively. At the pre-defined threshold, the CAD exhibited significantly higher sensitivity (86.6% vs. 75.2%, P < 0.001) and lower specificity (64.8% vs. 75.4%, P < 0.001) than radiologist-alone interpretation. Regarding individual radiologists, the CAD showed significantly higher sensitivity and specificity than four and one radiologist, respectively. Meanwhile, the CAD showed significantly lower specificity than four radiologists (Table 2).
In the interpretation with CAD, average sensitivity of radiologists was significantly improved (75.4% to 79.4%, P = 0.003) while their specificity remained similar (75.4% to 76.8%, P = 0.101), compared to radiologist-alone interpretation. Regarding individual radiologists, significant improvements in sensitivity and specificity were observed in two and one radiologist, respectively ( Table 2).
Analyses after exclusion of radiologists who showed outlier sensitivity and specificity exhibited similar results with those including all radiologists (Additional file 1: Table S2 in the supplementary material).
In the reader test, the CAD system showed an area under the receiver operating characteristic curve of 0.895 for the identification of pneumonia in the chest X-ray. At a pre-defined threshold probability score of 15%, the sensitivity and specificity of the CAD system were 86.6% and 64.8%, respectively. In the radiologist-alone interpretation, radiologists exhibited average sensitivity and specificity of 75.2% and 75.4%, respectively. In the interpretation with CAD, average sensitivity of radiologists was significantly improved (79.4%), at the similar specificity (76.8%).

Subgroup analyses
The discriminative performance of CAD in different subgroups is shown in Table 3 and Fig. 4.
Chest CTs obtained within 3 days from the CXRs were available in 122 (23.2%) CXRs. Based on CT findings, pulmonary infiltrates suggestive of pneumonia were identified in 68 of 122 cases (55.7%). Among the pulmonary infiltrates, 57 were retrospectively visible on CXRs, while the other 11 were invisible on CXRs, even after correlation with chest CT findings. With the reference standard of chest CT, the CAD system exhibited an AUC of 0.852 (95% CI, 0.776-0.909). At the pre-defined threshold, CAD exhibited sensitivity and specificity of The CAD showed similar AUCs between men vs. women (0.892 vs. 0.899; P = 0.854), patients aged < 60 years vs. ≥ 60 years (0.906 vs. 0.867; P = 0.300), and CXRs from a fixed radiography system vs. CXRs from a portable radiography scanner (0.899 vs. 0.899; P = 0.831). However, the specificity of CAD was significantly higher in patients aged < 60 years than in those aged ≥ 60 years (79.8% vs. 52.7%; P < 0.001) and in CXRs from the fixed radiography system than in CXRs from the portable radiography scanner (79.5% vs. 63.1%; P = 0.001).
The CAD system exhibited a slightly lower area under the receiver operating characteristic curve (AUC: 0.852 vs. 0.917; P = 0.10), sensitivity (83.8% vs. 93.3%, P = 0.10), and specificity (59.3% vs. 69.7%, P = 0.13) in patients with a chest CT scan than in those without a chest CT scan, without a significant difference (A). The CAD system showed similar AUCs between men and women (0.892 vs. 0.899; P = 0.854; B), patients aged < 60 years and ≥ 60 years (0.906 vs. 0.867; P = 0.300; C), and chest X-rays (CXRs) from a fixed radiography system and CXRs from a portable radiography scanner (0.899 vs. 0.899; P = 0.83; D). However, the specificity of the CAD system was significantly higher in patients aged < 60 years than in those aged ≥ 60 years (79.8% vs. 52.7%; P < 0.001; Table 2 Performances of the computer-aided detection system and radiologists in the reader test

Numbers in parentheses indicate numerators/denominators, 95% confidence intervals
The experience of each radiologist indicates the length of their experiences in the interpretation of chest X-rays CAD, computer-aided detection a Performance of the CAD system at the predefined threshold (probability score of 15%) b Comparison of performance between the CAD system and the radiologist-alone interpretation c Comparison of performance between radiologist-alone interpretation and interpretation with CAD  Fig. 3 Performance of the computer-aided detection (CAD) system and radiologists in the reader test C), and CXRs from a fixed radiography system than CXRs from a portable radiography scanner (79.5% vs. 63.1%; P = 0.001; D).

Discussion
In our study, the deep learning-based CAD system showed good performance (sensitivity, 88.3%; specificity, 68.3%) in the identification of CXRs with pneumonia in consecutive febrile neutropenia patients. When compared with radiologists, the CAD system exhibited significantly higher sensitivity (86.6% vs. 75.2%) but lower specificity (64.8% vs. 75.4%). In the interpretation assisted by the CAD system, radiologists exhibited improvement in sensitivity (75.2% to 79.4%) at similar specificity (75.4% to 76.8%). Symptoms suggestive of respiratory infection and pulmonary infiltrates on CXR or CT are the two main pillars for diagnosing pneumonia [23]. In patients with neutropenia, however, symptoms of pneumonia may not be apparent due to their compromised host immune responses [8,11]; thus, diagnosis of pneumonia through imaging can be particularly critical [11,12]. However, CXRs are reported to have limited diagnostic performance for pneumonia in FN patients [14,15]. Furthermore, interpretation of CXR by an expert radiologist may not be readily available for timely management. In this regard, CAD may help in the accurate and timely diagnosis of pneumonia in these patients. Indeed, CAD used in our study could identify pneumonia on CXRs with higher sensitivity than radiologists (higher sensitivity than four of five radiologists at the pre-defined threshold). High sensitivity of the CAD in a consecutively-collected CXRs reflecting the prevalence of pneumonia and the spectrum of CXR findings of the actual clinical situation suggests this CAD can accurately identify pneumonia among CXRs of febrile neutropenia patients in the situation of daily practice.
Contrarily, CAD showed lower specificity than radiologist at the predefined threshold. The results suggest that the standalone interpretation by CAD without modification of the threshold may produce more false-positive results than radiologists' interpretation. However, sensitivity may have priority over specificity for diagnosing pneumonia in FN patients, considering that pneumonia is frequently overlooked on CXR [11,12,26] and is associated with poor prognosis in FN [2,3]. At the same sensitivity level as that of radiologists' interpretation, CAD exhibited higher specificity than most radiologists (four of five). The result suggests that adjustment of the threshold for positive results can improve specificity while maintaining radiologist-level sensitivity or higher.
Calibration of a deep learning-based algorithm is an important factor for its interpretability and credibility, but this aspect is frequently overlooked [27]. In our study, we found that CAD tended to overestimate the probability of pneumonia (Fig. 2B). The same CAD system exhibited a similar tendency of overestimation for the identification of referable abnormalities in CXRs obtained from patients in the emergency department [28]. Recalibration of the CAD result may correct the overestimation and improve the interpretability and credibility of CAD [28].
Assisting radiologists to enhance their performance is another key role of CAD [29]. In our study, average sensitivity of radiologists was significantly improved at similar specificity in the interpretation with CAD. The results suggest that CAD can help in improving the quality of radiologists' interpretation and management of patients. Preserved specificity while improving sensitivity suggests that false-positive detection by CAD can be appropriately arbitrated by radiologists. Improvement of inter-reader agreement in the interpretation with CAD suggests that the CAD can also reduce variability across radiologists in CXR interpretations. Prioritization of CXRs with high suspicion for critical finding is another potential application of CAD [30]. The relatively clear separation of CAD results between CXRs with and without pneumonia (Fig. 2C) suggests that such prioritization of CXR requiring early interpretation by a radiologist might be feasible with CAD. However, our retrospective study was not conducted for that purpose; further studies on this topic are warranted in the near future.
In patients with available chest CT as a firm reference standard for pneumonic infiltrates, CAD exhibited similar performance as that reported in a previous study, where the same CAD system was evaluated in patients suspected of having coronavirus disease (sensitivity 83.8% and specificity 59.3% in our study; sensitivity 81.5% and specificity 52.3% in the previous study) [31], suggesting the robustness of performance regardless of population characteristics. CAD exhibited slightly lower performances in patients who underwent chest CT than in patients who did not undergo chest CT although significant difference was not shown. This result is probably because chest CTs were obtained in patients with a questionable diagnosis of pneumonia or other causes of fever [1,11], while patients with a definite diagnosis of pneumonia on CXRs or those with clearly normal CXRs and no clinical suspicion of pneumonia may not have undergone CT.
Our study also showed that CAD missed a substantial proportion of pulmonary infiltrates (16.2%) on CXRs, which could be identified on CT. Although sensitivity improvement occurred in some radiologists, they remained below 90%. Therefore, the negative results by the CAD or radiologist may not necessarily indicate the absence of pneumonia, and CAD cannot be a substitute for chest CT. Chest CT should be performed in patients with persistent fever despite empirical antibiotics, or those with clinical suspicions of pneumonia but negative CXRs [1,11,32,33].
Consistency of performance in patients with different characteristics is an important factor for the reliability of CAD [29]. In our study, CAD exhibited similar performances between men and women. However, it exhibited lower specificities in older patients (age ≥ 60 years) than in younger patients, and CXRs obtained using a portable scanner than those obtained using a fixed radiography unit. Suboptimal image qualities of portable radiography and patient-related artifacts (e.g., limited inspiration and inappropriate positioning of patients) might have led to lower specificities. Previous studies reported that the identification of pulmonary infiltrates and differentiation from pleural effusion or atelectasis was difficult for human readers [34,35]. A higher prevalence of existing pulmonary abnormalities (e.g., scarring, interstitial fibrosis, and emphysema) may also have contributed to more false-positive detection by CAD. A higher rate of falsepositive detection should be considered when using CAD in clinical practice.
Our study has several limitations. First, because CAD was validated on a retrospectively collected dataset, the effect of CAD on the management and outcome of patients could not be evaluated. Second, since the reader test was conducted outside the daily practice, the reading behaviors of radiologists may be different from those in their daily practices. Third, since our study was performed in a single tertiary medical center, the generalizability of the results remains questionable. Finally, interpreting same CXRs twice in the reader test may have induced improvement of radiologists' performance in the second reading session, regardless of using the CAD. Therefore, it cannot be excluded that the added value of the CAD has been overestimated.

Conclusions
In conclusion, the deep learning-based CAD system exhibited radiologist-level performance in the identification of CXRs with pneumonia in consecutive FN patients. It enhanced the performance of radiologists for the identification of pneumonia. We believe that CAD system helps diagnose and manage pneumonia in FN patients.