Prediction of readmission in patients with acute exacerbation of chronic obstructive pulmonary disease within one year after treatment and discharge

Background To investigate the risk factors for readmission within one year in patients with acute exacerbation of chronic obstructive pulmonary disease (AECOPD), and to construct a logistic model and an extreme gradient boosting (XGBoost) model and compare their predictive performance. Methods In total, 636 patients with AECOPD were recruited and divided into a readmission group (n = 449) and a non-readmission group (n = 187). A backward stepwise regression method was used to analyze the risk factors for readmission. Data were divided into a training set and a testing set at a ratio of 7:3. Variables with statistical significance were included in the logistic model and variables with P < 0.1 were included in the XGBoost model, and receiver operating characteristic (ROC) curves were plotted. Results Acute exacerbations within the previous 1 year [odds ratio (OR) = 4.086, 95% confidence interval (CI) 2.723–6.133, P < 0.001], long-acting β agonist (LABA) application (OR = 4.550, 95% CI 1.587–13.042, P = 0.005), inhaled corticosteroids (ICS) application (OR = 0.227, 95% CI 0.076–0.672, P = 0.007), alanine aminotransferase (ALT) level (OR = 0.985, 95% CI 0.971–0.999, P = 0.042), and total COPD Assessment Test (CAT) score (OR = 1.091, 95% CI 1.048–1.136, P < 0.001) were associated with the risk of readmission. The area under the curve (AUC) of the logistic model was 0.743 (95% CI 0.692–0.795) in the training set and 0.699 (95% CI 0.617–0.780) in the testing set. The AUC of the XGBoost model was 0.814 (95% CI 0.812–0.815) in the training set and 0.722 (95% CI 0.720–0.725) in the testing set. Conclusions The XGBoost model showed better predictive value than the logistic regression model for the risk of readmission within one year in AECOPD patients. The findings of our study might help identify patients with a high risk of readmission within one year and provide timely treatment to prevent the recurrence of AECOPD.
Supplementary Information The online version contains supplementary material available at 10.1186/s12890-021-01692-3.

Organization [2]. A national cross-sectional study in 2018 investigated the lung health status of adults > 20 years old in 10 provinces of China and showed that the prevalence of COPD was 8.6% in adults > 20 years old and as high as 13.7% in adults > 40 years old, causing a significant disease burden [3]. Acute exacerbation of COPD (AECOPD) refers to the aggravation of respiratory symptoms and is the main reason for hospitalization and medical expenditure in COPD patients [1,4]. Approximately 63% of COPD patients have at least one readmission due to exacerbation within 1 year after hospitalization [5,6]. AECOPD accelerates the progression of the disease, reduces the quality of life of patients and increases the risk of death [7]. Early identification of patients at high risk of AECOPD and readmission, together with timely interventions to reduce their incidence, is of great clinical significance for improving the prognosis of COPD patients and delaying the progression of the disease.
Several previous studies explored the risk factors associated with AECOPD and readmission and revealed that gender, hospital stay, medical aid care, and duration of systemic steroid use were factors leading to AECOPD and readmission [8]. Factors including age, tobacco use, diabetes mellitus, infections, obesity, and frequency of hospital visits were also reported to influence the occurrence of AECOPD and readmission [9]. Currently, there is no internationally accepted model for predicting the readmission of AECOPD patients within one year after discharge. Existing prediction models of readmission in patients with AECOPD were established based on data from the USA or the UK, and some of them focused on predicting the risk of readmission within 30 days based on social factors or the LACE index (length of stay, acuity of admission, co-morbidities, and emergency department visits within the last 6 months) [10,11]. Additionally, prediction models for readmission of AECOPD patients within 90 days were established based on the PEARL score (previous admissions, eMRCD score, Age, Right-sided heart failure and Left-sided heart failure) or the COPD-2-HOME score (CAT score, hyperinflation, obstruction, prior admission, eosinophilia) [12,13]. Another prediction model of readmission within 90 days considered the importance of multimorbidity, frailty and poor socioeconomic status [14]. Njoku et al. [15] indicated that the prevalence of COPD-related readmission was about 2.6-82.2% within 30 days, 11.8-44.8% at 31-90 days, 17.9-63.0% at 6 months, and 25.0-87.0% at 12 months post-discharge, which suggests that it is important to predict the readmission of AECOPD patients not only within 30 or 90 days but also within one year. A prediction model for one-year readmission of COPD patients has been established, but it had a low area under the curve (AUC) value and lacked validation of the results [16].
There is no model for predicting the readmission of AECOPD patients within one year after discharge based on data from the Chinese population. Establishing such a prediction model in China is therefore of great value.
Gradient boosting machine (GBM) is a machine learning algorithm that assembles weak learners into a strong learner. GBM improves the performance of the prediction model step by step through a gradient descent process. The extreme gradient boosting (XGBoost) model is an extension of GBM, which combines several learners to achieve better predictive performance than any of the constituent learners alone [17]. XGBoost applies a second-order Taylor expansion to the loss function, using both the first and the second derivatives. Additionally, a regularization term is added to the objective function to improve generalizability and limit the complexity of each tree [18]. The XGBoost model is widely used for disease diagnosis and prediction because of its speed and good classification performance.
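The second-order approximation described above can be written out as follows. This is a sketch in the notation commonly used for XGBoost [18]; the symbols $g_i$, $h_i$, $T$, $w$, $\gamma$ and $\lambda$ are introduced here for illustration and do not appear elsewhere in this article. At boosting round $t$, a new tree $f_t$ is added to the previous prediction $\hat{y}_i^{(t-1)}$, and the objective is approximated by:

```latex
\begin{align}
\mathcal{L}^{(t)} &= \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \left[ l\left(y_i,\ \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}, \\
\Omega(f) &= \gamma T + \tfrac{1}{2}\, \lambda\, \lVert w \rVert^2 .
\end{align}
```

Here $T$ is the number of leaves of the tree and $w$ is the vector of leaf weights, so the regularization term $\Omega$ penalizes both large trees and large leaf values.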
In our study, we collected the data of 650 patients with AECOPD from the Second Affiliated Hospital of Nanjing Medical University from Jan. 2016 to Dec. 2019 to investigate the risk factors for readmission within one year after treatment and discharge, and constructed an XGBoost model and a logistic regression model to compare their predictive performance.

Study population
In the current study, 650 patients with AECOPD were recruited from the Second Affiliated Hospital of Nanjing Medical University between Jan. 2016 and Dec. 2019. The data of the patients were retrospectively extracted from a broad search of coding records and a review of COPD assessments routinely completed by clinicians or nurses. After excluding 12 patients who were readmitted because of pneumonia, 1 patient who was readmitted because of congestive heart failure and 1 patient who was readmitted because of perianal condyloma acuminatum, 636 participants were finally included. A hospitalization for AECOPD was identified through the International Classification of Diseases-10 code J44.1 [19]. Readmission within one year after discharge was defined as readmission occurring within one year from the day of the first discharge [20]. All subjects were divided into a readmission group (n = 187) and a non-readmission group (n = 449).
This study was approved by the Ethics Committee of the Second Affiliated Hospital of Nanjing Medical University (approval No. [2021]-KY-091-01). The screening process of the participants is shown in Fig. 1.

Definitions of the variables
Congestive heart failure was defined as the presence of symptoms and/or signs of heart failure together with a left ventricular ejection fraction (LVEF) < 40%, or LVEF ≥ 40% with elevated brain natriuretic peptide and at least one of the following: (1) left ventricular hypertrophy and/or left atrial enlargement; (2) abnormal diastolic function.
Diabetes was defined as a blood glucose level ≥ 11.1 mmol/L at any time after a meal, a fasting blood glucose level ≥ 7.0 mmol/L, a blood glucose level ≥ 11.1 mmol/L in the 2-h glucose tolerance test, or glycosylated hemoglobin ≥ 6.3%.
Fig. 1 The screening process of the participants in this study
Hypertension was defined as systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg when blood pressure was measured three times on different days without the use of antihypertensive drugs. Patients with a history of hypertension who were currently taking antihypertensive drugs were diagnosed with hypertension even if their blood pressure was lower than 140/90 mmHg.
Smoking status was classified as never smoking, former smoking or current smoking. Never smoking was defined as having smoked fewer than 100 cigarettes in a lifetime, and former smoking as smoking cessation for more than 1 year. Smoking exposure was expressed in pack-years: pack-years = number of packs smoked per day (20 cigarettes are counted as 1 pack) × years of smoking [21].
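The smoking definitions above can be expressed as a short calculation. The following is a minimal sketch; the function and argument names are hypothetical, and the handling of patients who quit less than 1 year ago (counted here as current smokers) is an assumption that the definitions above do not state explicitly.

```python
def pack_years(cigarettes_per_day, years_smoked):
    """Pack-years = packs per day (20 cigarettes = 1 pack) x years of smoking."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return (cigarettes_per_day / 20.0) * years_smoked


def smoking_status(lifetime_cigarettes, years_since_quitting=None):
    """Classify smoking status; years_since_quitting=None means still smoking."""
    if lifetime_cigarettes < 100:
        return "never"
    if years_since_quitting is not None and years_since_quitting > 1:
        return "former"
    # Assumption: quitting for 1 year or less is still counted as current smoking.
    return "current"


# Example: 30 cigarettes per day for 20 years is 1.5 packs/day x 20 years.
example_exposure = pack_years(30, 20)
```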

XGBoost model
The XGBoost model is an ensemble learning algorithm based on the gradient-boosted tree algorithm. It handles sparse data via a sparsity-aware learning algorithm and uses a weighted quantile sketch for approximate tree learning [22].

Statistical analysis
All statistical tests were two-sided. Normally distributed measurement data were described as mean ± standard deviation (mean ± SD), and the independent-samples t test was applied for comparisons between groups. Non-normally distributed data were expressed as median and interquartile range [M (Q1, Q3)], and differences between groups were compared by the Mann-Whitney U rank sum test. Enumeration data were shown as n (%), and the chi-square test or Fisher's exact test was used for comparisons between groups. Missing values were imputed by the random forest method with 100 trees via the missForest package in R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria) [23]. Sensitivity analysis was performed before and after imputation. To explore the risk factors for readmission in AECOPD patients within one year, the differences between groups were first analyzed, and the variables with statistical significance were included in the multivariate logistic model. A backward stepwise regression method was used to analyze the risk factors for readmission. For the establishment of prediction models, 70% of the samples were used as the training set for construction of the models, and 30% of the samples were used as the testing set to test the performance of the models [24,25]; an equilibrium analysis was conducted between the training set and the testing set. Variables with P < 0.1 were included in the logistic model and the extreme gradient boosting (XGBoost) model, and the parameters were tuned. After the models were established, the area under the curve (AUC), the Kolmogorov-Smirnov (KS) statistic, sensitivity, specificity and accuracy were used to evaluate model performance, and receiver operating characteristic (ROC) curves were plotted. SAS 9.4 and R 3.6 were employed for data analysis, and P < 0.05 was considered statistically significant.
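The evaluation measures named above (AUC, KS, sensitivity, specificity and accuracy) can be computed from predicted probabilities and observed outcomes as sketched below. This is an illustrative pure-Python version, not the SAS or R code used in this study; the function names, the 0.5 classification threshold and the toy data are assumptions made for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """KS = largest gap between true and false positive rates over all thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity and accuracy at a given probability threshold."""
    pairs = [(y, int(s >= threshold)) for y, s in zip(labels, scores)]
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy data (hypothetical): 1 = readmitted, 0 = not readmitted.
toy_labels = [1, 1, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ks = ks_statistic(toy_labels, toy_scores)
toy_metrics = confusion_metrics(toy_labels, toy_scores)
```

The pairwise AUC definition used here is quadratic in the sample size, which is acceptable for a cohort of a few hundred patients but would normally be replaced by a rank-based computation on larger data.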

Handling of missing data
Variables with more than 25% missing values were removed (most of them were discharge-related data, including arterial partial pressure of oxygen, arterial partial pressure of carbon dioxide, arterial oxygenation, pH, and medications at discharge), and the random forest method was used to impute missing values for the retained variables. The sensitivity analysis before and after imputation is shown in Table 1. No bias was observed after imputation of the missing data.
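The 25% missing-value rule can be sketched as a simple filter applied before imputation. The code below only illustrates the removal step; the record layout, variable names and values are hypothetical, and the random forest imputation itself (performed with missForest in R in this study) is not reproduced.

```python
def drop_sparse_variables(records, max_missing_ratio=0.25):
    """Remove variables whose share of missing values (None) exceeds the limit.

    records: list of dicts, one per patient. Returns (filtered_records, dropped_names).
    """
    if not records:
        return [], []
    variables = sorted({name for row in records for name in row})
    n = len(records)
    dropped = [
        v for v in variables
        if sum(1 for row in records if row.get(v) is None) / n > max_missing_ratio
    ]
    kept = [{k: val for k, val in row.items() if k not in dropped} for row in records]
    return kept, dropped


# Hypothetical toy records: pao2_discharge is missing in 2 of 4 patients (50%),
# cat_score in 1 of 4 (25%, which does not exceed the limit).
toy_records = [
    {"age": 71, "cat_score": 18, "pao2_discharge": None},
    {"age": 65, "cat_score": None, "pao2_discharge": 78.0},
    {"age": 80, "cat_score": 25, "pao2_discharge": None},
    {"age": 74, "cat_score": 12, "pao2_discharge": 82.5},
]
filtered_records, dropped_variables = drop_sparse_variables(toy_records)
```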

Comparisons of baseline data between readmission group and non-readmission group
As exhibited in Table 2, the age (72.21 years vs.

Risk factors of readmission in patients with AECOPD within one year
Variables with statistical significance in comparisons of the baseline data were included in the multivariate logistic regression analysis. Backward stepwise regression   Table 3).
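The odds ratios and confidence intervals reported for the logistic regression are related to the underlying regression coefficients by OR = exp(β) and 95% CI = exp(β ± 1.96 × SE). The sketch below back-calculates β and SE from a reported OR and CI and then rebuilds the interval; the function names are hypothetical, and the coefficient and standard error obtained this way are approximations derived from the rounded published values, not figures taken from the study output.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def coef_from_or(or_value, ci_low, ci_high, z=Z_95):
    """Recover the log-odds coefficient and its standard error from OR and CI."""
    beta = math.log(or_value)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def or_with_ci(beta, se, z=Z_95):
    """Odds ratio and 95% confidence interval from a coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Reported values for acute exacerbation within the previous 1 year:
# OR = 4.086, 95% CI 2.723-6.133.
beta_ae, se_ae = coef_from_or(4.086, 2.723, 6.133)
or_ae, low_ae, high_ae = or_with_ci(beta_ae, se_ae)
```

An OR above 1 (β > 0) indicates higher odds of readmission, and an OR below 1 (β < 0), as reported for ICS application, indicates lower odds.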

The equilibrium test of training set and testing set
All samples were randomly divided into the training set and the testing set at a ratio of 7:3. The equilibrium analysis after division showed no statistically significant differences in the variables between the training set and the testing set (Table 4).
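A random 7:3 division of this kind can be sketched as follows. The function name and the fixed seed are assumptions made so that the example is reproducible; the random number generator and seed actually used in this study are not reported.

```python
import random


def random_split(n_samples, train_fraction=0.7, seed=42):
    """Shuffle sample indices and cut them into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# With 636 participants, a 7:3 split gives 445 training and 191 testing samples.
train_idx, test_idx = random_split(636)
```

After such a split, comparing the distribution of each variable between the two index sets corresponds to the equilibrium analysis described above.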

Construction of logistic model and validation of the predictive value via the testing set
Variables with statistical differences were included in the logistic model. The stepwise backward method was used, and age and gender were included. The results were shown in Table 5 Table 6).

Construction of XGBoost model and validation of the predictive value via the testing set
Variables with P < 0.1 were entered into the XGBoost model, and age and gender were also included. After tuning with GridSearchCV, the optimal parameters of the model were: tree depth, 2; number of trees, 50; learning rate, 0.01. The weight method, which counts the number of split nodes that use each variable across the model trees, was used to evaluate the importance of variables. The results showed that acute exacerbation in the previous 1 year, the CAT score, and SABA and LABA application were the more important variables in the XGBoost model (Fig. 3).
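The weight method of variable importance can be illustrated by counting split nodes directly. The sketch below represents each tree as nested dictionaries; this representation, the feature names and the three toy trees are hypothetical and are not the fitted model. In the xgboost Python package, the same quantity is reported when the importance type is set to "weight".

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count the split (internal) nodes of one tree per feature."""
    if "feature" not in node:  # a leaf carries only a prediction value
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def weight_importance(trees):
    """Weight importance = number of times each feature is used to split."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return dict(counts)


leaf = {"value": 0.0}
# Three toy trees with a maximum depth of 2, matching the tuned tree depth.
toy_trees = [
    {"feature": "prior_exacerbation",
     "left": {"feature": "cat_score", "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": "cat_score",
     "left": leaf,
     "right": {"feature": "laba", "left": leaf, "right": leaf}},
    {"feature": "prior_exacerbation", "left": leaf, "right": leaf},
]
toy_importance = weight_importance(toy_trees)
```

Because this measure counts splits rather than the gain contributed by each split, a variable used often in shallow trees can rank highly even when each individual split improves the fit only slightly.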

Comparisons of the predictive abilities of the two prediction models
The logistic model and the XGBoost model were both used to establish prediction models, and their performance was compared. The AUC value of the logistic

Discussion
This study collected the data of 650 patients with AECOPD, evaluated the risk factors for readmission within one year after treatment and discharge, and constructed a logistic model and an XGBoost model to predict the risk of readmission within one year in AECOPD patients. The data revealed that acute exacerbation within the previous 1 year, LABA application, and the total CAT score were risk factors for readmission within one year, while ICS application and a higher ALT level were protective factors. Additionally, we compared the predictive performance of the logistic model and the XGBoost model, and the XGBoost model showed better predictive value.
Previously, a history of exacerbation was reported to be an independent predictor of future exacerbations in patients [26]. A study by Bernabeu-Mora et al. indicated that the number of hospitalizations due to exacerbations in the previous year increased the risk of readmission by 4.44 times [27]. These results support the findings of this study, in which patients with acute exacerbations within the previous 1 year had 4.086-fold higher odds of readmission than those without. The CAT is a simple and quick questionnaire for measuring the severity and impact of symptoms in COPD patients and for determining the appropriate treatment in clinical practice [28]. Multiple studies have revealed that the CAT score has good internal consistency and test-retest reliability for both stable and exacerbating COPD [29]. CAT scores were higher in patients with a history of frequent exacerbations [30]. Herein, higher CAT scores were associated with a higher risk of readmission within one year.
This may be because higher CAT scores were correlated with higher concentrations of serum C-reactive protein and plasma fibrinogen, demonstrating that systemic inflammation was more serious in patients with higher CAT scores [31]. For AECOPD patients with a high CAT score, timely intervention should be provided, and regular follow-up should be conducted after discharge to prevent readmission. At present, the aim of treatment for COPD patients is to prevent the deterioration of lung function and alleviate symptoms to decrease the risk of exacerbations [32], and short-acting and long-acting bronchodilators (LAMA, LABA, SAMA and SABA) as well as ICS are frequently applied in the treatment of COPD [33]. SABA is often applied on an as-needed basis for symptom relief in COPD patients [34]. However, previous studies also identified that the application of SABA might carry a higher risk of readmission than other therapies, including arformoterol tartrate [35,36]. In our study, SABA was identified as an important variable influencing the risk of readmission in AECOPD patients within one year after discharge. ICS application reduces the frequency and severity of exacerbations [37]. In the present study, patients receiving ICS treatment had a lower risk of readmission than patients without ICS treatment, which was supported by previous studies. For patients with AECOPD, ICS treatment should be appropriately provided to prevent readmission [38]. Another study also reported that the incidence of adverse events was higher in patients with LABA treatment (50.2%) than in patients without LABA treatment, and exacerbation of COPD was the most commonly reported adverse event [39]. This may increase the readmission of patients, which supports the results of our study showing that LABA treatment may be associated with a higher risk of readmission [34,35].
For AECOPD patients, LABA should be applied with caution. This study assessed the predictors of readmission in patients with AECOPD within one year and established two prediction models, a logistic model and an XGBoost model. The random forest method was used to impute missing values by constructing multiple decision trees, and the imputed data retain randomness and uncertainty, which can better reflect the real distribution of the unknown data. The random forest method is well suited to imputing high-dimensional data because each branch node selects a random subset of features instead of all features during decision tree construction. This helped ensure the accuracy and reliability of the data after imputation. The predictive value of the models was validated in the testing set. Because of the small sample size in our study, the training set included 70% of the samples and the testing set included 30%. This split provided enough samples for the construction of more reliable models while still leaving samples to validate the performance of the models. ROC curves were drawn to display the results of the respective models. The AUC value of the logistic model was good in the training set but poor in the testing set, while the XGBoost model showed good predictive ability in both the training set and the testing set. This indicated that the XGBoost model was better than the logistic model in predicting the risk of readmission in patients with AECOPD within one year. Currently, the prediction of the risk of readmission in patients with AECOPD has focused on 30 days and 90 days after treatment and discharge [10-12,40], but 30 days or 90 days cannot fully represent the progression of the disease. The readmission of AECOPD patients within one year indicates the long-term prognosis of patients.
A previous study established a prediction model for one-year readmission of COPD patients, but the AUC value was only 0.703 and the results were not validated [16]. In contrast to the former prediction models, we compared a logistic model and an XGBoost model and found that the AUC values of the XGBoost model were higher in both the training set and the testing set. The variables involved in our model are commonly collected by clinicians, and based on these variables, the XGBoost model can quickly predict the probability of readmission in AECOPD patients within one year after discharge. In addition, our prediction model was uploaded to GitHub with free access for everyone (https://github.com/shipingchenmedicine123/XGBoost-model). The instructions for using the model are shown in Additional file 1: File 1. We welcome more clinicians to use our model to validate the results of our study. The findings of the current study might help the early identification of patients with a high risk of relapse and readmission, especially patients with moderate exacerbations, and support timely interventions to reduce the incidence of AECOPD and readmission, improve the prognosis of COPD patients and delay the progression of the disease. The strengths of this study were that missing data were handled without introducing bias, so the results might be more reliable, and that internal validation was conducted to verify the results. There were several limitations in our study. Firstly, the sample size was small and collected from a single center, which might decrease the statistical power, especially for variables with limited samples, such as SABA usage. Therefore, the results of our study should be interpreted with caution. Secondly, external validation of the findings was not performed. Thirdly, subgroup analysis was not performed on patients with mild exacerbations, moderate exacerbations and severe

Conclusions
In the current study, we constructed two models to predict the risk of readmission within one year in AECOPD patients based on predictors including acute exacerbation within the previous 1 year, LABA application, ICS application, ALT level and the total CAT score. The XGBoost model showed better predictive value than the logistic model for the risk of readmission within one year in AECOPD patients. Acute exacerbation in the previous 1 year, the CAT score, and SABA and LABA application were the more important variables in the XGBoost prediction model. The findings of our study might help identify AECOPD patients with a high risk of readmission within one year and provide timely interventions and treatment to prevent the recurrence of AECOPD.