Development and validation of a prospective study to predict the risk of readmission within 365 days of respiratory failure: based on a random survival forest algorithm combined with COX regression modeling

Background There is a need to develop and validate a widely applicable nomogram for predicting readmission of respiratory failure patients within 365 days. Methods We recruited patients with respiratory failure at the First People’s Hospital of Yancheng and the People’s Hospital of Jiangsu. We used least absolute shrinkage and selection operator (LASSO) regression to select significant features for multivariate Cox proportional hazards analysis. The Random Survival Forest (RSF) algorithm was used to build a model from the variables whose coefficients were shrunk to 0 by LASSO regression and to derive a prediction score (Score). The independent risk factors and the Score were then entered into a multivariate Cox regression to create the nomogram. We used the Harrell concordance index (C-index) to quantify predictive accuracy and the receiver operating characteristic curve to evaluate model performance. Additionally, we used decision curve analysis to assess clinical usefulness. Results LASSO regression and multivariate Cox regression identified myoglobin, diabetes and pneumonia as risk variables, which were combined with the Score to develop the nomogram. The C-index was 0.927 in the development cohort, 0.924 in the internal validation cohort, and 0.922 in the external validation cohort. The predictive model also showed excellent calibration and high clinical value. Conclusions A nomogram predicting readmission of patients with respiratory failure within 365 days, based on three independent risk factors and a score derived from a random survival forest algorithm, has been developed and validated. This improves the accuracy of predicting patient readmission and provides practical information for individualized treatment decisions.


Introduction
Respiratory failure is a common medical emergency that leads to inadequate blood oxygen levels and/or an increase in blood carbon dioxide levels [1]. Patients with respiratory failure often have multiple cardiovascular and pulmonary complications or suffer from several diseases simultaneously, and therefore often require prolonged hospitalization, ventilatory support, and intensive care unit admission, resulting in high mortality and high readmission rates [2,3]. To better address these issues, many experts and clinicians have been dedicated to identifying prognostic models for respiratory failure [4,5]. We chose readmission as our research indicator because it can be costly for the healthcare system [6] and represents an economic and clinical burden for patients [7].
A report on disease-specific readmission rates suggested that strategies to reduce readmission rates could be successful, such as with heart failure [8,9]. However, there are few studies on readmission for respiratory failure. Currently, most research focuses on subtypes of respiratory failure or is limited to early readmission studies [10]. Studies also often include patients with respiratory failure who have other co-occurring conditions [11]. Because respiratory failure is a complex chronic disease caused by multiple risk factors, building a predictive model using Cox regression may not be very effective in predicting individual disease risk [12].
Therefore, the establishment of the readmission model for respiratory failure with complex diseases requires not only accurate risk assessment, but also the interpretation of results based on the importance of covariates to evaluate risk factors, with the ultimate goal of developing better diagnostic and treatment strategies [13]. In fact, important covariates may vary due to environmental factors, and from a clinical perspective, all selected covariates are meaningful [14]. In this case, it is necessary to ensure that the designed model has high prediction accuracy without overfitting, and is universally applicable in clinical diagnosis in the real world [15,16].
In this study, we attempted to solve this problem with the random forest model from machine learning, which handles complex high-dimensional data well. The purpose of this study was to establish a widely applicable nomogram to predict readmission within 365 days in patients with respiratory failure.

Study design
The flowchart of this multi-center prospective cohort study is shown in Fig. 1. The study event was respiratory failure. The start time of the study was the patient's admission to hospital with a diagnosis of respiratory failure based on blood gas analysis. The endpoint of the study was the occurrence of another respiratory failure event, recorded at the time of hospitalization for that event.
First, the samples collected from the First People's Hospital of Yancheng were divided into a modeling dataset and an internal validation dataset, and a nomogram was developed on the modeling dataset. Second, the model was evaluated using the internal validation dataset. Third, external validation was performed using data provided by the People's Hospital of Jiangsu Province. The current project followed the principles of the Helsinki Declaration. This study was approved by the ethics committees of the First People's Hospital of Yancheng (No. 2020-K062) and the People's Hospital of Jiangsu Province (No. 2021-SR-346). In addition, participants from both hospitals provided written informed consent to support clinical research.

Participants and data collection
We selected 744 patients with respiratory failure who were hospitalized in the First People's Hospital of Yancheng from October 2020 to September 2021. For external validation, we used a dataset of 223 respiratory failure patients who were hospitalized in Jiangsu Provincial People's Hospital from October 2021 to December 2021.
The inclusion criteria for research patients were as follows: Arterial oxygen partial pressure (PaO2) was less than 8.0 kPa (60 mmHg) or arterial carbon dioxide partial pressure (PaCO2) was greater than 6.0 kPa (45 mmHg) based on blood gas analysis [17].
Patients with incomplete clinical data, age less than 18 years, death within 24 h, trauma, malignant tumors, malignant hematological diseases, or pregnancy were excluded.
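The inclusion and exclusion rules above can be expressed as a simple screening function. The following is a minimal sketch in Python, not the authors' code; the function name and all parameter names are hypothetical and only mirror the criteria stated in the text.

```python
# Blood gas limits taken from the inclusion criteria in the text.
PAO2_LIMIT_MMHG = 60.0   # PaO2 below 60 mmHg (8.0 kPa) qualifies
PACO2_LIMIT_MMHG = 45.0  # PaCO2 above 45 mmHg (6.0 kPa) qualifies


def is_eligible(age_years, pao2_mmhg, paco2_mmhg, *,
                complete_data=True, died_within_24h=False, trauma=False,
                malignant_tumor=False, malignant_hematological_disease=False,
                pregnant=False):
    """Return True when a patient meets inclusion and no exclusion applies.

    All argument names are illustrative assumptions, not fields from the study.
    """
    # Inclusion: either blood gas abnormality is sufficient.
    meets_blood_gas = (pao2_mmhg < PAO2_LIMIT_MMHG
                       or paco2_mmhg > PACO2_LIMIT_MMHG)
    # Exclusion: any single condition removes the patient.
    excluded = (not complete_data
                or age_years < 18
                or died_within_24h
                or trauma
                or malignant_tumor
                or malignant_hematological_disease
                or pregnant)
    return meets_blood_gas and not excluded


print(is_eligible(74, pao2_mmhg=55.0, paco2_mmhg=40.0))
print(is_eligible(16, pao2_mmhg=55.0, paco2_mmhg=40.0))
```

The first call describes a hypoxemic adult with no exclusion criteria, the second a patient under 18 years of age.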

Follow-up index
The primary indicator for respiratory failure patients was the time from discharge to readmission due to respiratory failure. The information was collected from the two centers mentioned above, and patients were followed up for 365 days after discharge. To better track patients, we conducted telephone interviews and used hospital systems to verify the patient's condition further.

Model specification
Firstly, the variables with non-zero coefficients were screened using LASSO regression, followed by multivariate Cox regression to identify significant variables. Secondly, the variables excluded by the LASSO regression (coefficient shrunk to 0) and by the multivariable Cox regression were used together to construct a new model with the RSF method for prediction. Finally, the independent risk factors and the RSF-derived score were entered into a multivariate Cox regression model to construct the nomogram.

Statistical analysis
To ensure comparability between the two groups of patients, 744 respiratory failure patients were randomly divided into two groups containing 70% and 30% of patients, respectively. One group (n = 520) was used to develop the nomogram, while the other group (n = 224) was used to verify the predictive ability of the constructed model.
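As an illustration of the 70/30 random split, the sketch below shuffles patient identifiers reproducibly and cuts them into two disjoint groups. It is written in Python for illustration only (the study itself used R); the function name and the seed are assumptions, and the source does not report how the randomization was implemented.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=2020):
    """Shuffle the IDs reproducibly and cut them into two disjoint groups.

    The seed value is arbitrary and chosen only to make the example repeatable.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)  # rounds down
    return ids[:n_train], ids[n_train:]


modeling, validation = split_cohort(range(744))
print(len(modeling), len(validation))  # 520 224
```

Rounding the modeling group size down reproduces the group sizes reported in the text (520 and 224).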
To test the balance between the two groups, categorical variables were expressed as frequencies and percentages, and their differences were compared using the chi-square test. Continuous variables that followed a normal distribution were expressed as mean ± SD, and their differences were compared using the t-test. Continuous variables that did not follow a normal distribution were expressed as median and quartiles, and their differences were compared using the Mann-Whitney test.
The LASSO regression method was used to select significant features from the modeling set for multivariate Cox proportional hazards analysis, which screened for independent risk factors. The Random Survival Forest (RSF) algorithm was employed to construct a model from the variables whose coefficients were 0 following LASSO regression, and to compute the prediction Score. The predictive accuracy of the nomogram was quantified using the Harrell concordance index (C-index), and time-dependent receiver operating characteristic (ROC) curves with their areas under the curve (AUC) were used to evaluate the model's performance. The calibration of the nomogram predictions was evaluated using calibration curves. Additionally, decision curve analysis (DCA) was used to assess the clinical utility of the nomogram.
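To make the C-index concrete, the following is a minimal Python sketch of Harrell's concordance index for right-censored data. It is an illustrative simplification (pairs with equal follow-up times are skipped), not the routine used in the study, and the function and argument names are assumptions.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored data (simplified tie handling).

    A pair is comparable when the patient with the shorter follow-up time
    had the event. It is concordant when that patient also has the higher
    risk score; tied scores count as half. Pairs with equal times are
    skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: days to readmission, event flag (1 = readmitted), risk score.
print(harrell_c_index([30, 90, 180, 365], [1, 1, 1, 0], [0.9, 0.7, 0.4, 0.1]))
```

A value of 1.0 means every comparable pair is ordered correctly by the risk score, and 0.5 corresponds to random ordering.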
Individual risk scores were obtained based on the established nomogram. Risk stratification was determined using the ROC curve to identify the optimal threshold for the risk score. This critical value divided patients into high-risk and low-risk groups and provided the best difference for survival analysis between the risk groups. A p-value < 0.05 was considered statistically significant. All statistical analyses were performed using R (version 4.1.3).
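The text does not state which criterion defined the "optimal" ROC threshold. One common choice is Youden's J statistic (sensitivity + specificity - 1), sketched below in Python under that assumption. The sketch also simplifies the outcome to a binary label at a fixed time horizon; the function name is hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (cut-off, J) maximising sensitivity + specificity - 1.

    A patient is called high-risk when score >= cut-off. Using Youden's J
    is an assumption; the source only says the ROC curve was used.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes are required")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        # J = sensitivity - (1 - specificity) = TPR - FPR
        j = tp / positives - fp / negatives
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy risk scores with a binary readmission label (1 = readmitted).
cutoff, j = youden_threshold([10, 20, 30, 40, 50, 60], [0, 0, 0, 1, 1, 1])
print(cutoff, j)
```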

Baseline characteristics
In this study, we prospectively evaluated a total of 744 patients with respiratory failure who met the inclusion criteria. The evaluation data were summarized and randomly divided into two groups at a ratio of 7:3. Table 1 shows the baseline characteristics of patients in the modeling (n = 520) and validation (n = 224) cohorts. In the entire cohort, there were 487 males (65.46%) and 257 females (34.54%), with a median age of 74 years. Type 2 respiratory failure patients accounted for 68.15% of the cohort, while type 1 respiratory failure patients accounted for 31.85%. There were 448 (60.22%) smokers. The top three chronic diseases in terms of prevalence were COPD, with 536 (72.04%) patients, hypertension with 289 (38.84%) patients, and pneumonia with 169 (22.72%) patients. Additionally, 186 patients were readmitted, resulting in a readmission rate of 25%. The clinical characteristics between the two groups were well balanced and comparable. The external test set, which met the inclusion criteria, was obtained from the People's Hospital of Jiangsu Province.

Feature selection and nomogram construction
We applied the LASSO regression algorithm to the candidate features for feature selection in the modeling cohort. The optimal tuning parameter λ for the LASSO regression, at which the partial likelihood deviance reached its minimum, was 0.028. Figure 2A displays the coefficient paths across the sequence of log(λ) values. The LASSO analysis retained 11 variables with non-zero coefficients (Fig. 2B): glutamyltransferase, triglyceride, total cholesterol, myoglobin, lactic acid, carboxyhemoglobin, respiratory failure type, diabetes, cardiovascular disease, asthma, and pneumonia. Multivariate Cox proportional hazards analysis was performed using these eleven variables. Ultimately, three independent risk factors were retained: myoglobin, diabetes, and pneumonia (Table 2). The RSF algorithm was used to establish a model from the variables with a coefficient of 0 following LASSO regression and to compute the prediction score. Then, the three independent risk factors and the prediction score were integrated into a multivariate Cox regression model to construct a nomogram showing the probability of readmission (Fig. 3). To make the prediction model easy to use, we also developed a web-based version (https://respiratory.shinyapps.io/DynNomapp/).
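For reference, the LASSO-penalized Cox model is commonly written as the minimization below, where the second expression is the Cox log partial likelihood. The notation is an assumption introduced here and is not taken from the source: x_i is the covariate vector of patient i, δ_i the event indicator, R(t_i) the set of patients still at risk at time t_i, n the number of patients and p the number of candidate features. Scaling conventions for the likelihood term vary between software packages.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta}\left\{ -\frac{1}{n}\,\ell(\beta)
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\ell(\beta)
  = \sum_{i:\,\delta_i = 1}\left[ x_i^{\top}\beta
  - \log \sum_{k \in R(t_i)} \exp\!\left(x_k^{\top}\beta\right) \right]
```

Larger values of λ shrink more coefficients exactly to 0, which is why the selected value (0.028 in this study) determines how many variables are retained.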

Evaluation and validation of nomogram
The nomogram used to estimate the probability of readmission at 3, 6, and 10 months demonstrated good predictive ability. The C-index was 0.927 (95% CI: 0.910-0.944) in the development cohort, 0.924 (95% CI: 0.901-0.948) in the internal validation cohort, and 0.922 (95% CI: 0.898-0.946) in the external validation cohort.

Clinical application of nomogram
As shown in Fig. 6, decision curve analysis (DCA) indicated that the nomogram had promising clinical value in predicting the probability of readmission at 3, 6, and 10 months. This was evident in the modeling, internal validation, and external validation cohorts, where the nomogram outperformed the Cox regression model established using the three independent risk factors alone. The model was also well calibrated according to the calibration curves. The time-dependent AUC showed consistently high values for the training set, internal validation set, and external validation set at different time points, indicating a robust and stable discriminative ability of the prediction model across time intervals (Fig. 7).
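Decision curve analysis compares models by their net benefit across threshold probabilities. The following Python sketch shows the standard net benefit calculation for a binary outcome at a fixed time horizon. This is a simplification of the survival setting used in the study, and the function names are hypothetical.

```python
def net_benefit(predicted_risk, outcomes, threshold):
    """Net benefit of intervening on every patient with risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    pairs = list(zip(predicted_risk, outcomes))
    tp = sum(1 for r, y in pairs if r >= threshold and y == 1)
    fp = sum(1 for r, y in pairs if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: intervene on everyone regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy predicted risks and binary readmission outcomes.
risks = [0.9, 0.8, 0.2, 0.1]
readmitted = [1, 1, 0, 0]
print(net_benefit(risks, readmitted, 0.5))
print(net_benefit_treat_all(readmitted, 0.5))
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is 0.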
We used the nomogram to calculate the risk value for each patient and then used the ROC curve to determine the optimal threshold. Based on this, patients were classified into high-risk (total score ≥ 38.33) and low-risk (total score < 38.33) groups for predicting readmission. The Kaplan-Meier (K-M) curves showed that high-risk patients had a significantly higher readmission rate than low-risk patients across the training, internal validation, and external validation cohorts (Fig. 8).
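The stratification and the Kaplan-Meier comparison can be sketched as follows. This Python example is illustrative only: the cut-off of 38.33 comes from the text, while the function names and the toy data are assumptions.

```python
def stratify(total_points, cutoff=38.33):
    """Label each patient using the cut-off reported in the text."""
    return ["high" if p >= cutoff else "low" for p in total_points]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    S(t) is the estimated probability of remaining free of the event
    (here: readmission) beyond time t.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Toy data: nomogram points, days to readmission or censoring, event flag.
points = [50.0, 45.0, 38.33, 20.0, 10.0]
groups = stratify(points)
print(groups)
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Computing one curve per risk group and comparing them, for example with a log-rank test, corresponds to the comparison described in the paragraph above.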

Discussion
Our study includes demographic data and clinical information such as body mass index, gender, age, various rating scales, smoking history, and laboratory data (e.g., complete blood count, blood biochemistry, and blood gas). We also consider common and important comorbidities such as diabetes, hypertension, cardiovascular disease, and pneumonia. All the features mentioned above were used as candidate predictors in the outcome models.
Improving the predictive form of the RSF model can lead to more accurate individual patient prognoses when establishing a model [18]. The Random Survival Forest algorithm was employed to construct a model from the variables that obtained a coefficient of 0 following LASSO regression, and subsequently to determine the prediction score. Additionally, we used the independent risk factors and the score to establish a multivariate Cox regression, which traditional scoring systems were unable to achieve. In our study, we used the Cox model as a baseline predictive model because its simplicity allows for reproducibility and universality [19]. Afterwards, we conducted a random survival forest analysis, in which all predictive factors were included in a single model, using variable importance measures to assess the contribution of each variable to predicting survival [20]. For the risk of readmission at 3, 6, and 10 months in patients with respiratory failure, we found that the joint predictive model showed better calibration and discrimination than the single Cox regression model. This model can provide a basis for clinical decision-making. The predictive performance of the model was further validated using time-dependent AUC curves. Our analysis provided insights into predictors of readmission for respiratory failure. We found that patients with acute respiratory failure often returned to the hospital, with pneumonia, diabetes, and myoglobin identified as the most significant risk factors.
It has been confirmed that pneumonia is the most common cause of readmission for respiratory failure within a year, especially in subgroup analysis of patients undergoing invasive home mechanical ventilation (HMV) [21].
According to the American Association for Respiratory Care practice guidelines, readmission is usually caused by the worsening of underlying diseases, respiratory tract infections, airway-related side effects, and ventilation failure [22]. In a previous study involving children, 40-70% of discharged patients experienced unplanned readmissions within approximately 1-3 months after starting HMV, mainly due to pneumonia and respiratory issues [23]. Pneumonia is the main cause of readmission for chronic obstructive pulmonary disease during a one-year period [24]. These reasons for readmission therefore appear to be preventable.
Additionally, patients with acute respiratory failure frequently have elevated blood sugar levels [25]. Diabetes is the most significant risk factor for respiratory failure [26]. Currently, there is no research on the relationship between diabetes and readmission of respiratory failure patients [27,28]. However, diabetes is a risk factor for readmission of patients with chronic obstructive pulmonary disease [29]. This is mainly because patients with respiratory failure and diabetes are more prone to pulmonary infections, airway mucosal congestion, ciliary dysfunction, and airflow restriction, which can worsen respiratory failure and complicate treatment, necessitating timely intervention with ventilation measures [30].
Similarly, the association between myoglobin and readmission in patients with respiratory failure has not been studied. Myoglobin is an iron- and oxygen-binding protein that is involved in the regulation of the mitochondrial respiratory chain complex IV [31]. Myoglobin plays an important role in oxygen storage in skeletal and cardiac muscles, especially in situations of hypoxemia [32]. Long-term hypoxia can cause non-specific damage to multiple organs. This can lead to increased myoglobin expression, which is related to the degree of hypoxia [33]. Elevated levels of myoglobin in critically ill patients with severe infections, burns, shock, and multiple traumas can predict survival rates and patient prognosis [34]. Studies have shown that myoglobin is a predictive indicator of mortality and risk of deterioration in coronavirus disease 2019 (COVID-19) patients with respiratory failure [35]. Moreover, elevated levels of myoglobin may be due to other comorbidities, such as chronic obstructive pulmonary disease (COPD) and cardiovascular disease [36]. Myoglobin can be released rapidly into the bloodstream in response to inflammatory stimuli [37]. It is negatively correlated with the percentage of predicted forced expiratory volume in 1 s (FEV1) in COPD patients [38]. Our results suggest that serum myoglobin levels may be useful in understanding the progression of critical illness, particularly in predicting readmission due to respiratory failure.
Our research has several advantages. Firstly, we provide a simple and feasible tool for identifying patients who are at risk of readmission. By doing so, interventions can be targeted towards reducing readmissions. Secondly, patients with respiratory failure are at high risk for readmission and should be studied as a group to benefit from personalized plans that reduce readmissions.
There are some limitations to this study. First, there may be potential selection bias, such as sample selection bias and patient inclusion bias. Second, larger-scale studies are needed to confirm the predictive ability of the respiratory failure prediction model.
In summary, this study combined a score derived from a random survival forest algorithm with three independent risk factors for readmission in a Cox regression model. The resulting model provided a simple and user-friendly tool for predicting the probability of readmission within 365 days for patients with respiratory failure. Internal and external validation further demonstrated the broad applicability and reliability of this model in the classification and management of respiratory failure patients, thereby assisting in timely clinical decision-making.

Fig. 1 Flow chart of this study

Fig. 2 LASSO regression model was used to select feature variables. (A) LASSO coefficient curves for the 11 features. (B) The adjustment parameter (lambda) in the LASSO regression was selected using 10-fold cross-validation

Fig. 6 Analysis of decision curve in nomogram. (A) The decision curve analysis of the nomogram in the modeling set. (B) The decision curve analysis of the nomogram in the internal validation set. (C) The decision curve analysis of the nomogram in the external validation set. Solid red lines represent the columns

Fig. 4 Detection of receiver operating characteristic (ROC) curve. (A) The ROC curve of the modeling set. (B) The ROC curve of the internal validation set. (C) The ROC curve of the external validation set. The red, yellow, and blue AUC curves show the discrimination of the model at 3, 6, and 10 months. The corresponding 95% confidence interval estimates are highlighted in black text

Fig. 8 Individual risk scores obtained from the established nomogram. In the modeling set (A), internal validation set (B) and external validation set (C), individual risk scores were obtained according to the established nomogram, and patients were divided into a high-risk group and a low-risk group according to the critical value to show the best difference in readmission analysis between risk groups

Table 1
Demographic and clinical characteristics of patients

Table 2
Parameters used to develop a predictive model for respiratory failure readmission

Fig. 3 A nomogram predicting readmission risk for respiratory failure