Does the 2013 GOLD classification improve the ability to predict lung function decline, exacerbations and mortality: a post-hoc analysis of the 4-year UPLIFT trial

Background The 2013 GOLD classification system for COPD distinguishes four stages: A (low symptoms, low exacerbation risk), B (high symptoms, low risk), C (low symptoms, high risk) and D (high symptoms, high risk). Assessment of risk is based on exacerbation history and airflow obstruction, whatever results in a higher risk grouping. The previous system was solely based on airflow obstruction. Earlier studies compared the predictive performance of new and old classification systems with regards to mortality and exacerbations. The objective of this study was to compare the ability of both classifications to predict the number of future (total and severe) exacerbations and mortality in a different patient population, and to add an outcome measure to the comparison: lung function decline. Methods Patient-level data from the UPLIFT trial were used to analyze 4-year survival in a Weibull model, with GOLD stages at baseline as covariates. A generalized linear model was used to compare the numbers of exacerbations (total and severe) per stage. Analyses were repeated with stages C and D divided into substages depending on lung function and exacerbation history. Lung function decline was analysed in a repeated measures model. Results Mortality increased from A to D, but there was no difference between B and C. For the previous GOLD stages 2–4, survival curves were clearly separated. Yearly exacerbation rates were: 0.53, 0.72 and 0.80 for stages 2–4; and 0.35, 0.45, 0.58 and 0.74 for A-D. Annual rates of lung function decline were: 47, 38 and 26 ml for stages 2–4 and 44, 48, 38 and 39 for stages A-D. With regards to model fit, the new system performed worse at predicting mortality and lung function decline, and better at predicting exacerbations. Distinguishing between the sub-stages of high-risk led to substantial improvements. Conclusions The new classification system is a modest step towards a phenotype approach. It is probably an improvement for the prediction of exacerbations, but a deterioration for predicting mortality and lung function decline. Trial registration ClinicalTrials.gov NCT00144339 (September 2, 2005). Electronic supplementary material The online version of this article (doi:10.1186/1471-2466-14-163) contains supplementary material, which is available to authorized users.


Background
The Global initiative for chronic Obstructive Lung Disease (GOLD) classification for severity of chronic obstructive pulmonary disease (COPD) is used to classify individual patients, describe study populations, monitor disease progression, and guide individual treatment decisions.
Consensus has grown that the previous GOLD classification, which was entirely based on forced expiratory volume in 1 second as percentage of the predicted value for someone of the same gender, age and height (FEV 1 %-predicted), was an insufficiently reliable predictor of the variety of manifestations of the disease [1][2][3][4]. For example, frequent exacerbators are also found among patients with relatively mild forms of airway obstruction [5]. This is important, since exacerbations do not only predict future exacerbations but are also a risk factor of faster disease progression and mortality [6]. There have been pleas for a more explicit recognition of the variety of COPD phenotypes which should improve understanding of the impact of the disease and, more importantly, provide prognostic information and guide the selection of more appropriate therapies [7].
In 2011, GOLD presented a new classification system, which was adapted slightly in 2013 [8]. This new classification distinguishes four groups of patients, based on symptoms and exacerbation risk. The assessment of the latter can be based on either exacerbation history or degree of airflow limitation, whatever results in a higher risk. Symptoms are to be assessed using either the modified British Medical Research Council questionnaire (mMRC) [9], which measures breathlessness, or the COPD Assessment Test (CAT) [10], which provides a more comprehensive assessment of the symptomatic impact of COPD.
Recently, several studies investigated the prognostic value of the old and the new (2011/2013) systems with regards to a number of outcome measures. Mortality was predicted equally well by both systems in studies by Soriano et al., Agustí et al. and Johannessen et al. [11][12][13], whereas Leivseth et al. found that the old classification performed better [14]. Exacerbations and hospitalisations were predicted better by the new system according to Lange et al. and Agustí et al. [12,15], but Johannessen et al. saw no difference in performance between the systems [13]. So far, only one study examined the ability of the new system to predict lung function decline [12]. It did not find differences in predicted lung function decline across severity stages. However, in this study no comparison with the old system was made.
It is important that results from these studies are replicated or contradicted in different populations.
Data from the "Understanding Potential Long-term Impacts on Function with Tiotropium" (UPLIFT) trial [16][17][18] provide the opportunity to investigate the prognostic performance of the new classification system with four years of follow-up. This trial is especially suitable for this purpose, not only because of its duration, but also because of its size (almost 6,000 patients randomized), international origin, and high quality-controlled lung function data.
The aim of this study, therefore, was to compare the ability of the old and the new (i.e. 2013) COPD classification to predict future decline in lung function, mortality, the total number of exacerbations and the number of severe exacerbations.

Data
The UPLIFT trial was a multinational, randomized, double-blind, placebo-controlled trial, investigating the effect of tiotropium on the yearly rate of decline in FEV 1 in ≥40 years old, currently or formerly smoking patients (≥10 pack-years) with moderate to very severe COPD according to the old GOLD classification system (stages 2 to 4, post-bronchodilator FEV 1 of 70% or less of the predicted value) [16,17]. Key exclusion criteria were a history of asthma, a COPD exacerbation or respiratory infection within 4 weeks before screening, a pulmonary resection, use of supplemental oxygen for more than 12 hours per day, and coexisting illnesses that could preclude participation in the study or interfere with the study results.
Patients received either 18 μg of tiotropium or a matching placebo once daily. All respiratory medications, except other inhaled anticholinergic drugs, were permitted during the trial. Smoking cessation programs were offered to all patients before randomization.
Patients were recruited from 2003 to 2004 at 487 centres in 37 countries. The study protocol was approved by the ethics committee at each centre, and all patients provided written informed consent [17]. The follow-up period was four years, in which lung function, exacerbations, St. Georges Respiratory Questionnaire (SGRQ) [19] and mortality were recorded. Exacerbations were defined as an increase in or the new onset of more than one respiratory symptom (cough, sputum, sputum purulence, wheezing, or dyspnoea) lasting three days or more and requiring treatment with an antibiotic or a systemic corticosteroid, and/or a hospitalisation. Patients were assessed at randomisation, after one month, six months, and every six months thereafter. For the base case analyses the data from the two treatment groups (tiotropium and control) were combined.
Data from 5630 patients were used in the analysis.

Mortality
Time to death was analysed in Weibull regression models, with either the old or new GOLD classification as covariates as well as other prognostic factors. These were selected in an iterative backward selection process, in which the covariate with the highest p-value was excluded until all p-values were below 0.20. Candidates for inclusion in the model were age, gender, body mass index (BMI), smoking status and the presence or absence of several co-morbidities (coronary heart disease, arrhythmia, vascular disease, nervous disease, diabetes, depression and anaemia). The regression results were used to construct average adjusted survival curves, following the procedure proposed by Hernàn [20]. First, the model coefficients were used to fit multiple individual survival curves for each patient. Each curve assumed a different GOLD stage, irrespective of the actual classification of the patient. The other baseline characteristics were kept constant within patients. After this, mean survival probabilities per 6-month interval were calculated over all patients for each stage and each point in time. These probabilities were then used for constructing survival curves per stage. This was done to assure that differences in the curves would be due to different severity stage assignments only, and not to other differences (e.g. demographic differences) between the groups.
The models' performance was compared by visually inspecting the ranges over which 4-year mortality differed across stages, by using the Akaike Information Criterion (AIC) for model fit [21] and by Harrell's c-statistic for the measure of discrimination across stages [22,23]. A c-statistic of 0.5 means that a model has no predictive discrimination, in other words, that it has a 50% chance of correctly predicting which of two subjects in different risk categories has the highest probability of experiencing the event. There is no universally used interpretation of the value of the c-statistic. In the context of logistic regression, Hosmer et al. consider values of 0.7 to 0.8 to indicate acceptable discrimination, while discrimination is considered excellent between 0.8 and 0.9 and outstanding when the c-statistic ≥0.9 [24].
The AIC is a measure to compare the goodness-of-fit of different statistical models. Its absolute value has no interpretation. A difference in AIC of ≥4 is often considered an indication that the model with the higher AIC fits the data less well [25].

Exacerbations
Negative binomial regression with adjustment for treatment exposure was applied to analyse the total rate of exacerbations. The regression model contained either the old or the new GOLD stages, as well as other prognostic factors if necessary. In an iterative backward selection process, the covariate with the highest p-value was excluded from the model unless this led to a 10% change in the estimate of the annual exacerbation rate [26].
The regression results were used to estimate mean rates per GOLD stage. For each patient, the number of exacerbations per year was predicted for each stage, given the patient's characteristics but irrespective of the actual classification of the patient, and assuming 365.25 days per year. The individual predictions per disease stage were then averaged over all patients.
The performance of the new model was compared with that of the old model by visually inspecting the ranges over which rates differed across stages and by using the AIC for model fit. This was repeated for severe exacerbations, which were defined as COPD exacerbations requiring a hospital admission.

Lung function decline
Lung function decline, expressed as the deteriorating course of post-bronchodilator FEV 1 , was analysed in a linear random effects model. This analysis started at day 30 in order to take into account the fact that many patients experienced an initial post-randomisation improvement in lung function. Covariates were days since randomisation and interactions of GOLD stage and days. These interactions were used to describe decline for each stage. The intercepts and the slope for time since randomisation were assumed to be random with an unstructured covariance matrix and the interactions were modelled as fixed effects. Patients with at least three measurements from day 30 were included. The regression results were used to estimate mean annual lung function decline per GOLD stage. The annual rate of decline per disease stage was determined by multiplying the regression coefficient for this stage by 365. 25. The selection of covariates took place along the same lines as for exacerbations. The models' performance was compared by visually inspecting the ranges over which rates differed across severities and by using the AIC for model fit.

Classification
Patients were classified into GOLD stage 2 to 4, based on post-bronchodilator FEV 1 % predicted (50-70%, 30-50%, <30%) and into GOLD stage A to D, based on the 2013 GOLD classification [8]. Patients were considered a high risk for an exacerbation if they had a FEV 1 % predicted <50%, or experienced at least two exacerbations in the previous year, or had been admitted to the hospital with an exacerbation at least once during the previous year. The number of exacerbations in the year before randomization was defined as the number of courses of oral corticosteroids or antibiotics or the number of hospitalisations, whichever was the highest.
Since the dataset did not contain CAT or mMRC scores, on which the symptom dimension of the classification is supposed to be based, the Saint Georges Respiratory Questionnaire score (SGRQ) was used instead. The SGRQ measures perceived well-being in COPD patients and the impact of the disease on their activities. Patients with an SGRQ score ≥25 were placed in the 'high level' symptoms category. This threshold value was found by Han et al. to have the strongest correspondence with the CAT threshold ≥10 [27].

Substages
All analyses with the new GOLD classification were repeated with substages of C and D. Patients were assigned to substages based on the reason for being considered high-risk: FEV 1 % predicted <50% but no history of frequent exacerbations (C1 and D1), history of frequent exacerbations but FEV 1 % predicted ≥50% (C2 en D2), or FEV 1 % predicted <50% combined with a history of frequent exacerbations (C3 en D3).

Sensitivity analyses
All analyses were repeated with a different threshold for symptom severity: SGRQ ≥39. This value was found by Han et al. to have the strongest correspondence with the mMRC threshold of 2 [27].
Furthermore, the analyses with the SGRQ ≥25 threshold were repeated in the control group separately. Table 1 describes the distribution of the patients across old and new GOLD stages. Patients from stage 2 were classified into all four new stages, with the majority in B. Almost all stage 3 and stage 4 patients were classified into stage D.

Patient characteristics
The baseline characteristics of the patients, divided by the new GOLD stages are presented in Table 2. The largest group is formed by patients in stage D. GOLD B contained the highest proportion of current smokers. The time since diagnosis was the longest for D. Airway obstruction was similar for A and B, and for C and D. The number of different types of respiratory medications and the number of courses of antibiotics and oral steroids in the year before randomisation increased from A through D. Patients in A and B had not been admitted to the hospital in the year before randomisation.

Mortality
The covariates in the final model were age, sex, BMI, smoking status and the presence of the comorbidities coronary heart disease, vascular disease, diabetes and depression. The adjusted survival curves for each GOLD stage, adjusted for age, sex and selected co-morbidities are presented in Figures 1 and 2, for the old and the new classification, respectively.
Mortality increases with stage in both systems. However, all differences between stages were statistically significant at the 1%-level for the old classification, while in the new system only the difference between C and D was significant. The curves for B and C almost overlap (p = 0.67). Their distances to stage A are borderline significant (p = 0.08). After 4 years, 7.4% of patients in GOLD A had died, compared to 18.8% in GOLD D. These proportions were further apart for the old stages 2 and 4: 10.7% and 33.5%, respectively. Table 3 shows that Harrell's c-statistics for discriminative performance were similar for all three classification systems. All models had a discrimination that falls slightly short of being acceptable, in the interpretation of Hosmer et al. [24]. The best model fit, as measured by the AIC, was achieved for the old model. This was true in over 99% of the bootstrap replications. Figure 3 shows that there were important differences in predicted mortality across substages of D. Patients in substages with a strongly diminished lung function (D1 and D3) were more likely to die than those in D2 (p < 0.01). Being at high risk for exacerbations added relatively    little to the mortality risk. In the substages of C, no difference in mortality was found.

Exacerbations
The final regression models for the total number of exacerbations and for the number of severe exacerbations contained GOLD classification as the sole covariate. Table 4 shows the annual exacerbation rates for patients classified into the GOLD groups. The exacerbation rate increased with disease severity in both the old and the new system. The rates in the new classification covered a broader range than the rates in the old stages and the new classification system had a much better AIC than the old system.
The exacerbation rates varied widely between the substages of C and D. While the exacerbation rate in C1 (no history of frequent exacerbations) was similar to the rate in B, patients in C3 (low lung function and history of frequent exacerbations) experienced more exacerbations than patients in D overall. Symptoms, lung function and exacerbation history were all related to the exacerbation rates.
The patterns are less clear for severe exacerbations ( Table 5). The old stages showed a broader range of rates than the new stages. Substages C3 and D3 had the highest number of severe exacerbations, although C3 did not differ from D overall.

Lung function decline
The final regression models contained disease severity as the sole covariate. Overall, stages with relatively good lung function at baseline showed a faster decline over the course of the trial (see Table 6). The predicted annual rates of decline covered a broader range for the model with the old stages 2, 3 and 4 than for the model with stages A, B, C and D. Furthermore, the model with the old classification had the best fit in terms of the AIC. The models with the new GOLD classification with and without the substages had a similar fit. The substage of patients who started the trial with a relatively good lung function, C2 and D2, experienced a decline that was comparable to stages A and B. The other substages had stronger declines.

Sensitivity analyses
The results of the sensitivity analyses are presented in the Additional file 1. The same patterns in relative predictive power can be seen as in the base case analyses. Using the SGRQ ≥ 39 threshold, however, led to improvement of all AICs for the new classification system.
Similarly as to what was found in the base case analysis, all three mortality models had very similar c-statistics (Additional file 1: Table S1). In contrast with the primary analysis, the best AIC was achieved by the new mortality model with substages. The old and new classification models had the same fit.
The predicted exacerbation rates in the new classification covered a broader range than the rates in the old classification (Additional file 1: Tables S2 and S3). The new classification system also had a much better AIC than the old system, and the AIC for the new classification with substages was even better. Predicted exacerbation rates were slightly higher when the SGRQ ≥ 39 threshold was used.
With regard to lung function (Additional file 1: Table  S4), annual decline rates covered a broader range for the old model, which also had the best AIC. Rates of lung function decline were not different for different SGRQ thresholds.
When the analyses with the original threshold were repeated on the control group separately, similar patterns were found (see Additional file 1: Table S5). For severe exacerbations, the best fit was achieved by models with the new system with the new classification system with substages. For mortality and lung function the best fit was achieved by models with the old classification system.

Discussion
This study compared the prognostic performance of the old and new GOLD classifications for COPD regarding mortality, exacerbations and decline in lung function. The findings depend on the outcome measure.
As for mortality, both classification systems discriminated equally well, but the old model performed better in terms of model fit. The loss of information on lung function,  which was grouped into fewer categories in the new system, does not appear to have been completely mitigated by the added information on symptoms and exacerbation history in the new system. With regard to (severe) exacerbations, all three dimensions of the new GOLD classification strongly contributed to the predictions. This led to a much better performance for the new classification system than for the old system.
With regard to lung function decline, however, the predictive power of the old system was much better. Information on symptom level and exacerbation history did not improve the ability to predict decline of FEV 1 .
Our study is the first to compare the old and new system's ability to predict lung function decline. Agusti et al. did assess the decline across the new stages [12] but did not compare the two classification systems. Furthermore, they did not find significant differences in decline, whereas patients with a worse lung function in our data showed a slower decline. This pattern was less clear in the new system than in old system, but still clear and statistically significant. Combining patients with a low lung function and history of frequent exacerbations into the same stages hides the major differences between these patients. This was also observed in earlier studies [12,15]. Dividing the stages into substages, depending on the reason for which patients are considered high-risk, is very informative and could improve recommendations in individual treatment decisions and in the preparation of treatment guidelines.
The aim of the new guidelines is to enhance the understanding of the impact of COPD on individual patients by combining 'the symptomatic assessment with the patient's spirometric classification and/or risk of exacerbations' [8]. Although lung function in itself does not have a direct impact on patientsit only does so through symptoms, exacerbation risk and mortality riskit still is an important aspect of disease severity, and hence of the new classification system, because it is a better predictor of mortality than symptoms and exacerbations.
Using trial data for a study like this has advantages and disadvantages. Among the advantages is the high quality of the spirometry data because of there was a good quality control system in place. A disadvantage is that a trial population shows less variation in patient and disease characteristics than a real-life population because of the in-and exclusion criteria. Furthermore, the exacerbation rate in the UPLIFT trial was relatively low. Despite this we found that the new classification system was clearly better in predicting exacerbations than the old classification system.
For all analyses we combined the data from the two treatment groups in the UPLIFT trial. We performed additional analyses with treatment as a covariate. This did not lead to different conclusions.
A limitation of this study is that our data contained no information on the mMRC or CAT scores, which are the recommended ways of establishing symptom severity in  COPD patients. However, SGRQ and CAT are highly correlated [31]. According to the authors of the new GOLD guidelines, 'the crucial aspect is to consider whether the patient has only trivial symptoms or feels significantly limited by them' [32]. Several scales can be used for that purpose. In fact, the authors note that updates of the guidelines may include other scales. Nevertheless, different scales may lead to different categorisations. The currently proposed cut-off points of the CAT and mMRC do not lead to exactly the same classification of patients [27,33]. More specifically, patients were 25% less likely to be classified as C instead of D when the mMRC criterion was applied [27]. The current CAT cut-off point of 10 or more appears to be more in line with a mMRC score of 1 instead of 2 [27,34].
Earlier studies based the categorisation on the mMRC ≥2 [11][12][13][14][15]. Overall, their findings are in line with ours, using SGRQ ≥25 as a surrogate for CAT ≥10. Furthermore, we found similar results when we used a higher SGRQ threshold as a surrogate for mMRC ≥2 in the comparison of the old and new classification. This is consistent with the guideline statement that does not attach particular importance to the choice for a specific symptom scale. Nevertheless, the model fit was better when the higher SGRQ threshold was used.
In summary, in the UPLIFT population of moderate to very severe COPD patients, the 2013 GOLD classification performed better than the old classification when predicting future exacerbations, whereas the old classification system performed equally well or better when predicting mortality and lung function decline.

Conclusion
Combining our results in the UPLIFT data with those from earlier studies in different patient populations leads to the conclusion that the new classification system is a modest step towards a phenotype approach. The new system is probably an improvement for the prediction of exacerbations, but a step back with regards to predicting mortality and lung function decline.
Abbreviations AIC: Akaike information criterion; BMI: Body mass index; CAT: COPD assessment test (CAT); COPD: Chronic obstructive pulmonary disease; GOLD: Global initiative for chronic obstructive lung disease; mMRC: Modified British medical research council questionnaire; SGRQ: St. Georges respiratory questionnaire (SGRQ); UPLIFT: Understanding potential long-term impacts on function with tiotropium.

Competing interests
This study was funded by Boehringer Ingelheim GmbH.

Authors' contributions
LG: design, data analysis and interpretation, manuscript writing. IL, NM: design, interpretation, manuscript revision. KB: interpretation, manuscript revision. MPMHR: conception and design, interpretation, manuscript writing. All: final approval of the manuscript.