Assessment of measurement properties of peak VO2 in children with pulmonary arterial hypertension

Background: The 6-minute walk test evaluates the effect of pharmacologic intervention in adults with pulmonary arterial hypertension (PAH) but, for reasons of compliance or reliability, may not be appropriate for children at all ages. Thus, peak oxygen consumption (VO2, maximal exercise test) was used instead in a pediatric PAH trial (STARTS-1) to evaluate pharmacologic intervention with sildenafil. This was the first large placebo-controlled trial to use the peak VO2 endpoint in this population. Our working hypothesis was that, as with other populations, percentage changes in peak VO2 in pediatric patients with PAH are reliable and are associated with changes in other clinical endpoints.
Methods: Using data from the subpopulation of 106 patients who were developmentally and physically able to perform exercise testing, all of whom were World Health Organization Functional Class (WHO FC) I, II, or III, reliability was assessed using the intraclass correlation coefficient and Bland-Altman plot on screening and baseline data. Relationships between percentage change in peak VO2 from baseline to end of treatment and other endpoints were evaluated using correlation coefficients and regression analyses.
Results: The intraclass correlation was 0.79 between screening and baseline peak VO2, an agreement that was supported by the Bland-Altman plot. Percentage change in peak VO2 correlated well (r ≥0.40) with, and showed responsiveness to, a physician global assessment of change and change in WHO FC (for baseline classes I and III). Percentage change in peak VO2 did not correlate with change in the Family Cohesion domain of the Child Health Questionnaire (r = 0.04) or with a subject global assessment of change (r = 0.12). The latter may have been influenced by child and parental-proxy response and instrument administration.
Conclusion: In pediatric PAH patients who are developmentally and physically able to perform exercise testing, peak VO2 measurements exhibited good reliability and improvements that were associated with improvements in certain other clinical endpoints, such as the WHO FC and a physician global assessment.
Trial registration: ClinicalTrials.gov identifier NCT00159913.

Background
Pulmonary arterial hypertension (PAH) is a relatively rare condition associated with high mortality [1]. It is characterized by increased pulmonary vascular resistance and pulmonary arterial pressure leading to right ventricular failure and ultimately death [2]. It may be inherited (heritable PAH [HPAH], classified as familial or sporadic), develop spontaneously (idiopathic PAH [IPAH]), or occur in association with congenital heart defects, connective tissue disease, or other causes (associated PAH [APAH]) [3]. Oral sildenafil citrate (REVATIO®, Pfizer Inc, New York, NY) has been found to be efficacious and generally well tolerated in the treatment of chronic PAH in adults, both as disease-specific monotherapy and as add-on to intravenous therapy with epoprostenol [4,5]. However, safe and effective therapy to increase the functional capacity, quality of life, and survival of pediatric patients with PAH is also needed.
A widely used, noninvasive technique to assess PAH severity and response to treatment is the 6-minute walk test, which is based on improvements in submaximal exercise capacity [6,7]. However, when the first large, multicenter, randomized, double-blind, placebo-controlled clinical trial to investigate the effectiveness of sildenafil treatment for PAH in children who require treatment despite conventional therapy was being designed (ClinicalTrials.gov: NCT00159913) [8], many specialists believed that compliance with the directions for the 6-minute walk test could be difficult for children. Children may become uninterested or demotivated by factors unrelated to PAH, which could impact reliability of the test. Additionally, they may walk at a variable pace, resulting in unreliable or unstable measurements. Thus, for the design of the clinical trial, it was decided to use formal cardiopulmonary exercise testing that could be more readily standardized.
The ability to perform aerobic work is defined by peak oxygen consumption (VO2) at maximal effort [9]. Peak VO2 is a parameter of noninvasive cardiopulmonary exercise testing that is affected by age, sex, conditioning status, disease, or medications. Its prognostic value in terms of survival has been demonstrated in adult patients with IPAH [10]. Thus, percentage change from baseline to end of treatment in peak VO2 was selected as the primary efficacy endpoint in the controlled clinical trial of sildenafil treatment for PAH in children, making it the first trial of its kind with the potential to evaluate the correlation between changes in peak VO2 and other clinical endpoints [8].
The aim of this paper is to investigate the measurement properties of peak VO2 in terms of its associations with other clinical endpoints and its reliability. It was hypothesized that, as observed with other populations, percentage changes in peak VO2 in pediatric patients with PAH are reliable and are associated with changes in certain clinical endpoints.

Methods

Data Set
The data set was derived from the Sildenafil in Treatment-naive children, Aged 1-17 years, with pulmonary arterial hypertension (STARTS-1) trial, a multinational trial of sildenafil citrate with a 16-week, double-blind, placebo-controlled treatment phase [8]. Pediatric patients (aged 1-17 years) weighing ≥8 kg were included if they had IPAH, HPAH, or APAH associated with congenital heart defects or connective tissue disease. PAH (defined as mean pulmonary artery pressure ≥25 mmHg at rest, pulmonary capillary wedge pressure ≤15 mmHg [or mean left atrial pressure ≤15 mmHg or left ventricular end-diastolic pressure ≤15 mmHg], and pulmonary vascular resistance index ≥3 Wood units × m²) was confirmed by right heart catheterization at baseline. Concurrent medication remained stable throughout the trial except for changes made for safety reasons. Nitrates, cytochrome P450 3A4 inhibitors, prostacyclin analogues, endothelin receptor antagonists, phosphodiesterase type 5 inhibitors (other than study medication), and arginine supplements were not allowed.
The trial was conducted in compliance with the ethical principles of the Declaration of Helsinki. The final protocol, any amendments, and informed consent documentation were reviewed and approved by the Institutional Review Boards and/or Independent Ethics Committees at each of the investigational centers participating in the study. Written informed consent was obtained from each child's legal guardian and assent from each child when applicable.
Patients were stratified by developmental ability to perform cardiopulmonary exercise testing (bicycle ergometer) and by weight. Dosage of sildenafil was dependent on weight and doses were selected to achieve maximum plasma concentrations of 47 (low dose), 140 (medium dose), and 373 (high dose) ng/mL at steady state [8]. The 8-kg to 20-kg group was randomized 1:2:1 to sildenafil medium (10 mg) and high (20 mg) doses and placebo, respectively. The >20-kg to 45-kg group was randomized 1:1:1:1 to sildenafil low (10 mg), medium (20 mg), and high (40 mg) doses and placebo, respectively. The >45-kg group was randomized 1:1:1:1 to sildenafil low (10 mg), medium (40 mg), and high (80 mg) doses and placebo, respectively. Study medication was administered 3 times daily, ≥6 hours apart for 16 weeks. All patients randomized to sildenafil received 10 mg 3 times daily for 1 week followed by titration to assigned dose. A total of 234 patients were randomized and treated, of whom 115 were developmentally and physically able to perform exercise testing.
The primary efficacy endpoint in the STARTS-1 trial was percentage change in peak VO2 (normalized to body weight), measured in mL/kg/min, from baseline to week 16 or end of treatment (at trough plasma concentrations [before dosing or ≥4 h postdose]). Peak VO2 was assessed by cardiopulmonary exercise testing in those who were developmentally able to participate and achieved functional capacity limits for peak VO2 of ≥10 mL/kg/min and ≤28 mL/kg/min at screening [8]. Other endpoints used in the current correlational analyses included the following: a physician global assessment of change (PGA) and a subject/parent global assessment of change (SGA), which are 7-point rating scales ("markedly improved," "moderately improved," "mild improvement," "no change," "slightly worse," "moderately worse," and "markedly worse"); World Health Organization Functional Class (WHO FC, in which FC I represents no limitation of physical activity, FC II represents slight limitation, FC III represents marked limitation, and FC IV represents inability to carry out any physical activity without symptoms) [11]; and the Family Cohesion domain of the parent form of the Child Health Questionnaire [12].

Correlational analyses
The analysis plan was formed prospectively (before conducting any analysis), with all analyses conducted in SAS/STAT® Version 8.2 (SAS Institute, Cary, NC). Analyses were based on peak VO2 data collected at baseline and at the end of treatment.

Reliability
Reliability refers to the reproducibility of the measurement when repeated at random in the same patient. Patients whose peak VO2 status has not changed should have a similar, or repeatable, response each time they are assessed. If there is considerable variability, the measurements are unreliable and results will be uninterpretable.
To assess test-retest reliability (stability), we examined the strength of agreement between peak VO2 pretreatment measurements at screening and baseline (up to 21 d after screening); no post-randomization data were used. We calculated the intraclass correlation (ICC) along with its confidence interval (CI), which estimates the proportion of all variation that is not due to measurement error [13,14]; a value ≥0.7 indicates acceptable reliability [15]. We also calculated the Pearson correlation coefficient, which gauges the magnitude of the linear relationship between the screening and baseline measurements. In addition, we constructed a Bland-Altman plot, which depicts agreement between screening and baseline measurements [16].
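As an illustration of these two reliability summaries, the following is a minimal sketch rather than the trial's actual SAS analysis. It assumes a one-way random-effects ICC for two measurements per patient (the source does not state which ICC form was used) and the conventional 1.96 × SD limits of agreement for the Bland-Altman summary. The function names and the four pairs of peak VO2 values are hypothetical.

```python
import math

def icc_oneway(first, second):
    """One-way random-effects ICC for two measurements per patient (assumed form)."""
    n, k = len(first), 2
    subject_means = [(a + b) / 2 for a, b in zip(first, second)]
    grand_mean = sum(subject_means) / n
    # Between-patient and within-patient mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(first, second, subject_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def bland_altman(first, second):
    """Mean difference (bias) and 95% limits of agreement between two visits."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical peak VO2 values (mL/kg/min), not trial data.
screening = [15.0, 18.0, 21.0, 24.0]
baseline = [16.0, 17.0, 22.0, 23.0]

icc = icc_oneway(screening, baseline)
bias, lower, upper = bland_altman(screening, baseline)
print(f"ICC = {icc:.3f}; bias = {bias:.2f}; limits of agreement = {lower:.2f} to {upper:.2f}")
```

In a Bland-Altman plot, each patient's difference between visits is plotted against the mean of the two visits; the bias and limits computed here correspond to the horizontal reference lines usually drawn on such a plot.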

Associations with Peak VO2
Associations were evaluated by calculating Pearson correlation coefficients between the percentage change (baseline to end of treatment) in peak VO2 and each of the following measures: the PGA; the SGA; change (baseline to end of treatment) in WHO FC by baseline FC; and change (baseline to end of treatment) in the Family Cohesion domain [12]. In sensitivity analyses, the corresponding Spearman-rank correlation coefficients were also examined.
For each of the prespecified correlational analyses, three sets of Pearson correlations were calculated: (1) pooled across treatment groups, (2) by treatment group (placebo separate from all sildenafil groups combined), and (3) partial, adjusting for (or partialing out) treatment. Differences in results among them were noted. It was hypothesized that associations would be meaningful (≥0.40, consistent with a meaningful correlation [17]) between percentage change in peak VO2 and all of the other measures except for change in the Family Cohesion domain. Correlation coefficients less than 0.30 were taken as less than meaningful [18]. Those between 0.30 and 0.40 were taken as ambiguous in their import.
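The difference between a pooled and a partial correlation can be sketched as follows. This is an illustrative example only: it assumes treatment is coded as a single 0/1 indicator (placebo versus any sildenafil dose) and uses the standard first-order partial correlation formula. The function names and toy values are hypothetical and were chosen so that the pooled and partial results differ sharply.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after partialing out a single covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: x = percentage change in peak VO2, y = a global
# assessment score, z = treatment indicator (0 = placebo, 1 = sildenafil).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
z = [0, 0, 1, 1]

pooled = pearson(x, y)            # ignores treatment
partial = partial_corr(x, y, z)   # adjusts for treatment
print(f"pooled r = {pooled:.2f}, partial r = {partial:.2f}")
```

In this toy example the pooled correlation is positive only because both variables are higher in the treated group; within each group the relationship is negative, which the partial correlation reflects. Comparing the three sets of correlations is intended to reveal this kind of dependence on treatment.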

Responsiveness
Responsiveness of measurement, a type of correlational analysis, addresses the ability to detect change when a particular patient improves or deteriorates. We assessed this association by comparing percentage change (baseline to end of treatment) in peak VO2 with change (baseline to end of treatment) in the WHO FC (categorized by baseline FC), the PGA, and the SGA. A regression analysis was applied to examine each of those relationships, with percentage change in peak VO2 serving as the outcome or dependent variable and each of the other measures serving as a separate predictor or explanatory variable. In each bivariate analysis, a regression model was fit in two ways: with the predictor taken as a discrete categorical variable and as a continuous variable.
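The two ways of fitting the predictor can be sketched as follows, under stated assumptions: the rating scale is coded as consecutive integers with higher values indicating more improvement (the coding is not given in the source), the continuous fit is a simple least-squares line, and the categorical fit with a single predictor reduces to the mean outcome in each category. The function names and data are hypothetical.

```python
def ols_fit(x, y):
    """Least-squares slope and intercept, treating the predictor as continuous."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def category_means(x, y):
    """Mean outcome per predictor level, treating the predictor as categorical."""
    groups = {}
    for level, value in zip(x, y):
        groups.setdefault(level, []).append(value)
    return {level: sum(v) / len(v) for level, v in sorted(groups.items())}

# Hypothetical data: rating category (0 = no change ... 3 = markedly improved)
# and percentage change in peak VO2 for eight patients.
rating = [0, 0, 1, 1, 2, 2, 3, 3]
pct_change = [-1.0, 1.0, 4.0, 6.0, 9.0, 11.0, 14.0, 16.0]

slope, intercept = ols_fit(rating, pct_change)
means = category_means(rating, pct_change)
print(f"slope per category = {slope:.1f}; category means = {means}")
```

The continuous fit summarizes the relationship as one slope (the mean change in outcome per one-category improvement) and assumes linearity, whereas the categorical fit makes no such assumption and shows whether the category means actually rise in equal steps.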

Results

Reliability
The estimated ICC between screening and baseline peak VO2 was 0.79 (95% CI, 0.71-0.85; P < 0.0001), suggesting good reliability (>0.70). The same estimate of 0.79 was obtained from the Pearson correlation coefficient. Good agreement between the screening peak VO2 and baseline peak VO2 was also suggested by the Bland-Altman plot (Figure 1), which indicated no relationship between the difference in values across the 2 visits (which represents measurement error) and the mean of values across the 2 visits (which represents the true value). These results not only suggest that the measurements are reliable, but also support the use of the average of the two measurements as the baseline for statistical analyses.

Associations
Across all treatment groups combined, percentage changes in peak VO2 from baseline correlated well with the PGA and with the change in WHO FC for patients with baseline FC of I or III (Pearson correlations of ≥0.40; Table 2). Correlations of percentage changes in peak VO2 with the PGA depended on treatment group: 0.49 (95% CI, 0.30-0.64; P < 0.0001; n = 77) for sildenafil doses combined and −0.11 (95% CI, −0.46 to 0.27; P = 0.57; n = 29) for placebo. For patients with WHO FC II or I at baseline, there was little (FC II) or no (FC I) room for improvement in WHO FC. Thus, care needs to be taken with the interpretation of their data. Across all treatment groups combined, Pearson correlations of percentage changes in peak VO2 with changes in the Family Cohesion domain score and with the SGA were 0.04 and 0.12, respectively (Table 2). Results from Spearman correlations were very similar to those from Pearson correlations.
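Confidence intervals for a Pearson correlation are commonly obtained with the Fisher z-transformation. The source does not state which method was used, so the sketch below is an assumption; the function name `fisher_ci` is hypothetical. Applied to the reported correlations and sample sizes, it gives intervals that agree with the reported ones to two decimal places.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Correlations and sample sizes as reported in the text above.
sild_lo, sild_hi = fisher_ci(0.49, 77)     # sildenafil doses combined
plac_lo, plac_hi = fisher_ci(-0.11, 29)    # placebo
print(f"sildenafil: {sild_lo:.2f} to {sild_hi:.2f}")
print(f"placebo:    {plac_lo:.2f} to {plac_hi:.2f}")
```

The much wider interval for placebo reflects its smaller sample size (n = 29 versus n = 77); because the interval includes zero, no association with the PGA can be concluded for that group.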

Responsiveness
For each categorical improvement on the PGA, the mean percentage change in peak VO2 increased by 8% (assuming a linear relationship, with the PGA taken as continuous; 95% CI, 4.6%-11.5%; P < 0.0001; Figure 2). For each categorical improvement on the SGA, the mean percentage change in peak VO2 increased by 2.2% (assuming a linear relationship, with the SGA taken as continuous; 95% CI, −1.3% to 5.8%; P = 0.21; Figure 3). The mean percentage changes in peak VO2, which were larger for higher improvement categories, were statistically significant for each of the three improvement categories of the PGA (7.2, 14.1, and 32.1; P < 0.01 for each) and in the two highest improvement categories of the SGA (9.8 and 11.9; P < 0.02 for each).
For the 27 patients with WHO FC I at baseline, there was no room for improvement in FC: 3 of the 4 who deteriorated displayed a reduction in peak VO2 (Figure 4A). None of the 56 patients with WHO FC II at baseline deteriorated and only 8 improved; 6 of the 8 had an increase in peak VO2. Among all patients with WHO FC II at baseline, mean percentage change in peak VO2 increased by 4.40% (95% CI, −7.3% to 16.1%; P = 0.45; Figure 4B). None of the 21 patients with WHO FC III at baseline had a deterioration in WHO FC and 14 had an improvement, of whom 12 had an increase in peak VO2. Among all patients with WHO FC III at baseline, mean percentage change in peak VO2 increased by 24.6% (95% CI, 5.1%-44.2%; P = 0.02) for a 1-category improvement in WHO FC (Figure 4C).

Discussion
In general, the results indicate that the peak VO2 has favorable measurement properties in pediatric patients with PAH who are developmentally and physically able to perform exercise testing. The magnitude of the correlation of mean percentage change in peak VO2 with the PGA was dependent on active or placebo treatment. This is to be expected because the placebo group is likely to have a more restricted range of values (which represent measurement variability and random fluctuations over time). In contrast, the active treatment group is likely to have a wider range of values (from the additional variability of individual treatment responses). In a 16-week trial, it is not surprising that only 4 patients (all WHO FC I at baseline) reported deterioration in WHO FC. The importance of this endpoint is in the observance of improvement in WHO FC. However, for the large proportion of patients who were WHO FC I or II at baseline, there was no or limited room for improvement (unlike in WHO FC III patients). Eight of the 56 patients (14%) who were WHO FC II at baseline improved, but 14 of 21 patients (67%) who were WHO FC III at baseline improved. For these patients with WHO FC III at baseline, there was a strong positive association with percentage change in peak VO2.
It was unexpected that the percentage change in peak VO2 would share a low correlation with the SGA, and it may reflect influence by factors associated with child and parental-proxy responses and with instrument administration. A placebo response may have been observed with the SGA, in which patients (regardless of treatment group) are shifted toward a "mild improvement" response whether or not peak VO2 improves. In contrast, "markedly improved" on the SGA is unlikely to be caused by a placebo response and most such patients had clear improvement in peak VO2. This disparity can impair the correlation. The low correlation between the percentage change in peak VO2 and the SGA becomes less surprising given that a post-hoc correlation between PGA and SGA was not very high (0.39). The PGA correlated well with the change in WHO FC in the subgroup with baseline FC III but the SGA did not. The SGA is a mixture of parent and patient (child) responses, the meaning of which may be confounded, especially when the patient is young.

Conclusions
This pediatric PAH trial, the largest one to date, offered the opportunity to evaluate peak VO2 as an endpoint with regard to its correlation with other clinical endpoints, such as the WHO FC and the PGA. Peak VO2 exhibited good reliability, and improvements were associated with improvements in certain other clinical endpoints. Additional research should be conducted to further elucidate the relationship between peak VO2 and the SGA, to inform use of the SGA in this patient population. This initial assessment of the measurement properties of peak VO2 suggests it is a robust measure with utility as a primary endpoint in clinical trials for the evaluation of the effect of drug treatment in pediatric PAH.