To our knowledge, this is the first study to establish study-wide precision and accuracy for CPET measurements in multicenter trials, based on normalcy of the coupling between mechanical and metabolic power output. We found: 1) the CV of CPET measurements was reduced using the central reader assessment of BioQC exercise tests to minimize within-center variability; 2) a rigid application of a ±10 % predicted V̇O2 cut-off criterion excluded 67 % of all measurements, some of which were within the normal distribution of accurate tests; and 3) the application of a composite z-score to identify measurements lying outside normal limits increased the precision and accuracy of multicenter trial CPET measurements by ~50 %.
Efficacy of CPET QC based on within-center precision
Acceptance rate by the central reader method for all BioQC tests was 76 %. The BioQC process identified measurement errors in 12 of 15 centers before measurements of study patients were made in the parent clinical trial. In six centers, measurement error was resolved by standard troubleshooting approaches, and required only one or two additional BioQC tests to demonstrate resolution. The central reader method excluded outlying tests based on reproducibility, and effectively excluded tests departing from normality. Analysis of all tests showed non-normal distribution, whereas including only tests accepted by the central reader assessment resulted in a data set that was not significantly different from a normal distribution (Table 2). The central reading process reduced within-center CV for gas exchange and ventilation measurements to within the range 5.8 to 9.2 % (Table 1), which is within the range generally accepted for CPET studies [3, 4, 12, 31, 32].
Application of a central BioQC reader therefore provided a beneficial addition that reduced measurement variability in the multicenter trial. However, while within-center variability was below the upper limit recommended by ATS/ACCP (~10 %), variability between centers remained. The between-center CV for gas exchange measurements of all centrally accepted BioQC tests ranged from 9.6 to 12.5 % (Table 2). This residual between-center variability lessened the benefit of the BioQC method and weakened the statistical power to demonstrate a given change in CPET outcome measures in the parent clinical trial.
CPET QC based on study-wide precision and accuracy
To our knowledge, only one previous study attempted a QC regimen based on both precision and accuracy. However, the accuracy criterion used was wide (±25 %) because the approach did not account for differences in mechanical power output during treadmill walking at given speed/grade combinations among individuals differing in weight. Therefore, to develop an approach to CPET QC in multicenter trials that included study-wide precision and accuracy in the coupling of mechanical to metabolic power, we applied two different BioQC methods. The first, a simple criterion approach, rigidly excluded BioQC tests in which the accuracy of any one V̇O2 measurement (V̇O2,20W, V̇O2,70W, and V̇O2,slope) lay outside 90 to 110 % of the predicted value at the calculated power output. This approach appears inherently sensible, in that measurements outside this ±10 % range (roughly based on guideline recommendations [4, 31, 32]) are considered outlying, and thus excluded. However, a limitation is that a relatively small error in only one variable causes a BioQC test to fail, even if all other measurements are within the tolerable limit. The very low acceptance rate of tests across the 15 centers (33 %) makes application of this criterion impractical. Indeed, seven study centers would have been completely excluded from the trial, despite reporting demonstrably accurate measurements based on the retrospective analysis of the normal distribution of the BioQC measurements (Fig. 3). Thus, while the rigid criterion dramatically improved the CV of CPET measurements, it excluded data that were within the normal distribution of measurement variability (e.g., Fig. 1, Table 2).
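The all-or-nothing behavior of the rigid criterion can be illustrated with a minimal sketch. The function name, the dictionary layout and the example values are hypothetical and chosen only for illustration; they are not data from this study.

```python
# Minimal sketch of the rigid +/-10 % criterion described above.
# Names and example values are hypothetical, not taken from the study data.
LOWER, UPPER = 90.0, 110.0  # acceptance window, % of predicted V̇O2


def passes_rigid_criterion(pct_predicted):
    """Return True only if every V̇O2 measure lies within 90-110 % of predicted."""
    return all(LOWER <= value <= UPPER for value in pct_predicted.values())


# Two made-up BioQC tests, expressed as % of predicted.
test_a = {"VO2_20W": 97.0, "VO2_70W": 103.0, "VO2_slope": 108.0}
test_b = {"VO2_20W": 97.0, "VO2_70W": 103.0, "VO2_slope": 111.0}

print(passes_rigid_criterion(test_a))  # all three measures inside the window
print(passes_rigid_criterion(test_b))  # one measure is 1 point outside the window
```

In this sketch the second test is rejected on the basis of a single measure lying one percentage point outside the window, which is the behavior that made the criterion impractical across the 15 centers.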
Therefore, we developed a method that allowed small variability outside the 90 to 110 % range for predicted V̇O2, but could successfully identify outlying, non-normal measurements and reduce between-center measurement CV. The composite z-score considered equally the relative deviation from the mean of “position” (V̇O2,70W) and “slope” (ΔV̇O2/ΔWR) of the highly predictable relationship between V̇O2 and WR. By combining the error distributions of these two variables, a small deviation from predicted (i.e., between ~87 % and ~113 %) in one measurement was allowed as long as the other was accurate. We found that this method strongly predicted systematic deviations in non-normal measurements above a composite z-score of 0.67 (based on the SD of all tests; Fig. 1). In addition, a composite z-score of 0.67 coincided with local minima in the CV of absolute V̇O2 measurements (Fig. 2a) and resulted in a study-wide CV of ~6 % (Fig. 2b).
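The trade-off between the two variables can be sketched as follows. Equation 2 is not reproduced in this section, so the combination used here (the equally weighted mean of the two absolute z-scores, each centered on 100 % of predicted) is an assumption made for illustration and may not reproduce the ~87 % to ~113 % bounds exactly. The SD values are the study-wide values reported in this study; the function names and example inputs are hypothetical.

```python
# Illustrative sketch of a composite z-score acceptance rule.
# ASSUMPTION: the composite is the equally weighted mean of two absolute
# z-scores centered on 100 % of predicted. The published equation 2 may differ.
SD_POSITION = 11.0  # study-wide SD for V̇O2,70W, % of predicted (from this study)
SD_SLOPE = 13.6     # study-wide SD for V̇O2,slope, % of predicted (from this study)
Z_CUTOFF = 0.67     # composite z-score cut-off discussed in the text


def composite_z(pct_pred_position, pct_pred_slope, center=100.0):
    """Mean of the absolute z-scores for 'position' and 'slope' (assumed form)."""
    z_position = abs(pct_pred_position - center) / SD_POSITION
    z_slope = abs(pct_pred_slope - center) / SD_SLOPE
    return (z_position + z_slope) / 2.0


def accept(pct_pred_position, pct_pred_slope, cutoff=Z_CUTOFF):
    """Accept a BioQC test when its composite z-score does not exceed the cut-off."""
    return composite_z(pct_pred_position, pct_pred_slope) <= cutoff


# Made-up examples: one variable slightly off vs. both variables off.
print(accept(112.0, 100.0))  # position 12 points high, slope exactly as predicted
print(accept(112.0, 112.0))  # both variables 12 points high
```

Under this assumed form, a test with one variable 12 points from predicted is accepted when the other is accurate, whereas the same deviation in both variables is rejected; a rigid ±10 % window would reject both.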
Using z = 0.67, we identified that 52 % of the BioQC tests lay outside the normal distribution of V̇O2 measurements. While this approach excluded more tests than the central reading method, this combined precision- and accuracy-based method achieved three main benefits. Firstly, it agreed strongly with the rigid criterion-based approach (84 % agreement between methods). Secondly, it had a relatively high acceptance rate (48 %) without compromising the narrow measurement CV achieved by the criterion method (Table 2). Lastly, its CV of CPET measurements was ~50 % lower than that of the central precision-based QC approaches that form the basis of guideline recommendations. This last point is of considerable importance for the design and conduct of multicenter clinical trials with CPET measurement outcomes. By applying a z-score-based BioQC method across all centers, we suggest that measurement variability can be reduced by ~50 %, providing an increase in statistical power to detect changes in CPET measurements. Regular BioQC tests are not onerous and, until a larger BioQC data population is established, any CPET laboratory seeking to implement a QC procedure may simply apply the z-score criterion using equation 2 and the study-wide population SD values established in this study (11.0 % and 13.6 % for V̇O2,70W and V̇O2,slope, respectively). While we found the optimal z-score at 0.67, based on distribution normality, the CV of absolute V̇O2 measurements remained close to the minimum up to a z-score of ~0.90, corresponding to ~65 % of all tests and a CV for % predicted below ATS/ACCP guidelines (which occurred at z-score ~1.0). Further research is required to determine more precisely the optimal z-score within the range of ~0.67 to ~0.90 that balances the requirements of normality, minimized CV and the number of accepted tests to inform CPET studies.
Nevertheless, while using a combined z-score of 0.67 would minimize CPET measurement variability, any power calculations for future clinical trials should also account for the response variability inherent in the clinical population studied. Thus, the combined z-score approach maximizes statistical discriminatory power within multicenter trials and minimizes laboratory testing burden and study participant risk.
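To illustrate how measurement and response variability jointly affect a power calculation, the sketch below uses the standard normal-approximation formula for a two-arm parallel comparison of means. It is not an analysis from this study. It assumes that measurement and biological variances are independent and additive, and the biological CV (15 %) and target effect (10 %) are made-up values; the measurement CVs of 12 % and 6 % are rounded from the ranges discussed above.

```python
import math
from statistics import NormalDist


def n_per_group(cv_measurement, cv_biological, effect_pct, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-arm comparison of means.

    All inputs are in % of the mean. ASSUMPTION: measurement and biological
    variances are independent and add; normal approximation is used.
    """
    total_variance = cv_measurement ** 2 + cv_biological ** 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * total_variance / effect_pct ** 2)


# Hypothetical scenario: 15 % biological CV, 10 % treatment effect.
n_before = n_per_group(cv_measurement=12.0, cv_biological=15.0, effect_pct=10.0)
n_after = n_per_group(cv_measurement=6.0, cv_biological=15.0, effect_pct=10.0)
print(n_before, n_after)
```

In this hypothetical scenario, halving the measurement CV lowers the required sample size, but by less than the fourfold reduction that would apply if measurement error were the only source of variance, because the biological component is unchanged.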
Strategies to minimize measurement variability
We found a relatively high rate of measurement error over 16 months. Importantly, 8 of 15 centers (53 %) required at least one additional validation procedure after an initial failing BioQC test. Each failure triggered a CPET-system troubleshooting process, which included site technicians and central support to identify the error source. Most required involvement of the system manufacturer and eventually led to major equipment service, emphasizing the need for regular maintenance.
In addition, this justifies the recommendation for frequent and rigorously evaluated QC methods in order to prevent large unexplained variability in CPET measurements. The BioQC process used here also identified equipment error prior to equipment failure, and allowed centers to address failing components of CPET systems before trial-related measurements were scheduled. Overall, our results support the view that systematic BioQC is needed to achieve satisfactorily accurate and precise data in multicenter trials employing CPET.
A potential limitation relates to the accurate estimation of treadmill WR. External WR is calculated considering a subject’s weight but does not account for the inertia associated with body movements while walking. These inertial components increase with weight, which may reduce the accuracy of the calculated WR and of the predicted V̇O2 value upon which BioQC is based. A similar phenomenon occurs in cycle ergometry, where V̇O2 is influenced by pedaling frequency. However, a change in treadmill speed between 20 W and 70 W (1.0 mph to 1.8 mph) requires an obligatory cadence increase and thus variable internal work; in cycling, cadence can be effectively controlled [34, 35]. One solution would be to recruit BioQC subjects who are similar in weight to potential trial patients. Another solution would be to use calculations for treadmill WR that incorporate kinetic energy (mv²) instead of momentum (mv).
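The dependence of external treadmill WR on body weight can be sketched with the standard expression for the rate of work done raising body mass against gravity, WR = m·g·v·sin(θ). The exact WR calculation used in this study is not reproduced in this section, so this form is an assumption; it deliberately omits the inertial (internal work) component discussed above. Function names and the example body masses are hypothetical.

```python
import math

G = 9.81             # gravitational acceleration, m/s^2
MPH_TO_MS = 0.44704  # miles per hour to meters per second


def treadmill_wr(mass_kg, speed_mph, grade_pct):
    """External work rate (W) against gravity: m * g * v * sin(theta).

    ASSUMPTION: vertical work only; inertial/internal work is ignored.
    """
    theta = math.atan(grade_pct / 100.0)  # grade is rise over run
    return mass_kg * G * speed_mph * MPH_TO_MS * math.sin(theta)


def grade_for_wr(mass_kg, speed_mph, target_w):
    """Grade (%) needed to reach target_w at a given speed (inverse of treadmill_wr)."""
    sin_theta = target_w / (mass_kg * G * speed_mph * MPH_TO_MS)
    if not 0.0 <= sin_theta < 1.0:
        raise ValueError("target work rate is not reachable at this speed")
    return 100.0 * math.tan(math.asin(sin_theta))


# Made-up body masses: grade needed for 70 W at 1.8 mph.
for mass in (70.0, 100.0):
    print(mass, round(grade_for_wr(mass, 1.8, 70.0), 1))
```

Because the required grade falls as body mass rises, two subjects at the same nominal 70 W walk at different inclines, and any unmodeled inertial cost is not captured by this calculation.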
Another limitation is that we used an equation to predict metabolic rate that was originally developed for cycle ergometry. However, within the speed range used in this study, measured and predicted metabolic rates show strong agreement. V̇O2 prediction was based on a ΔV̇O2/ΔWR of 10.1 mL/min/W [26, 27]. A range of studies support this value, e.g., 10.2 ± 1.0 mL/min/W and 9.9 ± 0.7 mL/min/W, although it is recognized that a greater value may be seen in endurance-trained individuals. While in this study we found that ΔV̇O2/ΔWR averaged 10.6 mL/min/W, this mean was derived from only 15 individuals who performed the BioQC and was within the normal range. We found that post hoc adjustment of the target ΔV̇O2/ΔWR between 10.1 mL/min/W and 10.6 mL/min/W excluded only one additional BioQC test and had no effect on the optimal z-score range. Nevertheless, equations that better calculate treadmill WR to improve the accuracy of the ΔV̇O2/ΔWR target, or exercise modalities such as cycling in which WR can be better controlled, would likely further improve the precision and accuracy provided by the composite z-score BioQC method.
The BioQC method relies on the attainment of steady-state metabolic responses below the lactate threshold at 70 W. This may require verification by an additional incremental exercise test for non-invasive lactate threshold estimation, and/or the use of a lower WR for less aerobically fit or smaller individuals.
The QC method assesses instrumental measurement precision and accuracy (as opposed to physiologic variability) from submaximal steady-state CPET measurements, because the variability of predicted values for healthy participants is low within this domain. However, clinical trials typically assess both submaximal and maximal values from CPET measurements. Therefore, the instrumental measurement precision determined in this study may not necessarily reflect the instrumental precision of peak measurements, where breathing frequency is greater, and the response times of the gas analyzers become increasingly important. Nevertheless, quality assurance of the integrated CPET measurement system linking the mechanical and metabolic power output within the submaximal domain should also contribute to improving assurance of multicenter instrumental precision and accuracy of maximal CPET measurements.