Evidence synthesis in pulmonary arterial hypertension: a systematic review and critical appraisal

Background: The clinical landscape of pulmonary arterial hypertension (PAH) has evolved in terms of disease definition and classification, trial designs, available therapies and treatment strategies as well as clinical guidelines. This study critically appraises published evidence synthesis studies, i.e. meta-analyses (MA) and network meta-analyses (NMA), to better understand their quality and validity and to discuss the impact of the findings from these studies on current decision-making in PAH.
Methods: A systematic literature review was conducted to identify MA/NMA studies considering approved and available therapies for treatment of PAH. Embase, Medline and the Cochrane Database of Systematic Reviews were searched from database inception to April 22, 2020, supplemented by searches in health technology assessment websites. The International Society for Pharmacoeconomics and Outcomes Research (ISPOR) checklist covering six domains (relevance, credibility, analysis, reporting quality and transparency, interpretation and conflict of interest) was selected for appraisal of the included MA/NMA studies.
Results: Fifty-two full publications (36 MAs, 15 NMAs, and 1 MA/NMA) in PAH met the inclusion criteria. The majority of studies were of low quality, with none of the studies being scored as ‘strong’ across all checklist domains. Key limitations included the lack of a clearly defined, relevant decision problem, shortcomings in assessing and addressing between-study heterogeneity, and an incomplete or misleading interpretation of results.
Conclusions: This is the first critical appraisal of published MA/NMA studies in PAH, suggesting low quality and validity of published evidence synthesis studies in this therapeutic area. Besides the need for direct treatment comparisons assessed in long-term randomized controlled trials, future efforts in evidence synthesis in PAH should improve analysis quality and scrutiny in order to meaningfully address challenges arising from an evolving therapeutic landscape.


Background
Pulmonary arterial hypertension (PAH) is a rare and debilitating chronic disease of the pulmonary vasculature [1]. Disease progression is characterized by increasing pulmonary vascular resistance (PVR) and non-specific symptoms (e.g., dyspnoea during exercise, fatigue, chest pain, and light-headedness) and ultimately leads to right heart failure and premature death [1,2]. Prior to the availability of PAH-specific therapies, median survival time was documented as 2.8 years in US patients with PAH [3]. Five-year survival rate in newly diagnosed patients is reported to be 61.2% [4].
The treatment of PAH is guided by an evidence-based treatment algorithm published by the European Society of Cardiology and European Respiratory Society (ESC/ERS) [2]. The overall treatment goal is to achieve a low-risk status, associated with World Health Organization (WHO) Functional Class II, good exercise capacity (> 440 m in the 6-min walking distance test), and good right-ventricular function assessed using echocardiography. The latest guidance and proceedings (see Figure S1 in the electronic supplementary material) recommend either monotherapy or initial oral combination therapy for treatment-naïve patients at a low or intermediate risk of clinical worsening or death [2,6]. Because oral therapies are recommended for these patients, endothelin receptor antagonists (ERA) and phosphodiesterase type 5 inhibitors (PDE-5I) are generally used as first-line treatment. For patients who fail to achieve an adequate clinical response (i.e. a low-risk status after 3 to 6 months) with initial therapy, treatment with sequential double or triple combination therapy is recommended. For high-risk treatment-naïve patients, an initial combination therapy regimen including a drug targeting the prostacyclin (PGI2) pathway requiring continuous intravenous (IV) administration is indicated.
A lack of head-to-head treatment comparisons in randomized controlled trials (RCTs) has complicated clinical decision-making in PAH. As a result, a multitude of meta-analyses (MA; the synthesis of evidence from the same treatment comparisons assessed in clinical trials [7]) and network meta-analyses (NMA; the synthesis of both direct and indirect evidence to allow treatment comparisons that have not been directly assessed in clinical trials [7]) in PAH have been conducted.
Given the absence of direct RCT comparisons and the evolution of disease definition, classification, trial designs, available therapies and treatment guidelines, it is important to better understand the quality of published MA and NMA in PAH and their alignment with clinical decision-making today. The objective of the study was to critically appraise the quality and validity of published MA and NMA studies in PAH and explore the impact of the findings from these studies on current decision-making.
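To make the notion of indirect evidence concrete, the following minimal sketch shows an adjusted indirect comparison of two treatments A and B through a common comparator C (often attributed to Bucher and colleagues). It is an illustration under stated assumptions, not the method of any appraised study: the function name and the numerical inputs are hypothetical, both estimates are assumed to be on the same effect scale, and the two sets of trials are assumed to be independent and similar with respect to effect modifiers.

```python
import math


def bucher_indirect(d_ac, se_ac, d_bc, se_bc, z=1.96):
    """Indirect estimate of A vs B from A vs C and B vs C.

    d_ac, d_bc: pooled effects of A and B versus the common comparator C
                (same scale, e.g. mean difference or log odds ratio).
    se_ac, se_bc: their standard errors.
    Returns the indirect effect, its standard error and a 95% interval.
    """
    d_ab = d_ac - d_bc                        # comparator effect cancels out
    se_ab = math.sqrt(se_ac**2 + se_bc**2)    # variances add (independent trials)
    return d_ab, se_ab, (d_ab - z * se_ab, d_ab + z * se_ab)


# Hypothetical mean differences in 6MWD (metres) versus placebo
d_ab, se_ab, ci = bucher_indirect(d_ac=40.0, se_ac=10.0, d_bc=25.0, se_bc=12.0)
print(f"A vs B: {d_ab:.1f} m (SE {se_ab:.1f}), 95% CI {ci[0]:.1f} to {ci[1]:.1f}")
```

Because the variances of the two direct estimates add, an indirect estimate is always less precise than either of its inputs, which is one reason indirect evidence cannot fully replace head-to-head trials.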

Search strategy and data collection
A systematic literature review was conducted according to the recommendations of the Cochrane Collaboration [8] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [9], to identify published evidence synthesis (i.e. MA and NMA) studies in PAH.
Searches were conducted from database inception to September 12, 2018 and updated on April 22, 2020 in Embase, Medline (including Medline-In-Process) and the Cochrane Database of Systematic Reviews via OVID, in line with the National Institute for Health and Care Excellence (NICE) technology appraisal guidelines and recommendations from the Centre for Reviews and Dissemination and the Cochrane Collaboration [10-12]. Supplementary searches included websites of selected health technology assessment agencies.
Retrieved records were assessed by one reviewer against the pre-specified PICOS criteria (Table S1 in the electronic supplementary material) and unblinded assessments were double checked by a second reviewer. Any discrepancies were resolved through discussion with a third reviewer. Studies were included if they met the following criteria: 1) adult patients with any etiology of PAH (pulmonary hypertension (PH) Group 1) [2], 2) at least two approved and available therapies or drug classes for treatment of PAH (to allow assessment of relative efficacy and safety of compared treatments), 3) full-text MA/NMA report. Details of the search methodology are provided in Tables S2a-h in the electronic supplementary material. Key baseline characteristics of patients with PAH from the included RCTs were extracted to explore the extent of heterogeneity across the trials.

Study appraisal
A targeted review of published checklists for evidence synthesis studies was conducted. Checklists published by NICE [13], ISPOR [14], PRISMA [15] and GRADE [16] were identified. Criteria for checklist selection included: (i) domains covered, such as relevance of research question, methods for establishing the evidence base, assessment for internal validity, statistical methods, and reporting of results; (ii) suitability to present context, including applicability to different forms of evidence synthesis; (iii) generalizability; and (iv) acceptability and recognition of the checklist. The ISPOR checklist was deemed the most appropriate as it covers all domains listed in the checklist selection criteria, is suited to the study objective and is applicable to different types of evidence synthesis.
Complementary questions were added to the 26-item ISPOR checklist with questions specific to the disease area and/or study objective. These additional questions are marked as such in the study assessment provided in Table S3 in the electronic supplementary material.
The ISPOR checklist provides for a quality grading whereby an overall assessment of 'strong', 'neutral' or 'weak' is given for each of the six domains (i.e. relevance, credibility, analysis, reporting quality & transparency, interpretation, conflict of interest). However, no explicit criteria are provided for scoring each domain. A set of criteria specific to each domain for quality grading was therefore adopted which is described in Table 1. Study appraisals by one reviewer were double checked by a second reviewer.
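The count-based part of the grading rules in Table 1 can be sketched as a small function. This is an illustrative sketch only: the function name, the item identifiers and the example answers are hypothetical, the thresholds are paraphrased from Table 1, and the qualitative judgements that also enter the grading (for example in the interpretation and conflict of interest domains) are not modeled.

```python
# Minimum number of unsatisfactory items that yields a 'weak' grade,
# paraphrased from Table 1 (count-based domains only).
WEAK_THRESHOLD = {
    "relevance": 3,
    "credibility": 3,
    "analysis": 3,
    "reporting quality & transparency": 2,
}


def grade_domain(domain, item_ok, nma_only=(), is_nma=True):
    """Grade one checklist domain as 'strong', 'neutral' or 'weak'.

    item_ok:  dict mapping item id -> True if addressed satisfactorily.
    nma_only: ids of items that apply to NMA studies only; these are
              ignored when grading an MA study (is_nma=False).
    """
    applicable = {k: ok for k, ok in item_ok.items()
                  if is_nma or k not in nma_only}
    failed = sum(1 for ok in applicable.values() if not ok)
    if failed == 0:
        return "strong"
    if failed >= WEAK_THRESHOLD[domain]:
        return "weak"
    return "neutral"


# Hypothetical reporting-domain answers: six items, four of them NMA-only.
answers = {"R1": False, "R2": True, "R3": True,
           "R4": False, "R5": True, "R6": True}
nma_items = ("R3", "R4", "R5", "R6")
domain = "reporting quality & transparency"

grade_as_nma = grade_domain(domain, answers, nma_items, is_nma=True)
grade_as_ma = grade_domain(domain, answers, nma_items, is_nma=False)
print(grade_as_nma, grade_as_ma)
```

In this hypothetical case the same answers give a different grade depending on study type, because the NMA-only item R4 is not counted for an MA study.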

Study characteristics
A total of 52 MA and NMA studies met the inclusion criteria and were retained for data extraction and quality appraisal. From electronic database searches, 51 full-text publications were included. From the hand-search of publicly available websites of health technology assessment bodies, one report of the Canadian Agency for Drugs and Technologies in Health was included. The PRISMA diagram in Figure S2a-b (see electronic supplementary material) presents the search results.

Quality appraisal
The quality assessment of the included studies is summarized in Fig. 2 by overall judgement (strength, neutral, weakness) against each domain of the checklist; Table 3 gives the number of studies scoring each judgement in each domain. The detailed quality assessments are presented in Table S3 in the electronic supplementary material.

Relevance
Of the 52 studies reviewed, eight were scored as strong in terms of relevance, 26 as neutral, and the remaining 18 as weak.
Very few studies fulfilled the checklist item about the extent to which an evidence synthesis study is informative to decision makers today and aligned with current clinical practice and guidelines. Several papers did not explicitly state the research question or decision problem guiding the analysis [18,21,29,33,42,53,59]. Several other studies failed to justify the focus of their research question [17,18,20,21,25,31,32,40,44,58,60-64]. For example, some studies formulated research questions with a very narrow scope (e.g. oral treatments [17,20,21,32,60,62,63]) or included trials with non-PAH populations [34,43,44], therefore precluding determination of the optimal choice of therapy based on a comparison of all available treatment options. Some studies included unapproved or withdrawn treatments, while several studies drew conclusions at odds with current knowledge, guidelines and clinical practice. For

Table 1 Domain-specific criteria for quality grading
Relevance
Weak: At least three of the six checklist items suggested study shortcomings, for example omission of relevant therapies in the analysis, omission of relevant outcomes for evidence synthesis, or inclusion of patients outside the target population.
Neutral: One or two checklist items were not addressed satisfactorily; no or insufficient justification for a particular analysis approach was provided (e.g. inclusion of oral therapies only without justification).
Strong: All checklist items were appropriately addressed.
Credibility
Weak: Information was omitted or insufficient for at least three of the nine checklist items, for example omission of key databases in the SLR, omission of a quality assessment of included studies, or lack of identification of imbalances in the distribution of key effect modifiers prior to the analysis.
Neutral: One or two checklist items were not addressed satisfactorily, for example an adequate search strategy but no transparent reporting of the full search strings, or lack of reporting of the results of the quality assessment.
Strong: All checklist items were addressed appropriately.
Note: The checklist domain 'credibility' includes one question only applicable to NMA studies; this question was not considered for the domain grading of MA studies.
Analysis
Weak: At least three of the 10 checklist items suggested study shortcomings, such as lack of subgroup analyses or meta-regression in cases of between-study heterogeneity, pooling of drug classes, treatments or doses without proper justification, or lack of a valid rationale for the use of random effects or fixed effect models.
Neutral: One or two checklist items were not addressed satisfactorily, such as insufficient detail on the statistical model.
Strong: All checklist items were addressed appropriately.
Note: The checklist domain 'analysis' includes four questions only applicable to NMA studies; these questions were not considered for the domain grading of MA studies.
Reporting quality & transparency
Weak: At least two of the six checklist items were not addressed satisfactorily, or discussion of the impact of important patient characteristics on treatment effects was not included.
Neutral: Insufficient information was provided for one checklist item, or only a brief discussion of the impact of patient characteristics on analysis results was provided.
Strong: All checklist items were addressed appropriately.
Note: The checklist domain 'reporting quality & transparency' includes four questions only applicable to NMA studies; these questions were not considered for the domain grading of MA studies.
Interpretation
Weak: Results were not contextualized with consideration of limitations, or specific treatments were endorsed over others despite a lack of discussion of between-study heterogeneity and/or despite pooling of active therapies.
Neutral: Study limitations (e.g. between-study heterogeneity) were provided, however without a detailed discussion of the impact these may have had on observed study results.
Strong: All these aspects were addressed appropriately.
Conflict of interest
Weak: No information on conflicts of interest was provided, or details of author disclosures and contributions were insufficient.
Neutral: Disclosures as well as author contributions were clearly stated in cases of personal or financial relationships or affiliations that could have biased the work in question.
Strong: No personal or financial relationships or affiliations (that could have biased the study) were declared.
MA meta-analysis, NMA network meta-analysis, SLR systematic literature review

[72], AMBITION [73], GRIPHON [74]). Such inconsistencies across studies challenge a robust interpretation of results for decision makers concerned with a comprehensive assessment of all approved treatments, given the dearth of direct comparisons in RCTs.

Credibility
Of the 52 studies reviewed, six were scored as strong in terms of credibility, 18 as neutral, and the remaining 28 as weak.
The majority of studies attempted to identify all relevant RCTs. Some studies did not search all of the most relevant databases, i.e. MEDLINE, Embase, CENTRAL [18,29,32,34,35,43,44,50,51,66]. Several studies did not provide details of the search strategy [18-21, 24-26, 29, 31, 32, 35, 36, 39, 40, 43-45, 48, 50-53, 57-59, 61, 65-67] and one study provided no details on either the search strategy or the databases searched [56]. The proposed methodology was found to be relevant to answer the decision problem in almost all included studies. Some studies did not conduct a quality assessment of included RCTs [18,24,28,43,53,56,57]. Several studies did not provide the results of the RCT quality assessment or discuss implications for the analysis in case of poor quality RCTs [21,23,30,31,36,39,45,60,61,66].
a Although Badiani 2015 reported that prostanoids with IV/inhaled/SC ROA were considered for evaluation, trials on prostanoids with these ROAs were not included in the analysis. No justification was provided.
b In Fox 2011, sitaxsentan, ambrisentan and vardenafil were included in the search strategy of the review; however, trials with these therapies were not included in the analysis. No justification was provided.
c In Zheng 2014a, trials on sitaxentan were excluded from the analysis as it was withdrawn from the market due to liver toxicity. The trial on selexipag was also excluded, but no justification was provided for the exclusion.
Given the absence of randomization across the RCTs included in an MA or NMA, the assessment of effect modifiers is essential to validate assumptions around homogeneity, consistency and transitivity [75,76]. Effect modifiers are study and patient characteristics associated with treatment effects, capable of modifying (positively or negatively) the observed effect of a risk factor on disease status. Potential effect modifiers in PAH include patient baseline characteristics such as 6MWD, WHO functional class, disease duration, background therapies and etiology; and study design characteristics such as study duration and imputation rules. As the overview of design and patient baseline characteristics of included PAH RCTs (see Fig. 1a-c; Figure S3a-d in the electronic supplementary material) demonstrates, substantial between-study heterogeneity is a feature of every evidence synthesis study in PAH. The majority of studies did not offer a comprehensive assessment prior to analysis or identify imbalances in effect modifiers across the RCTs [17, 18, 20, 21, 23-27, 30, 32, 34, 39, 43-46, 49, 51, 52, 56-64, 66, 67, 69].
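As an illustration of how statistical between-study heterogeneity is commonly quantified, the sketch below computes the inverse-variance fixed-effect pooled estimate, Cochran's Q and the I² statistic. It is a minimal sketch under stated assumptions rather than the approach of any appraised study: the function name and the trial-level inputs are hypothetical, and a statistical summary of this kind complements, but does not replace, a qualitative comparison of effect modifiers across trials.

```python
def heterogeneity(effects, std_errors):
    """Fixed-effect pooled estimate, Cochran's Q and I-squared (%).

    effects:    trial-level effect estimates on a common scale.
    std_errors: their standard errors.
    """
    weights = [1.0 / se**2 for se in std_errors]           # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributed to heterogeneity,
    # truncated at zero when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical mean differences in 6MWD (metres) from three trials
pooled, q, i2 = heterogeneity([10.0, 30.0, 50.0], [10.0, 10.0, 10.0])
print(f"pooled = {pooled:.1f} m, Q = {q:.2f}, I2 = {i2:.0f}%")
```

With these hypothetical inputs the trials disagree more than their standard errors would suggest, which is the situation in which subgroup analysis or meta-regression on candidate effect modifiers would typically be considered.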

Analysis
Of the 52 studies reviewed, five were scored as strong in terms of analysis, 20 as neutral, and the remaining 27 as weak.
Preservation of study randomization of included RCTs was fulfilled by almost all included studies except in five studies with single-arm [36,39,54], retrospective comparative [35] or open-label extension design [56]. Several MAs adopted an approach whereby, for multi-arm trials, the control group was split and the sample size halved [34,37,58,65]. Though outlined in the Cochrane Handbook for Systematic Reviews of Interventions [12], this approach effectively breaks randomization and should therefore be avoided. Other forms of evidence synthesis (e.g. NMA) are more appropriate in this case. Of the included NMA studies with closed loops, most assessed the consistency between the direct and indirect evidence [13,14,48,57,62].
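One simple way to assess consistency in a closed loop is to compare the direct estimate of a contrast with its indirect estimate and test whether the difference is compatible with zero. The sketch below is illustrative only: the function name and inputs are hypothetical, the direct and indirect estimates are assumed to be independent and on the same scale (for example log hazard ratios), and it is not implied that the appraised NMA studies used this particular test.

```python
import math


def inconsistency_test(d_direct, se_direct, d_indirect, se_indirect):
    """Difference between direct and indirect estimates with a z-test.

    Returns the difference, the z statistic and a two-sided p-value
    based on a normal approximation.
    """
    diff = d_direct - d_indirect
    se_diff = math.sqrt(se_direct**2 + se_indirect**2)
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return diff, z, p_value


# Hypothetical log hazard ratios for the same contrast
diff, z, p_value = inconsistency_test(0.50, 0.30, 0.10, 0.40)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

A non-significant result should be read cautiously: with few trials per comparison, such tests have low power, so failing to detect inconsistency does not demonstrate that the network is consistent.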
Lastly, several studies pooled treatments at the class level, usually without sound justification for the assumption of a class effect. Very few studies refrained from lumping treatments, doses and co-treatments together [28, 47, 48, 53-55, 60, 62].

Reporting quality & transparency
Of the 52 studies reviewed, seven were scored as strong in terms of their reporting quality and transparency, 22 as neutral, and the remaining 23 as weak.
All included NMA studies presented a network diagram, except Zhang et al. 2016 [61]. Two of the 11 included NMA studies did not present details of the number and/or RCTs per pairwise comparison [18,30]. Separate reporting of direct and indirect comparisons was omitted in six NMA studies [18,25,30,48,54,67]. A ranking of interventions according to the reported treatment effects was provided by two-thirds of the included NMA studies [18,25,33,42,48,55,57,61,62,67], some of which did not report associated uncertainty measures. The reporting of all pairwise contrasts between interventions, along with measures of uncertainty, was not adhered to by two of the 11 NMA studies [18,54].

Interpretation
Overall, 15 of the 52 studies reviewed were scored as strong in terms of their interpretation of study findings, 23 as neutral, and the remaining 14 as weak.

Conflict of interest
Among included studies, 22 were scored as strong in terms of conflict of interest, 16 as neutral, and the remaining 14 as weak.
Fewer than a third of all assessed studies provided either no information about conflicts of interest or insufficiently detailed author disclosures. Other studies reported no personal or financial relationships, or clearly stated author contributions in case of personal or financial relationships or affiliations that could have biased the respective study.

Discussion
The objective of this study was to systematically appraise all identified MA/NMA studies in PAH and assess their quality given that such studies are taken into consideration for evidence-based decision-making. To our knowledge, this is the first study of this type in PAH. Overall, the appraisal found most evidence synthesis studies to be of low quality.
Most included evidence syntheses were found not to have defined a decision problem (i.e. the research question underpinning a study), population, selection of comparisons and outcome selection compatible or aligned with current clinical practice and treatment guidelines [2,79]. Of note, the majority of the studies [18-26, 29, 30, 32, 34, 36, 40, 43-47, 49, 52, 53, 55-58, 60-64, 66, 67] included trials that do not reflect today's clinical practice. For example, the BREATHE-2 [80] and PACES [81] trials investigated bosentan and sildenafil, respectively, as add-on therapy to IV epoprostenol. By contrast, PAH management today typically involves initiation of oral therapy with an ERA and/or PDE-5I in low- or intermediate-risk patients, who comprise the vast majority of patients, whereas parenteral prostacyclins would only be considered or added for high-risk patients [6].
Notably, clinical trial design has evolved from a preponderance of small, short-term and often open-label studies in treatment-naïve patients with severe PAH to larger, longer-term and event-driven trials (such as COMPASS-2 [82], SERAPHIN [72], AMBITION [73], GRIPHON [74]) in largely treatment-experienced and less severe patient populations. Similarly, primary endpoint definition has gradually shifted from improvement in 6MWD to morbidity and mortality as a composite endpoint (with components such as all-cause death, PAH-related hospitalization or disease worsening), which is considered to be a more patient- and clinically relevant endpoint [83-85].
While these changes in trial design and PAH management pose challenges for studies synthesizing evidence generated across such large time spans, a transparent interpretation of findings in recent MA/NMA studies in relation to present clinical practice and guidance was found to be lacking.
A related shortcoming of appraised studies is the choice of outcomes analyzed, which was found to be selective, incomplete, and usually not accompanied by clear justification. The most commonly assessed outcome was 6MWD, despite failure of multiple studies to consistently establish significant associations between 6MWD and clinically more relevant outcomes such as PAH-related hospitalization, lung transplantation, initiation of rescue therapy or death [28,29,43,50,86,87]. Moreover, the assessed evidence synthesis studies generally presented neither a review of the outcome definitions and outcome measures of included trials nor an assessment of imputation rules for handling missing data.
Mortality was less commonly assessed, which reflects the inherent challenges in designing clinical trials of PAH therapies to detect statistically significant or clinically meaningful differences in mortality. Replication of earlier trials (e.g. Barst 1996 [78]) showing survival benefit over a very short time period and placebo-controlled RCTs comparing monotherapy with no therapy in treatment-naïve patients would be considered unethical today.
Another crucial drawback in most included studies is the lack of a thorough assessment of key effect modifiers prior to the analysis. As the graphs presenting patient baseline characteristics across PAH trials demonstrate (see Fig. 1a-c; Figure S3a-d in the electronic supplementary material), there is marked between-study heterogeneity. One recurring observation was that most evidence synthesis studies included a mix of PAH and non-PAH patient populations, as in the aerosolized iloprost randomized (AIR) study [88], which included PAH and chronic thromboembolic pulmonary hypertension (CTEPH) patients.
Only a handful of studies sought to address such potential systematic differences in the effect modifiers by means of subgroup/sensitivity analyses or meta-regression. This may be due to the limited subgroup data available from published PAH RCTs, and to the smaller sample sizes associated with subgroup data, which result in wider uncertainty estimates and a lower likelihood of detecting significant relative treatment effects.
In terms of results synthesis, several studies were found to pool treatments at the drug class level. Best-practice guidelines in evidence synthesis, such as NICE DSU TSD 7 [13], recommend against pooling treatment doses or treatments into drug classes since characteristics of the underlying trial population or efficacy/safety trial results may be different.
This review has some limitations. A thorough assessment of the quality of MA/NMA studies is limited by the heterogeneity across included trials. A detailed assessment of between-study heterogeneity in each included MA/NMA was beyond the scope of the review. Nevertheless, a preliminary assessment of patients' baseline characteristics of all PAH trials included across the appraised MA/NMA studies was considered reflective of most studies. Results or analyses relating to PAH subgroups by etiology, severity or age were not explored further because no or very few studies focused on these specific sub-populations.

Conclusion
This is the first critical appraisal of published MA/NMA studies in PAH, suggesting overall low quality and validity of efforts synthesizing PAH evidence. As our study demonstrates, this has important implications for clinical decision-making and future research. First, the choice of optimal therapy to maximize patient outcomes should also be guided by a consideration of the limitations of published MA/NMA studies highlighted in this study. Second, future attempts at evidence synthesis in PAH should improve the level of validity and scrutiny to meaningfully address challenges arising from an evolving therapeutic landscape. This should include the definition of decision problems that are aligned with today's clinical practice and treatment guidelines, justification of key analysis assumptions, a comprehensive interrogation of the evidence base prior to analysis, use of individual patient data to mitigate issues of heterogeneity, and a transparent presentation of results and associated uncertainty measures for all relevant outcomes.