Targeted high-throughput sequencing of candidate genes for chronic obstructive pulmonary disease

Background Reduced lung function in patients with chronic obstructive pulmonary disease (COPD) is likely due to both environmental and genetic factors. We report here a targeted high-throughput DNA sequencing approach to identify new and previously known genetic variants in a set of candidate genes for COPD. Methods Exons in 22 genes implicated in lung development as well as 61 genes and 10 genomic regions previously associated with COPD were sequenced using individual DNA samples from 68 cases with moderate or severe COPD and 66 controls matched for age, gender and smoking. Cases and controls were selected from the Obstructive Lung Disease in Northern Sweden (OLIN) studies. Results In total, 37 genetic variants showed association with COPD (p < 0.05, uncorrected). Several variants previously discovered to be associated with COPD from genetic genome-wide analysis studies were replicated using our sample. Two high-risk variants were followed-up for functional characterization in a large eQTL mapping study of 1,111 human lung specimens. The C allele of a synonymous variant, rs8040868, predicting a p.(S45=) in the gene for cholinergic receptor nicotinic alpha 3 (CHRNA3) was associated with COPD (p = 8.8 x 10−3). This association remained (p = 0.003 and OR = 1.4, 95 % CI 1.1-1.7) when analysing all available cases and controls in OLIN (n = 1,534). The rs8040868 variant is in linkage disequilibrium with rs16969968 previously associated with COPD and altered expression of the CHRNA5 gene. A follow-up analysis for detection of expression quantitative trait loci revealed that rs8040868-C was found to be significantly associated with a decreased expression of the nearby gene cholinergic receptor, nicotinic, alpha 5 (CHRNA5) in lung tissue. Conclusion Our data replicate previous result suggesting CHRNA5 as a candidate gene for COPD and rs8040868 as a risk variant for the development of COPD in the Swedish population. Electronic supplementary material The online version of this article (doi:10.1186/s12890-016-0309-y) contains supplementary material, which is available to authorized users.


Background
Chronic obstructive pulmonary disease (COPD), characterised by a persistent airflow obstruction [1], is a lifethreatening disease accounting for 6 % of all deaths globally in 2012 [2]. The development of the disease is influenced by environmental determinants, most commonly cigarette smoking, genetic risk factors and possible genetic protective factors [3]. Candidate gene association studies have suggested several potential COPD susceptibility genes, and genome-wide association studies (GWAS) have identified multiple COPD susceptibility loci [4]. However, genetic mapping in families with high penetrance for a disease gene variant can be helpful in pinpointing new susceptibility genes even for multifactorial traits. Recently, we reported mutations in the gene for fibroblast growth factor 10 (FGF10) involved in lung development, as a possible cause of COPD in families from Sweden [5]. Hence, a monogenic form of COPD could result from mutations in FGF10. To date, the only other known monogenic form of COPD is alpha 1-antitrypsin deficiency caused by disruption of the alpha-1-antiproteinase (SERPINA1) gene [6].
Typically in GWAS, common polymorphisms are tested for association. In this study, we provide an alternative approach with the aim to perform an in-depth analysis of exons of candidate genes for COPD by using high-throughput sequencing. This allowed us to detect the full spectrum of single nucleotide variation at any frequency in selected genomic regions and to also capture variants with a potential functional effect on gene expression levels. We show here that targeted high throughput sequencing using a well-defined populationbased case-control sample can i) assess the impact of common variants in genes important for lung development, and ii) test genetic variants in a large set of candidate genes and genomic regions for association with COPD. To accomplish this we captured and sequenced 22 genes implicated in lung development as well as 61 genes and 10 genomic regions previously associated with COPD. The sample used here is comprised of cases and controls from The Obstructive Lung Disease in Northern Sweden (OLIN) studies. The population in northern Sweden, an admixture of three different ethnic groups (Swedes, Finns and Saami), showed a dramatic growth of population size since the 18 th century from a relatively small founder population [7]. This resulted in founder effects that significantly reduced the heterogeneity of this population, making it suitable for genetic association studies of multifactorial phenotypes, such as COPD [8].
This study assessed Swedish COPD cases and controls and assessed detected variants in candidate genes for association with COPD. We replicated a previous described association signal in CHRNA3, which also associated with lower CHRNA5 gene expression. The DNA capture design and targeted sequencing used here show potential to detect known single nucleotide variants in association with COPD with the additional potential to also detect low-frequent variants. The result presented here using the relative limited sample size could be replicated using our targeted capture design in larger samples from different populations.

Patient material and ethics statement
The OLIN studies are an on-going research program focused on asthma, allergy and COPD. It started 30 years ago [9] and now involves more than 50,000 subjects from northern Sweden. Within OLIN, a COPD-cohort was identified at re-examination of several cohorts in 2002-2004 [10]. At recruitment, COPD (n = 993) was defined using the fixed ratio of FEV 1 / FVC < 0.70 (forced expiratory volume in 1 s / forced vital capacity). When calculating the ratio FEV 1 / FVC, the highest values of FEV 1 and the highest value of forced vital capacity (FVC) or slow vital capacity (SVC) were used. This has support in the GOLD documents [1] and is acknowledged in the recent ERS task force guidelines for epidemiological studies on COPD [11]. An age and gender matched control population (n = 993) without obstructive lung function impairment was also recruited [10]. Since 2005 the OLIN COPD cohort with corresponding controls is followed up annually with a basic program including spirometry and interviews regarding symptoms and morbidity [12]. We initially selected 96 COPD cases (18 non-smokers, 43 former smokers and 35 smokers) from those who had an FEV 1 < 80 % of predicted value in 2005 and either FEV 1 / FVC < LLN (lower limit of normal) in 2010 or were rapid decliners with an annual FEV 1 decline of ≥ 60 ml between 2005 and 2010. We also identified a set of 96 age-and gender-matched controls (33 smokers and 63 former smokers) with normal lung function. These 96 cases and 96 controls are henceforth termed the OLIN discovery sample (Table 1). Furthermore, we defined an OLIN replication sample consisting of individuals from the OLIN COPD study for which DNA was available (n = 1,534). From this group we classified individuals as cases when FEV 1 / FVC was lower than LLN in 2010, or if they had a yearly FEV 1 decline from 2005 to 2010 of at least 60 ml (n = 256). The remaining individuals were used as P values for differences in parameters between cases and controls were calculated using two-sided Student's t-tests, assuming equal variance. The ethics board of Umeå University (Dnr 04-045 M, supplement approved 2005-06-13) approved the use of individual phenotypic data and DNA samples for genetic research.

Sequencing and quality controls
In total 22 genes implicated in lung development, 61 genes and 10 genomic regions previously associated with COPD (Additional files 1 and 2), were investigated using targeted sequencing of captured genomic regions (HaloPlex Protocol Version A, Agilent Technologies, Santa Clara, CA). Regions of 1.5 kb of genomic sequence, including specific intergenic polymorphisms, was also included in the design. The regions of interest (ROI) were designed to target all known exons of major/known transcripts and at least 20 base pairs (bp) of intronic sequences flanking each exon. The sequence capture design included 953 target regions spanning 204.384 bp with 95.9 % (196.066 bp) coverage an average. Captured genomic regions were subjected to high throughput paired-end (100 bp read module) sequencing (HiSeq2000, Illumina, San Diego, CA) at the Science for Life Laboratory in Uppsala, Sweden. Sequence reads were aligned to the hg19 reference genome and single nucleotide variants (SNVs) were called using GATK Unified Genotyper (GATK bundle v.2.2) [13]. Next, we enriched for high quality SNVs by removing SNVs with low confidence (QD < 1.5), Phred scaled quality score (<50) and SNVs within SNV clusters. These high quality variants are henceforth referred to as 'variants'. We also removed individual cases and controls with sequencing read depth consistently < 10 reads. The strategy for filtering and quality controls are illustrated in Fig. 1.

Statistical analyses
Test for genetic association and genetic effect was performed for each predicted variant separately using the discovery sample. In addition, rs8040868 and rs11728716 were tested for association using the available OLIN sample (replication sample) as, according to RegulomeDB, these two markers present with potential functional effects on gene regulation ( Table 2). Tests for allelic association of individual variants with COPD were performed using the Fisher's exact test. Results were considered statistically significant when p < 0.05. No adjustment for multiple testing was performed in these analyses. Effect size was measured using odds-ratios (OR) with 95 % confidence intervals (CI).
Visualization SNPs associated with COPD located in the CHRNA3/5 region was made using LocusZoom v1.1 [14] available on http://locuszoom.sph.umich.edu/locus zoom/. RefSeq gene/transcript case-control tests for aggregation of genetic variants in the targeted genomic regions were performed using PLINK/SEQ v0.1 [15]. Tests were divided into an analysis of rare variants with minor allele frequency (MAF) < 5 % or common variants (MAF ≥ 5 %). The UNIQ test, which identifies unique risk alleles, was utilized using default parameters to count the total number of alleles found only in cases (risk variants). Similarly, the SKAT burden tests, which assesses excess of rare alleles in cases compared to controls, was also utilized. Since both UNIQ and SKAT burden are 1-sided tests, we also swapped the phenotype information and analyzed the effects in both directions (excess of alleles in cases or controls) separately to capture evidence of both risk and protective alleles. Due to the matched design of the case and control groups no covariate adjustments (age, sex, pack-years) were performed in analysis using the discovery sample. Linkage disequilibrium information of associated variants was extracted from genotypes from the sequencing analysis using Haploview 4.2 [16].

Functional analysis of associated variants
To study the possible effect of associated variants on gene expression, we used information from RegulomeDB, a The "PHRED-scaled" CADD score based on ranks of all SNV in the hg19 genome reference database that combines ENCODE data sets (chromatin immunoprecipitation sequencing (ChIP-seq) peaks, DNase I hypersensitivity peaks, DNase I footprints) with additional data sources (ChIP-seq data from the NCBI Sequence Read Archive, conserved motifs, expression quantitative trait loci (eQTL), and experimentally validated functional variants) [17]. A scoring system is based on the confidence of the functionality of variants, a lower score corresponding to stronger confidence. Subcategories are used to denote additional functional annotations. Combined Annotation Dependent Depletion (CADD) scores were used to assess potential structural and functional effect of associated nonsynonymous variants [18].

Lung expression quantitative trait loci analyses
The existence of expression quantitative trait loci (eQTLs) was investigated as previously described using genotyping and gene expression data from 1,111 patients who underwent lung surgery at one of three sites, Laval University (discovery sample), University of British Columbia, and University of Groningen (replication sample sets) (referred to as Laval, UBC, and Groningen) [19,20]. The eQTL data is derived from non-tumour lung parenchymal samples and expression data were adjusted for age, gender, and smoking status. Estimated P-values for each region were Bonferroni-corrected for multiple testing based on the number of SNPs and probe sets (number of SNPs x number of probe sets) and were considered significant if corrected p < 0.05.

SNP genotyping for validation of rs11728716 and rs8040868
Individuals from the OLIN replication sample (n = 1,534) were genotyped for the rs11728716 and rs8040868 variants (99.2 % and 99.8 % success rate, respectively) at the Uppsala Genome Center (Uppsala, Sweden) using commercially available TaqMan assays (Life Technologies, Carlsbad, CA). Assay conditions were according to manufacturer's recommendations. Effect size was estimated by comparing ORs with 95 % CI between cases and controls. Furthermore, to assess smoking dependence, we measured association and effect sizes also between the groups 'nonsmokers' and 'ever smokers' , and between 'current smokers' and 'former smokers'.

Selection of cases and controls for the discovery sample
The characteristics of each sample are listed in this section as value ± standard deviation. Cases and controls were matched for age (cases: 68 ± 10 years; controls: 66 ± 11 years; p = 0. 15 (Table 1).

Test for association between genetic variants and COPD
We identified 2,151 SNVs after analysis of the sequenced target regions. After variant and sample quality control procedures, 1588 SNVs and 68 cases and 66 controls were retained in the downstream analysis (Fig. 1). Out of the 1588 variants, we identified 37 variants with significantly different allele frequencies in cases and controls (henceforth referred to as 'associated variants') ( Table 2). We initially detected two novel variants in the discovery sample: GRCh37.p13, 5:g.157002804C > G in the ADAM19 gene and GRCh37.p13, 7:g.73477874C > A in ELN. However, using Sanger sequencing of the same sample we excluded both variants, as they were monomorphic. Three of the associated variants were shown to be unique to controls including missense variant rs61741262 (p.Asn1722Ser) in TNS1. The most significantly associated variants were all intronic (GSTCD, rs11728716, p = 6.5 × 10 −5 , OR = 7.4 (2.5-22.5) and MMP12, rs632009, p = 6.7 × 10 −4 , OR = 2.5 (1.5-4.1).
Although the majority of the associated variants were intronic (or intergenic), five were protein-coding (Table 2) (Table 2).
We tested the presence of variants uniquely found in cases or controls as well as gene burden tests in cases against controls as specified in PLINK/SEQ v0.1 using standard settings. We noted that the ADAM19, WNT2, CHRNA5, NOS3 and PTCH1 genes all harbor rare variants (MAF < 5 %) uniquely found in cases (Additional file 3). Conversely, the FGF8, CTNNB1 and HHIP genes contain rare variants uniquely found in the control sample (Additional file 3). Neither gene burden analysis (SKAT) or analysis of rare alleles (MAF < 5 %) yielded significant results. However, by performing a joint analysis with only common alleles (MAF ≥ 5 %) in target regions using SKAT, we showed a significant gene burden for the genes GSTCD, FGF1, ELN and ESR1 (Additional file 4).

In silico analysis of predicted functions of associated variants
According to RegulomeDB, all 37 associated variants were located within known and predicted regulatory elements in intergenic regions (Table 2). We noted that a variant in CHRNA3 (rs8040868) and a variant in GSTCD (rs11728716) each showed a RegulomeDB score of "1f", denoting the presence of transcription binding site or DNAse peak.

Lung eQTL results
According to RegulomeDB, the rs8040868 (CHRNA3) and rs11728716 (GSTCD) variants could present with potential functional effects on gene regulation. To determine if these variant could represent eQTL, we analysed the genotypes and gene expression data in the discovery sample (Laval) as well as replication samples (UBC and Groningen). One of these variants, rs8040868:C > T, was confirmed to be significantly associated with gene expression of the nearby gene CHRNA5 in all three data sets, with the C allele (minor allele) associated with lower CHRNA5 expression ( Fig. 2 and Additional file 7). Interestingly, we could also see a high correlation between rs8040868 and expression of an anti-sense transcript (AF147302) of unknown function from the adjacent IREB2 gene region (data not shown). AF147302 is likely a result of strong bi-directional promoter activity in this region [22].

The rs8040868 variant is associated with COPD in the OLIN replication sample
We also investigated pulmonary data from the replication sample (n = 1,534; cases = 256, controls = 1278). Analysis using RegulomeDB predicted both rs11728716 and rs8040868 variants as being functional (score 1f for both variants). We therefore selected these two variants for genotyping in all available OLIN samples (n = 1,534). The frequency of the rs8040868-C allele was 35 % in the reference group (n = 1,278) and 42 % in the cases (n = 256) resulting in a significant association (p = 0.003) and an OR of 1.4 (95 % CI 1.1-1.7) for COPD. SweGen variant frequency database reports a 39 % frequency of rs8040868 in 1000 whole genomes representing a crosssection of the Swedish population (https://swefreq. nbis.se). The frequency of the homozygous rs8040868-CC genotype was 12 % in the reference group and significantly higher (18 %) in the cases (p = 0.018). When comparing smoking status using all genotyped individuals, no significant difference in allele frequency between neither the groups 'non-smokers' (n = 589) and 'ever smokers' (n = 943) (OR 1.1 95 % CI 1.0-1.7; p = 0.09) nor between the groups 'current smokers' (n = 312) and 'former smokers' (n = 631) (OR 1.2 95 % CI 1.0-1.4; p = 0.11) was seen. The latter test was used to assess nicotine dependency and aptitude for smoking cessation under the assumption that a genetic variant associated with these traits would be underrepresented in a former smoking group as compared to a group of current smokers, i.e., harder to quit smoking. The tests for association with smoking must however be taken with caution as the confidence intervals are wide and a larger sample size would be needed for replication.
Analysis of rs11728716 using the OLIN replication sample (n = 1,534) revealed no association with COPD (p = 0.07; OR = 2.2). In order to test if rs11728716 is associated with severe COPD, we stratified the available COPD cases based on severity and selected cases with FEV1%pred < 40 % and FEV1/FVC < LLN. Our results show a significant association (p = 0.017) between rs11728716 and the group of severe COPD (n = 14). The allele frequency of rs11728716-A was 10 % among cases with severe COPD and 4 % in the controls.

Discussion and conclusion
Genetic variants influencing lung function in children and adults may ultimately lead to the development of COPD [23]. Since limited disease-specific therapy for COPD is available, an improved knowledge of genetic variants modulating the pathogenic mechanisms underlying COPD is greatly needed. We aimed here to identify genetic variants within, or close to, the coding regions of genes and loci previously associated with COPD, or in genes involved in lung development. We opted for a qualitative rather than a quantitative approach with the selection of cases with moderate or severe COPD and progressive decline in lung function. Furthermore, controls were all smokers without COPD that, in our study design, can aid the identification of potential protective genetic variants and aid detection of genetic variants associated with severe COPD. When applying a Bonferroni correction for the total number of variants detected, no variants showed statistically significant association. We did, however, identify several variants with a likely biological significance, as indicated by high effect sizes (odds ratio), that we believe warrants further investigation in a larger sample. Furthermore, potential functional effects of variants were investigated using data from a large number of lung samples and we describe here a COPD lung eQTL.
When comparing our association data with the lung eQTL data (discovery data set from Laval University), we could identify a variant associated with COPD that was also associated with level of gene expression (Fig. 2). This variant, synonymous variant (rs8040868) in CHRNA3 on chromosome 15, confers a risk for the development of COPD in both our OLIN discovery sample with moderate or severe COPD and our OLIN replication sample including all available COPD cases and controls in OLIN (OR 1.4, p = 0.003). In the lung eQTL data, we could see a correlation of the C allele of rs8040868 with lower expression levels of CHRNA5 (Fig. 2), and, to a lesser extent, also CHRNA3 and PSMA4, which are located in close proximity to CHRNA5. The α-nicotinic receptor (CHRNA3/5) gene locus on chromosome 15q25.1 is associated with COPD, lung cancer and peripheral arterial disease, as well as other smoking related conditions [24,25] and nicotine addiction [26,27]. Recently, the CHRNA3/5 locus was implicated in all-cause mortality among smokers in a Finnish cohort [2]. The rs8040868-C allele associates with both reduced pulmonary function and lung cancer [24,25,28,29] and affects DNA-methylation and transcription of CHRNA5 [30]. Furthermore, rs8040868 is also in LD with a nearby variant (rs16969968) previously reported to be associated with expression levels of CHRNA5 in the lung [21]. The direction of effect is the same for both SNPs, with the minor alleles associated with reduced expression of CHRNA5. Also recently, rs16969968 was found to be the most significantly associated variant in an exome array analysis in a study including more than 6,100 COPD cases and 6,000 control subjects across five cohorts [31].
Several genetic variants showed association with COPD in our population, but did not correlate with gene expression levels in the lung, including previously identified variants in the genes glutathione S-transferase, cterminal domain containing (GSTCD), surfactant protein D (SFTPD) and matrix metalloproteinase-12 (MMP12) [32][33][34][35][36]. We identified a haplotype consisting of three risk-conferring variants, rs72671840, rs72671858 and rs11728716 (G-T-A haplotype), at the GSTCD gene locus on chromosome 4q24. The variant rs11728716 has previously been associated with lung function [32][33][34] and is likely to affect the transcription of GSTCD. We show here that rs11728716 was associated with severe COPD using the OLIN replication sample. Although intriguing, due to the limited number of severe COPD cases used in this study, this result needs further verification in a larger sample. The other two variants (rs72671840 and rs72671858) are of unknown function [37]. GSTCD encodes a glutathione S-transferase C-terminal domain protein involved in detoxification by catalysing conjugation of glutathione to products of oxidative stress. We found association between COPD and rs6413520, a synonymous variant, p.(Ser45=), within SFTPD on chromosome