Identification of candidate susceptibility genes for colorectal cancer through eQTL analysis
Closa A, Cordero D, Solé X, Crous-Bou M, Paré-Brunet L, Sanz-Pamplona R, Berenguer A, Aussó S, López-Dóriga A, Alonso H, Moreno V.
Unit of Biomarkers and Susceptibility, Cancer Prevention and Control Program, IDIBELL – Catalan Institute of Oncology and CIBERESP. Barcelona, Spain.
Introduction
To date, genome-wide association studies (GWAS) have identified 26 SNPs in 23 susceptibility loci for colorectal cancer (CRC). Most of these SNPs lie in intergenic regions and are regarded merely as markers, since their functional roles are generally unknown.
The identification of the relevant genes responsible for these associations is important, since they may be considered targets for developing new strategies for prevention or therapy.
Objectives
In this study we aim to identify candidate genes responsible for CRC susceptibility using cis- and trans-eQTL analysis in two series of samples: one of healthy colonic mucosa and another of normal mucosa adjacent to colon cancer. For completeness, we also analyzed the effect in tumor tissue, but these results were not used for discovery since, as is well known, gene expression profiles in tumors are highly altered by diverse mechanisms that may introduce both false-positive and false-negative results.
Methods
Normal mucosa from 100 patients with colon cancer and from 50 healthy donors who underwent colonoscopy was included in the COLONOMICS project (www.colonomics.org). Gene expression data were generated with the Affymetrix Human Genome U219 Array Plate platform. After quality control, a total of 246 arrays were used for subsequent analyses. Raw data were normalized using the RMA algorithm implemented in the Bioconductor affy package for the R statistical computing environment.
Risk SNPs identified in GWAS up to June 2013 (www.genome.gov/GWASstudies) were considered for the analysis. Additional relevant SNPs identified in fine-mapping studies of these regions were also considered. In total, 26 GWAS SNPs plus 4 additional risk SNPs were analyzed. Genotypes were extracted from the Affymetrix Genome-Wide Human SNP 6.0 array, which had been hybridized with genomic DNA extracted from normal colonic mucosa. SNP calling had been performed with the Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM) algorithm as implemented in the R/Bioconductor package crlmm. Genotypes for 18 of the GWAS SNPs were not available in the array and were imputed using IMPUTE2 (v2.2.2) after haplotype phasing with SHAPEIT (v1.ESHG). The 1000 Genomes reference panel (March 2012 version) was used as reference. Imputation quality was > 0.98 for all SNPs.
We reduced the expression data to a single value per gene using principal component analysis (PCA). For each gene, the first principal component (PC) was calculated to summarize the main variability shared by its probe sets. In addition, model-based clustering was applied to detect non-expressed and saturated genes and remove them from further analysis.
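As an illustration of the per-gene summarization described above, the following minimal Python sketch computes first-PC scores for one gene from a samples-by-probe-sets matrix. It is not the code used in the study: the function name `first_pc_score`, the simulated data, and the sign convention (orienting the component so that higher scores correspond to higher expression) are all assumptions made for the example.

```python
import numpy as np

def first_pc_score(probe_matrix):
    """Summarize a (samples x probe sets) matrix as one first-PC score per sample."""
    x = np.asarray(probe_matrix, dtype=float)
    centered = x - x.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    # Assumed convention: flip the axis so loadings sum to a positive value,
    # which makes higher scores track higher overall expression.
    if axis.sum() < 0:
        axis = -axis
    return centered @ axis

# Simulated example: three probe sets measuring the same underlying signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=50)
probes = np.column_stack(
    [signal + 0.1 * rng.normal(size=50) for _ in range(3)]
) + 8.0
scores = first_pc_score(probes)
```

The scores are centered at zero, so they carry relative rather than absolute expression levels, which is sufficient for a correlation-based analysis.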
A region of 2 Mb upstream and downstream of each GWAS SNP was defined, and genes within this region were tested as candidate eQTL genes. An additive genetic model was assumed, and partial correlation (adjusted for tissue type) was used for the analysis. A Bonferroni correction was applied by multiplying the p-values by the number of genes analyzed in the 4 Mb region. For trans-eQTL analysis, the Bonferroni correction accounted for 18,665 genes (p < 1e-7 was considered significant).
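The cis-eQTL test described above can be sketched as follows. This is a hedged illustration only: genotypes are coded as allele dosages (0/1/2) for the additive model, tissue type is a 0/1 indicator, and the p-value uses the Fisher z approximation, which is an assumption of this example and not necessarily the test used in the study. The function names, the simulated data, and the figure of 20 genes in the window are hypothetical.

```python
import math
import numpy as np

def partial_correlation(dosage, expression, tissue):
    """Partial correlation of allele dosage and expression, adjusted for tissue type."""
    x = np.asarray(dosage, dtype=float)
    y = np.asarray(expression, dtype=float)
    # Design matrix: intercept plus the tissue-type indicator.
    z = np.column_stack([np.ones(len(x)), np.asarray(tissue, dtype=float)])
    # Residuals after regressing each variable on the covariate.
    rx = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    ry = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    r = float(rx @ ry / math.sqrt((rx @ rx) * (ry @ ry)))
    # Fisher z approximation; conditioning on one covariate gives n - 4.
    z_stat = math.atanh(r) * math.sqrt(len(x) - 4)
    p = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided p-value
    return r, p

def bonferroni(p_value, n_tests):
    """Multiply the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

# Simulated data: 150 samples, two tissue types, one SNP affecting expression.
rng = np.random.default_rng(1)
tissue = np.repeat([0, 1], 75)
dosage = rng.integers(0, 3, size=150)
expression = 0.5 * dosage + 1.0 * tissue + rng.normal(scale=0.5, size=150)

r, p = partial_correlation(dosage, expression, tissue)
p_adj = bonferroni(p, n_tests=20)  # hypothetical: 20 genes in the 4 Mb window
```

Multiplying the p-value by the number of genes, rather than dividing the threshold, keeps the adjusted values directly comparable across regions containing different numbers of genes.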
Results
The analysis of 30 GWAS SNPs identified three loci with five candidate genes (rs3802842 11q23.1: C11orf53, C11orf93, C11orf92; rs7136702 12q13.1: DIP2B; and rs5934683 Xp22.3: SHROOM2). The expression of these genes varied linearly with the genotype of the corresponding GWAS SNP. The expression of the three genes in 11q23.1 was highly correlated, and none of them could be singled out as the preferred candidate. Once these genes were identified, a detailed analysis was performed to search for alternative SNPs more strongly associated with the expression of the respective gene. For chromosome 11, we analyzed 27 other SNPs in high LD (r² > 0.8) with rs3802842 and calculated their partial correlation with a summary expression value for the three C11orf genes derived from a PCA. SNP rs7130173 was the most significant in the region, and a conditional analysis showed that it dominated the eQTL association.
For chromosome 12, a similar analysis identified rs61927768, located 40 bp from the TSS of DIP2B, as the most relevant SNP in the region (p-value 2.2e-16). Furthermore, we analyzed this region with JASPAR and TRANSFAC and found that this SNP lies in an SP1 binding site, with an expression correlation of 0.38.
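The idea behind a conditional analysis between SNPs in high LD, as mentioned for the 11q23.1 locus, can be sketched with simulated data. Everything here is an assumption for illustration: the helper names, the simulated "driver" and "proxy" SNPs, and the use of the squared correlation between genotype dosages as an approximation to LD r². The sketch compares each SNP's partial correlation with expression after conditioning on tissue type and on the other SNP.

```python
import numpy as np

def residualize(values, covariates):
    """Residuals of values after least-squares regression on covariates plus intercept."""
    v = np.asarray(values, dtype=float)
    cols = [np.ones(len(v))] + [np.asarray(c, dtype=float) for c in covariates]
    z = np.column_stack(cols)
    return v - z @ np.linalg.lstsq(z, v, rcond=None)[0]

def conditional_corr(snp, expression, conditioned_on):
    """Partial correlation of a SNP with expression given a list of covariates."""
    rs = residualize(snp, conditioned_on)
    re = residualize(expression, conditioned_on)
    return float(rs @ re / np.sqrt((rs @ rs) * (re @ re)))

# Simulated data: a hypothetical driver SNP and a proxy SNP in high LD with it.
rng = np.random.default_rng(2)
n = 300
tissue = np.repeat([0, 1], n // 2)
driver = rng.integers(0, 3, size=n)
proxy = driver.copy()
redraw = rng.random(n) < 0.1  # re-draw about 10% of proxy genotypes
proxy[redraw] = rng.integers(0, 3, size=redraw.sum())
expression = 1.0 * driver + 0.5 * tissue + rng.normal(scale=0.3, size=n)

# Dosage-based approximation to LD r-squared between the two SNPs.
ld_r2 = np.corrcoef(driver, proxy)[0, 1] ** 2
r_driver_given_proxy = conditional_corr(driver, expression, [tissue, proxy])
r_proxy_given_driver = conditional_corr(proxy, expression, [tissue, driver])
```

In this simulation the driver SNP retains a clear association after conditioning on the proxy, whereas the proxy's association largely disappears after conditioning on the driver, which is the pattern expected when one SNP dominates the signal.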
Conclusions
We have identified candidate genes in three GWAS loci that are strong eQTLs. These findings are relevant because they open the path for further functional studies that may reveal intervention strategies for preventing CRC.