Pitfalls when conducting comparative analyses or meta-analyses of GWAS results


Violation of independence
An important assumption in a meta-analysis is that the datasets are sampled independently. If individuals overlap between datasets, each dataset may show a similar association signal, which artificially confirms the signal and can lead the meta-analysis to report a spurious association.

Difference in sample size
A simple intersection analysis can suffer from differences in sample size between datasets, because weaker but true associations may go undetected in the smaller dataset. This leads to false negative findings in the intersection analysis. Meta-analyses typically mitigate this issue by weighting the datasets according to their size.
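The weighting idea can be illustrated with a sample-size weighted Z-score combination (often called Stouffer's method). This is a minimal sketch, not necessarily the scheme used by easyGWAS or any particular tool: the function name `weighted_z_meta` and its inputs are hypothetical, and it assumes each dataset supplies a signed Z-score for the same variant and effect allele.

```python
import math


def weighted_z_meta(z_scores, sample_sizes):
    """Combine per-dataset Z-scores, weighting each by sqrt(sample size).

    Hypothetical helper for illustration only. Assumes the Z-scores refer
    to the same variant and the same effect allele in every dataset.
    """
    if len(z_scores) != len(sample_sizes):
        raise ValueError("need one sample size per Z-score")
    if not z_scores:
        raise ValueError("need at least one dataset")

    # Larger datasets receive larger weights.
    weights = [math.sqrt(n) for n in sample_sizes]

    # Weighted sum of Z-scores, rescaled to unit variance under the null.
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_combined = numerator / denominator

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_combined) / math.sqrt(2))
    return z_combined, p_value


# Made-up values: a modest signal in a small dataset, a stronger one in a large dataset.
z, p = weighted_z_meta([1.5, 3.0], [200, 2000])
print(f"combined Z = {z:.3f}, two-sided p = {p:.2e}")
```

Under this scheme a weak signal in the small dataset still contributes to the combined statistic, rather than being discarded because it missed a significance threshold on its own.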

Interaction effects
An association signal may be present in one dataset, but absent in another, because the individuals in one dataset are exposed to an environmental effect which triggers a gene-environment interaction. Individuals in both datasets may be genetically susceptible to this effect, but it will not be observed in one dataset because the environmental effect is absent there. This would again lead to false negative findings in comparative analyses. A similar phenomenon can occur if a gene-gene interaction affects the phenotype, and the relevant interacting genotype is present in one dataset but not the other. Then the same gene may be significantly associated in only one of the two datasets.

Genetic heterogeneity
Population structure can produce apparent associations at loci that are merely correlated with geography and with local environmental influences that affect the phenotype. Furthermore, systematic ancestry differences between phenotypic classes can lead to spurious associations, just as in a genome-wide association study on a single dataset. If two or more phenotypes are significantly correlated with kinship, they may show shared genetic association signals due to this confounding. easyGWAS offers association mapping techniques that correct for confounding by population structure in the form of linear mixed models. easyGWAS also flags phenotypes in the phenotype correlation matrix that are significantly associated with population structure, to inform the user of this source of potentially spurious joint associations.

When combining datasets in a meta-analysis, one can assume a fixed-effects model, in which the effect of a genetic variant is the same in all datasets. Although the fixed-effects model is the most popular approach to meta-analysis, this assumption is violated if the datasets exhibit large genetic heterogeneity. As a consequence, the model can yield inflated association statistics, that is, p-values that are smaller than warranted.
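As an illustration, the fixed-effects estimate is commonly computed by inverse-variance weighting, and heterogeneity between datasets is commonly summarized with Cochran's Q and the I² statistic. The sketch below makes these assumptions explicit. The function name `fixed_effects_meta` and the input values are hypothetical, and neither Q nor I² is mentioned in the text above; they are added here only as widely used diagnostics.

```python
import math


def fixed_effects_meta(betas, std_errs):
    """Inverse-variance fixed-effects meta-analysis for a single variant.

    Hypothetical helper for illustration only. Returns the pooled effect,
    its standard error, Cochran's Q, and I^2 (fraction of variation
    attributed to heterogeneity rather than sampling error).
    """
    if len(betas) != len(std_errs) or len(betas) < 2:
        raise ValueError("need matching effects and standard errors from >= 2 datasets")

    # Each dataset is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errs]
    weight_sum = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)

    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1

    # I^2 is truncated at zero when Q is below its expected value (df).
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled_beta, pooled_se, q, i_squared


# Made-up per-dataset effect sizes and standard errors for one variant.
beta, se, q, i2 = fixed_effects_meta([0.10, 0.30, 0.12], [0.05, 0.06, 0.04])
print(f"pooled beta = {beta:.3f} (SE {se:.3f}), Q = {q:.2f}, I^2 = {i2:.2f}")
```

A large Q relative to its degrees of freedom, or a high I², suggests that the equal-effects assumption may not hold and that the fixed-effects result should be interpreted with caution.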

Phenotypic heterogeneity
Different protocols for measuring phenotypes in different datasets, including different levels of replication of the phenotypes, can lead to artificial differences between the associations found in those datasets, despite a common genetic architecture.

Publication bias
It has been observed that studies that find association signals are more likely to be published. This form of publication bias affects the meta-analysis of published studies, because the null hypothesis of no association between genotype and phenotype has already been rejected in each individual dataset (Rothstein et al., 2005). It is not well understood exactly how publication bias affects GWAS. Nevertheless, it is clear that if the studies to be combined in a GWAS meta-analysis focus on results that, a priori, seem more favorable, the bias will be present (Zeggini and Ioannidis, 2009).

Missing genotypes
The use of different genotyping platforms often results in different sets of genetic markers being present in different datasets. As a result, the SNP with the strongest association in one dataset may not even be present in another dataset. This makes it necessary to either 1) restrict the analysis to SNPs that are present in all datasets, 2) analyze hits on the gene level rather than the SNP level, or 3) impute missing SNPs in each dataset. Options 1) and 2) are currently offered by easyGWAS. Special caution is needed with datasets that have been imputed as described in option 3) (Bush and Moore, 2012). If the studies to be combined in a comparative analysis or meta-analysis have been imputed with different algorithms and/or different haplotype panels drawn from an ethnic population that differs from the target one, this will create additional heterogeneity in the data.
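Option 1) amounts to a set intersection over marker identifiers. The following is a minimal sketch and not the easyGWAS implementation: the function name `restrict_to_shared_snps`, the dictionary layout, and the SNP identifiers are hypothetical, and it assumes that identifiers have already been harmonized across platforms (same naming scheme and genome build).

```python
def restrict_to_shared_snps(datasets):
    """Keep only the SNPs that are present in every dataset.

    Hypothetical helper for illustration only.
    datasets: list of dicts mapping SNP identifier -> per-SNP value
              (for example a p-value or an effect size).
    Returns a list of dicts in the same order, restricted to shared SNPs.
    """
    if not datasets:
        return []

    # Intersect the identifier sets of all datasets.
    shared = set(datasets[0])
    for dataset in datasets[1:]:
        shared &= set(dataset)

    # Rebuild each dataset with the shared SNPs in a stable (sorted) order.
    return [{snp: dataset[snp] for snp in sorted(shared)} for dataset in datasets]


# Made-up p-values from two platforms with partially overlapping markers.
platform_a = {"snp1": 0.01, "snp2": 0.50, "snp3": 0.20}
platform_b = {"snp2": 0.03, "snp3": 0.90, "snp4": 0.10}
for restricted in restrict_to_shared_snps([platform_a, platform_b]):
    print(restricted)
```

In this made-up example the strongest hit of the first dataset ("snp1") is dropped because the second platform does not carry it, which is exactly the loss of information described above.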