Hierarchical modeling for estimating relative risks of rare genetic variants: properties of the pseudo-likelihood method.
Academic Article
Abstract
Many major genes have been identified that strongly influence the risk of cancer. However, there are typically many different mutations that can occur in the gene, each of which may or may not confer increased risk. It is critical to identify which specific mutations are harmful, and which ones are harmless, so that individuals who learn from genetic testing that they have a mutation can be appropriately counseled. This is a challenging task, since new mutations are continually being identified, and there is typically relatively little evidence available about each individual mutation. In an earlier article, we employed hierarchical modeling (Capanu et al., 2008, Statistics in Medicine 27, 1973-1992) using the pseudo-likelihood and Gibbs sampling methods to estimate the relative risks of individual rare variants using data from a case-control study and showed that one can draw strength from the aggregating power of hierarchical models to distinguish the variants that contribute to cancer risk. However, further research is needed to validate the application of asymptotic methods to such sparse data. In this article, we use simulations to study in detail the properties of the pseudo-likelihood method for this purpose. We also explore two alternative approaches: pseudo-likelihood with correction for the variance component estimate as proposed by Lin and Breslow (1996, Journal of the American Statistical Association 91, 1007-1016) and a hybrid pseudo-likelihood approach with Bayesian estimation of the variance component. We investigate the validity of these hierarchical modeling techniques by looking at the bias and coverage properties of the estimators as well as at the efficiency of the hierarchical modeling estimates relative to that of the maximum likelihood estimates. 
The results indicate that the estimates of the relative risks of very sparse variants have small bias, and that the estimated 95% confidence intervals are typically anti-conservative, though the actual coverage rates are generally above 90%. The widths of the confidence intervals narrow as the residual variance in the second-stage model is reduced. The results also show that the hierarchical modeling estimates have shorter confidence intervals than estimates obtained from conventional logistic regression, and that these relative improvements increase as the variants become rarer.
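
The two width findings above can be illustrated with a deliberately simplified sketch. It is not the pseudo-likelihood method studied in the article. It assumes a plain normal-normal shrinkage model in which the second-stage mean and the residual variance are treated as known, whereas the article estimates them from the data. The names `shrink`, `ci_width`, `width_ratio`, `prior_mean`, and `tau2` are hypothetical and chosen only for this illustration.

```python
import math


def shrink(b, v, prior_mean, tau2):
    """Normal-normal shrinkage of a single first-stage estimate.

    b          -- first-stage (maximum likelihood) log relative risk estimate
    v          -- sampling variance of b (large for rare variants)
    prior_mean -- second-stage mean for this variant (ASSUMED known here)
    tau2       -- second-stage residual variance (ASSUMED known here)

    Returns (posterior mean, posterior variance).
    """
    w = tau2 / (tau2 + v)  # weight given to the first-stage estimate
    post_mean = w * b + (1.0 - w) * prior_mean
    post_var = w * v  # algebraically equal to 1 / (1/v + 1/tau2)
    return post_mean, post_var


def ci_width(var, z=1.96):
    """Width of a nominal 95% normal-theory interval with the given variance."""
    return 2.0 * z * math.sqrt(var)


def width_ratio(v, tau2):
    """Shrunken interval width divided by the unshrunken (ML) interval width."""
    _, post_var = shrink(0.0, v, 0.0, tau2)
    return ci_width(post_var) / ci_width(v)


if __name__ == "__main__":
    # Larger v mimics a rarer variant; smaller tau2 mimics a second-stage
    # model that explains more of the between-variant variation.
    for v, tau2 in [(1.0, 1.0), (4.0, 1.0), (1.0, 0.25)]:
        print(f"v={v}, tau2={tau2}: width ratio = {width_ratio(v, tau2):.3f}")
```

Under these assumptions the width ratio is sqrt(tau2 / (tau2 + v)), which is always below 1. It falls as the sampling variance `v` grows (rarer variants) and as `tau2` shrinks (smaller residual variance in the second-stage model), matching the direction of the reported results. The sketch says nothing about bias or coverage, which the article assesses by simulation.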