head of
- Weill Cornell ALACRITY Center Director, Research Methods Core 2017 -
- MS Program in Biostatistics and Data Science Program Director 2016 - 2020
Dr. Banerjee has over a decade of experience as a biostatistician in biomedical collaborations, which span randomized clinical trials, observational cohort studies, comparative effectiveness research, analysis of mHealth data (smartphones and wearables), analysis of “big data” (EHR, claims, large registries and cohorts), statistical genetics, and cancer genomics. His methodological research interests are in multivariate statistics and its various applications, including longitudinal data and repeated measures, and in high-dimensional statistical problems in variable selection and prediction. Recently, he has developed interests in the use of machine learning algorithms to elucidate digital phenotypes of individuals using mHealth data (smartphones and wearables) and to predict healthcare outcomes using “big” administrative data. He has served as the primary biostatistician on 20+ federally funded grants. He is the Director of the Research Methods Core of the Weill Cornell ALACRITY Center (P50 MH113838), where he designs and oversees the analysis of clinical trials and leads the development of statistical methodologies for “big” data (electronic health records and health insurance claims), data with multivariate longitudinal outcomes, data from sensors and wearables, and brain imaging data (e.g., fMRI). He has served as the statistical mentor of nine fellows and junior faculty and has been the primary mentor for three post-doctoral scholars. He was the founding director of the MS program in Biostatistics and Data Science offered by WCMC and served in that role until 2020.
Contribution to Science
1. Digital phenotyping and digital interventions with mHealth Data: Recent advancements in mHealth technology, owing to the ubiquitous use of smartphones and wearable devices, have increased the potential of research studies to collect behavioral data on participants in their natural environment through active sensing (frequent surveys delivered via an app) and passive sensing (data obtained from the devices' sensors). The massive amount of longitudinal data collected on each individual is multi-modal and complex, and has a high degree of missingness. I am leading several projects that seek to develop a statistical framework and produce open-source software to analyze such complex data. Specifically, I have led the development of a pre-processing algorithm (2SpamH) that identifies under-recorded passive data as “missing data” using a two-step K-nearest neighbor algorithm and imputes them with machine-learning approaches. Using the pre-processed data, I have developed a functional data analysis framework to visually represent and analyze complex longitudinal patterns using flexible statistical models, and I have used this methodology to predict behavioral activation from passively collected activity data (e.g., step counts). This set of analyses presents a digital phenotype of individuals participating in research studies that record their behavior through mHealth apps. The ability to utilize pre-processed passive data has also led to the development of prediction models that utilize a specialized branch of machine learning (semi-supervised learning) to predict adherence to psychotherapy among older adults with depression. I have incorporated this algorithm in a just-in-time digital intervention to promote adherence to psychotherapy and am currently proposing to test the intervention in a clinical trial that is part of the renewal proposal for our Weill Cornell ALACRITY Center. Some of this work is ongoing or in peer review.
2. Predictors of Health Outcomes including Social Determinants with Big Data: I am leading several studies that utilize “big” real-world data (e.g., health insurance claims, electronic health records and registries) to develop predictive models of various health outcomes (e.g., healthcare utilization, severe outcomes in COVID+ patients). Specifically, I have utilized the Health Care Cost Institute data on health insurance claims of over 50 million commercially insured individuals to study adverse mental health outcomes and preventable hospitalization among a cohort of depressed middle-aged and older adults, to examine the effect of social deprivation on risk factors for suicidal ideation and suicide attempts in commercially insured US youth and adults, and to characterize healthcare utilization patterns among patients with psychiatric hospitalization. I have also utilized a COVID registry and the New York City-wide electronic health record repository (NYC-CDRN) to predict severe outcomes such as intubation or death among COVID+ patients. The salient feature of these predictive models is a method to harmonize longitudinal predictors (e.g., laboratory values, vital signs, diagnostic history) in a predictive model framework, paying particular attention to missing data and how they can be modeled. I have also utilized the same COVID data to show that the social deprivation index (SDI) plays an important role in who acquires COVID-19 and in its severity; but once a patient is hospitalized, SDI appears less important.
3. Clinical Trials in Mental and Behavioral Health: As the primary biostatistician, I have designed and analyzed several randomized trials (including cluster randomized trials) that studied various behavioral interventions, psychotherapies, drugs, and home care management interventions in older adults with depression, psychosis, and bipolar disorder. Drawing on my expertise in multivariate methodology and longitudinal data, I used various statistical models, including linear and generalized linear mixed-effects models, multi-level hierarchical models (for cluster randomized trials) and generalized estimating equations, to analyze data from these trials. One salient statistical feature of such trials in older adults is a high degree of missing data. I have incorporated state-of-the-art statistical techniques, such as pattern mixture models and shared parameter analysis, to account for such issues. I have also constructed models to evaluate moderators and mediators of treatment response. In addition, I have applied advanced statistical techniques such as variable selection (e.g., LASSO, elastic net), multivariate methodology (e.g., cluster and factor analysis) and sub-group identification using latent class mixed models and latent growth curve models to increase the information yield from clinical trial data by generating hypotheses of personalized treatment effects.
4. Multivariate Methodology: Research studies in medicine do not always analyze multiple correlated outcomes, primarily due to the difficulty in interpretation and the statistical complexity. My primary research interest is to understand the interplay between multiple correlated outcomes in determining treatment efficacy, mediating treatment effect and discovering patient sub-groups. The main step in analyzing multiple correlated outcomes is to model the covariance/correlation between these outcomes accurately. To do so, I have studied the estimation of the covariance matrix in higher dimensions and proposed an improved estimator which shows robust performance in a wide range of situations. I have developed a Bayesian multivariate model in the context of quantitative trait loci (see contribution to statistical genetics) to detect genetic loci jointly affecting multiple correlated outcomes/traits. In the spirit of multivariate statistics, I have also developed methods for performing multivariate meta-analysis of survival curves and applied them to distributed health network data. In addition to my research in multivariate methodology, I have applied multivariate clustering and classification techniques (e.g., hierarchical clustering, linear discriminant analysis, factor analysis) to identify patient sub-groups in various applications (e.g., sub-groups based on clinical profile in mental health research).
5. Comparative Effectiveness Research: To further my knowledge of statistical issues in observational studies and as the Chief Statistician of the International Consortium of Orthopaedic Registries (ICOR), I collaborate with a team of researchers to study the comparative effectiveness (CE) of orthopedic devices in hip and knee replacement surgeries. Due to the lack of clinical trials, there is a gap in evidence on the CE of these devices. ICOR is a distributed network of international and national joint registries that participate by providing summary information on these devices for several risk factors. I have developed a semi-parametric Bayesian methodology to meta-analyze time to revision surgery (a survival outcome) from these multiple registries to provide CE estimates. Unlike conventional approaches to meta-analysis, this approach can investigate time trends of survival curves and explore interaction effects.
6. Cancer genomics: I have collaborated extensively with a team of scientists on studies to understand the molecular pathology of prostate cancer. The first decade of the 21st century saw a large number of genome-wide association studies for single nucleotide polymorphisms (SNPs). However, association studies for other structural variants such as copy number variants (CNVs) were rare, partly due to the lack of methods to infer CNVs from array data for germline variants. I developed a computational method to detect copy number variants from array data. The method was applied to a genome-wide association study of germline CNVs in prostate cancer, which found a couple of functionally active, low-frequency CNVs associated with risk of prostate cancer. In addition, I have worked on gene expression data from various platforms, genome-wide association studies of SNPs and copy number variants, enrichment of molecular pathways, the impact of structural variants on the evolution of pathways, and next-generation sequencing data.
7. Statistical genetics: In quantitative genetics, one of the goals is to find genomic positions that are associated with and linked to complex traits or outcomes. Complex outcomes underlying a disease are rarely uncorrelated, yet they were typically analyzed independently. For my dissertation, I developed a Bayesian model selection procedure to select genomic locations that are jointly associated with multiple correlated traits. I have also collaborated on developing hierarchical models for various types of traits (e.g., ordinal, binary, and categorical).
Course Director - Big Data in Medicine - (MS program in Biostatistics and Data Science) 2020-Present
Course Director - Data Mining and Statistical Learning - (MS program in Biostatistics and Data Science) 2018-Present
Course Director – Introduction to Biostatistics (4 credit hour course for MS in Health Informatics, MS in Health Policy & Economics, Certificate in Health Analytics) 2014 – 2016
Course Director for Statistical Methods for Observational Studies (Masters in Clinical Investigation) 2014 – 2019
Mentored Clinical Research Training Program, Clinical and Translational Science Center, Weill Cornell Medical College – Small group leader (mentoring junior researchers on their study designs, protocols, grant applications). 2013 - Present
Summer Research Institute in Geriatric Mental Health (NIMH sponsored) – Group mentorship on grant proposals of junior researchers. 2017
Instructor for Statistical Methods for Observational Studies (Masters in Clinical Investigation) (Survival Analysis) 2013
Instructor for Medicine, Patients, and Society I Epidemiology and Biostatistics Component 2012
Instructor for Introduction to Biostatistics for residents of Radiotherapy 2009
Teaching Assistant for Quantitative Methods in Epidemiology, (Graduate level Epidemiology course) (Summer 2007), University of Alabama at Birmingham 2007
Tutoring service for Biostatistics for Public Health, (Graduate level course for Public Health personnel) (Fall 2003) at University of Alabama at Birmingham 2003