Research Highlights

Explorations: Some New Problems

Data-driven Discovery of Natural Laws: Designing special sparse deep learning networks to identify equations that represent the data 

Simplicity-Complexity Gap in Epidemic Models: Given a certain amount of noise, how much do simple models differ from complex models under homogeneous mixing and on networks?

Optimizing Policy in Epidemic-Human Behavior Co-evolution: Graph algorithms to identify the best policies under simple models of epidemics and behavior flow.

Shape-based Representation, Evaluation, Clustering, Classification, and Ensembles

While there is great value in predicting numerical targets to assess the burden of a disease, we argue that there is also value in communicating the future trend of the epidemic, that is, a description of its shape. Instead of treating this as a classification problem (one out of n shapes), we define a transformation of the numerical forecasts into a shapelet-space representation. In this representation, each dimension corresponds to the similarity of the shape to one of the shapes of interest (a shapelet). We prove that this representation satisfies the property that two shapes that one would consider similar are mapped close to each other, and vice versa. With this representation, we define an evaluation measure and a measure of agreement among multiple models. We also define the shapelet-space ensemble of multiple models, which is the mean of the shapelet-space representations of all the models. We show that this ensemble can accurately predict the shape of the future trend for COVID-19 cases and trends. We also show that the agreement between models is a good indicator of the reliability of the forecast.
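The transformation and the ensemble described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published method: the shapelet set, the window length of 4, and the use of Pearson correlation as the similarity are all choices made here for the example. A "flat" shapelet is left out because correlation with a constant vector is undefined and needs separate handling.

```python
import numpy as np

# Hypothetical shapelets of length 4 (illustrative only).
SHAPELETS = {
    "inc": np.array([1.0, 2.0, 3.0, 4.0]),
    "dec": np.array([4.0, 3.0, 2.0, 1.0]),
    "peak": np.array([1.0, 2.0, 2.0, 1.0]),
    "surge": np.array([1.0, 2.0, 4.0, 8.0]),
}


def similarity(window, shapelet):
    """Assumed similarity: Pearson correlation, or 0 for a constant window."""
    w = np.asarray(window, dtype=float)
    if w.max() == w.min():
        return 0.0  # correlation is undefined for a constant window
    return float(np.corrcoef(w, shapelet)[0, 1])


def ssr(window):
    """Map a forecast window to one similarity value per shapelet."""
    return np.array([similarity(window, s) for s in SHAPELETS.values()])


def ssr_ensemble(forecasts):
    """Shapelet-space ensemble: the mean of the models' representations."""
    return np.mean([ssr(f) for f in forecasts], axis=0)


if __name__ == "__main__":
    model_a = [100, 150, 210, 260]  # rising forecast
    model_b = [100, 140, 150, 120]  # forecast that peaks
    for name, value in zip(SHAPELETS, ssr_ensemble([model_a, model_b])):
        print(f"{name}: {value:+.2f}")
```

Because the representation is a plain vector, agreement between two models can be measured with any vector distance between their representations; which distance to use is a further design choice not fixed by this sketch.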

To address the evaluation of long-term forecasts, we extend this idea. (i) First, we slide a moving window over the time series and transform each window into our Shapelet Space Representation (SSR), where each dimension represents the similarity of the shape to one of the "shapelets", the shapes of interest (e.g., inc, peak, surge, flat). This results in a matrix representation of the time series in which each column is the SSR of a window. (ii) Then, given the matrix representations of two trajectories, we use dynamic time warping (DTW) to allow flexibility in the alignment of the time series and compare shapes in the form of columns of the matrix representations. As a result, similar local trends are aligned before they are compared. We have already shown that measuring distance this way (DTW+S) results in better clustering and classification.
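Step (ii) can be illustrated with textbook dynamic time warping applied to the columns of two such matrices. This is a sketch under assumptions: the Euclidean distance between columns and the unconstrained warping path are choices made here, and the function name `dtw_distance` is hypothetical. The matrices are taken as given, with one column per window.

```python
import numpy as np


def dtw_distance(A, B):
    """Classic DTW between the columns of two matrices.

    A has shape (d, n) and B has shape (d, m); each column is the
    shapelet-space representation of one window.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n, m = A.shape[1], B.shape[1]
    # cost[i, j] = best cumulative cost aligning the first i columns of A
    # with the first j columns of B.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[:, i - 1] - B[:, j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


if __name__ == "__main__":
    # Two trajectories with the same sequence of shapes, shifted in time.
    A = [[0, 0, 1, 1],
         [1, 1, 0, 0]]
    B = [[0, 1, 1, 1],
         [1, 0, 0, 0]]
    print(dtw_distance(A, B))
```

In this toy input the two trajectories pass through the same shapes at different times, so warping aligns them at zero cost, whereas a column-by-column comparison would penalize the shift.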


COVID-19 Modeling, Forecasting, and Projections

We proposed the SIkJalpha model at the beginning of the COVID-19 pandemic. Over the years, as the pandemic evolved, more complexities were added to capture crucial factors and variables that can assist with projecting desired future scenarios. 

Throughout the pandemic, multi-model collaborative efforts have been organized to produce short-term forecasts (cases, deaths, and hospitalizations) of COVID-19 and long-term scenario projections. We have been participating in five such efforts: the US Scenario Modeling Hub, the US Forecast Hub, the Europe Scenario Modeling Hub, the Europe Forecast Hub, and the Germany/Poland Forecast Hub.

[Paper on an early version] [Paper on evolution of the model] [Scenario Modeling at CDC MMWR] [PNAS on US Forecast Hub]

Influenza Modeling and Forecasting

The lack of influenza case tracking makes it difficult to use traditional epidemiological models for influenza hospitalization forecasting. However, hospitalization data from multiple past seasons provide an opportunity for machine learning.

We hypothesize that we can improve forecasting by using multiple mechanistic models to produce potential trajectories and then using machine learning to learn how to combine those trajectories into an improved forecast. We propose a Tree Ensemble model design that utilizes the individual predictors of our baseline model SIkJalpha to improve its performance. Each predictor is generated by changing a set of hyper-parameters. We compare our prospective forecasts deployed for the FluSight challenge (2022) to all the other submitted approaches. Our approach is fully automated and does not require any manual tuning. We demonstrate that our Random Forest-based approach improves upon the forecasts of the individual predictors in terms of mean absolute error, coverage, and weighted interval score. Our method outperforms all other models in terms of the mean absolute error and the weighted interval score, based on the mean across all weekly submissions in the current season (2022). Explainability of the Random Forest (through analysis of the trees) enables us to gain insights into how it improves upon the individual predictors.
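For readers unfamiliar with the weighted interval score mentioned above, the sketch below computes it as it is commonly defined for quantile-based forecasts: a weighted sum of the absolute error of the median and the interval scores of the central prediction intervals. The function names and the example numbers are illustrative and are not taken from our evaluation.

```python
def interval_score(lower, upper, y, alpha):
    """Score of a central (1 - alpha) prediction interval for observation y.

    Width of the interval plus a penalty, scaled by 2 / alpha, for how far
    the observation falls outside it.
    """
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score


def weighted_interval_score(median, intervals, y):
    """Weighted interval score; lower is better.

    intervals maps alpha -> (lower, upper) for each central interval.
    The median gets weight 1/2 and each interval gets weight alpha / 2.
    """
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)


if __name__ == "__main__":
    # Made-up forecast: median 100, 50% interval (90, 110), 90% interval (70, 130).
    forecast_intervals = {0.5: (90.0, 110.0), 0.1: (70.0, 130.0)}
    print(weighted_interval_score(100.0, forecast_intervals, y=100.0))
    print(weighted_interval_score(100.0, forecast_intervals, y=140.0))
```

The score rewards narrow intervals only when they still contain the observation, which is why it is reported alongside coverage and mean absolute error.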


Graph Neural Networks

Training and inference on deep GNNs on large graphs are difficult due to computational complexity and lack of accuracy improvements with deeper layers. Subgraph-based methods to address training on large graphs exist, but they do not apply during inference, making inference the bottleneck. Such methods also do not address poor accuracy for deep networks due to "oversmoothing". We address the following challenges: (i) Developing subgraph-based schemes that apply to training and inference. (ii) Identifying good subgraph-sampling strategies. (iii) Pruning weights to reduce computations during inference.
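To make challenge (i) concrete, the sketch below extracts a small subgraph around a target node; the same routine can be called at training time and at inference time, so a deep GNN only ever sees a bounded neighborhood. It is a generic k-hop sampler with a per-node neighbor budget, written for illustration. The function name, the `fanout` parameter, and the uniform sampling rule are assumptions and do not reproduce the samplers in our papers.

```python
import random


def sample_khop_subgraph(adj, target, num_hops=2, fanout=3, seed=0):
    """Return (nodes, edges) of a sampled subgraph around `target`.

    adj maps each node to a list of its neighbors. At every hop, each
    frontier node contributes at most `fanout` uniformly sampled neighbors.
    """
    rng = random.Random(seed)
    nodes = {target}
    frontier = [target]
    for _ in range(num_hops):
        next_frontier = []
        for u in frontier:
            neighbors = adj.get(u, [])
            if len(neighbors) > fanout:
                neighbors = rng.sample(neighbors, fanout)
            for v in neighbors:
                if v not in nodes:
                    nodes.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    # Keep every original edge whose endpoints were both sampled.
    edges = {(u, v) for u in nodes for v in adj.get(u, []) if v in nodes}
    return nodes, edges


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 - 4 stored as adjacency lists.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    nodes, edges = sample_khop_subgraph(path, target=0, num_hops=2)
    print(sorted(nodes), sorted(edges))
```

Because the subgraph size depends on `num_hops` and `fanout` rather than on the number of GNN layers, model depth can be increased without enlarging the neighborhood that each prediction touches.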

NeurIPS 2021, VLDB 2021, ICLR 2020, IPDPS 2019 

Past Research

Prior to my faculty position, I worked on a range of problems, from theoretical to experimental to real-world deployments, that involved a mix of Algorithms, Network Science, and Data Mining. The figure summarizes my past research. Please see my Publications page or contact me to learn more about my contributions to these problems.