Colloquia

When: Thursday, September 7, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. John Hughes, Department of Community and Population Health, Lehigh University

Abstract: Areal, i.e., spatially aggregated, data arise in many fields, including ecology, disease mapping, bioimaging, and childhood vaccination refusal. Failure to account for spatial dependence in such data can compromise the quality of regression inference. Traditional areal models can accommodate spatial dependence, but these models exhibit well-known pathologies and face serious computational challenges. In this talk I will present Bayesian spatial filtering (BSF) models, which address the challenges faced by traditional areal models. BSF is advantageous from both modeling and computational points of view, offering improved interpretability and scalable computation. I will illustrate how posterior inference for BSF models can be done easily and efficiently by using R and Stan to analyze a large dataset from ecology.

When: Thursday, September 14, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Drs. Karl Gregory, Lianming Wang, Ray Bai, Dewei Wang, Joshua Tebbs, Xiaoyan Lin, Edsel Pena, Ting Fung Ma, David Hitchcock, Xianzheng Huang, Brian Habing, Yen-Yi Ho

When: Thursday, September 21, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Sarah Lotspeich, Department of Statistical Sciences, Wake Forest University

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access come disparities in well-being, leading to disproportionate rates of disease in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (the length of the shortest straight-line route) or accurate (the length of the shortest map-based route), but not both. We propose a hybrid statistical approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the “gold standard” model using map-based distances for all neighborhoods and improved efficiency over the “complete case” model using map-based distances for just the subset. Through the adoption of a multiple imputation for measurement error framework, information from the straight-line distances can be used to fill in map-based measures for the remaining neighborhoods. Using data for Forsyth County, North Carolina, and its surrounding counties, we quantify and compare the associations between various health outcomes (coronary heart disease, diabetes, high blood pressure, and obesity) and different measures of neighborhood-level food access.

When: Thursday, September 28, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Edsel Peña, Department of Statistics, University of South Carolina

Abstract: With very high probability, you must have heard of Karl Pearson’s correlation coefficient (PCC), which measures the degree of linear association between two variables, or between paired values of two quantitative features. But, have you ever heard of Lin’s Concordance Correlation Coefficient (CCC), which measures the degree of agreement between two variables, or between paired values of two quantitative features? If you have not heard of the CCC, come to this talk; while, if you have heard about it, you should still come to this talk! One of the most important problems in statistics, machine learning, and in many scientific fields, is that of prediction: how to use the values of a set of features to predict the value of a feature of particular interest. We will demonstrate in this talk predictors that maximize the agreement, measured via the CCC, between the predictor and the predictand, and contrast these maximum agreement predictors (MAP) with the well-known least-squares predictor (LSP), which minimizes the mean-squared prediction error, or which maximizes the PCC between the predictor and the predictand. We will provide finite and asymptotic properties of these predictors, and then demonstrate these predictors using two real data sets: an eye data set, and a body fat data set.

This work is joint with Taeho Kim, George Luta, Matteo Bottai, Ben Lieberman, Pierre Chausse, and Gheorghe Doros. The talk is based on a paper in The American Statistician (2022, vol. 76, pp. 313-321) and an arXiv preprint (arXiv:2304.04221).
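To make the distinction between linear association and agreement concrete, here is a minimal Python sketch that computes both coefficients from paired values. It uses the standard population-moment form of Lin's CCC, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function names and the toy data are illustrative assumptions, not taken from the talk or the cited paper, and the sketch does not implement the maximum agreement predictors themselves.

```python
import numpy as np


def pearson_cc(x, y):
    """Pearson correlation coefficient: degree of linear association."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient: degree of agreement.

    Uses population (divide-by-n) moments throughout.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    return float(2.0 * cov_xy / (x.var() + y.var() + (mean_x - mean_y) ** 2))


# Toy data (hypothetical): y is x shifted by a constant, so the two are
# perfectly linearly related but do not agree in value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x + 2.0

print(pearson_cc(x, y))  # approximately 1: perfect linear association
print(lin_ccc(x, y))     # 0.5: the constant shift is penalized
```

In this toy example the constant shift leaves the PCC at 1 while cutting the CCC to one half, which is the kind of gap between association and agreement the abstract refers to.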

When: Thursday, October 5, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Hongzhe Li, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania

Abstract: This talk considers estimation and prediction for high-dimensional linear regression models in transfer learning, using samples from the target model as well as auxiliary samples from different but possibly related models. When the set of "informative" auxiliary samples is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. When sample informativeness is unknown, a data-driven procedure for transfer learning, called Trans-Lasso, is proposed, and its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer are established. A related method, Trans-CLIME, is developed for estimation and inference of high-dimensional Gaussian graphical models with transfer learning. Several applications in genomics will be presented, including prediction of gene expressions using the GTEx data and polygenic risk score prediction using GWAS data. It is shown that Trans-Lasso and Trans-CLIME lead to improved performance in gene expression prediction in a target tissue by incorporating the data from multiple different tissues as auxiliary samples.

When: Thursday, October 12, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Sameer Deshpande, Department of Statistics, University of Wisconsin-Madison

Abstract: Default implementations of Bayesian Additive Regression Trees (BART) represent categorical predictors using several binary indicators, one for each level of each categorical predictor. Regression trees built with these indicators partition the levels using a "remove one at a time" strategy. Unfortunately, an overwhelming majority of partitions of the levels cannot be built with this strategy, severely limiting BART's ability to "borrow strength" across groups of levels. We overcome this limitation with a new class of regression trees built around decision rules based on linear combinations of these indicators. Motivated by spatial applications with areal data, we introduce a further decision rule prior that partitions the areas into spatially contiguous regions by deleting edges from random spanning trees of a suitably defined network. We implemented our new regression tree priors in the flexBART package, which, compared to existing implementations, often yields improved out-of-sample predictive performance without much additional computational burden. We will conclude by describing how the flexBART implementation can be further extended to fit BART models over much more general input spaces. A preprint describing this work is available at https://arxiv.org/abs/2211.04459.

When: Tuesday, October 17, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Oh-Ran Kwon, Department of Data Sciences and Operations, University of Southern California

Abstract: The response envelope model provides substantial efficiency gains over the standard multivariate linear regression by identifying the material part of the response to the model and by excluding the immaterial part. In this talk, we propose the enhanced response envelope by incorporating a novel envelope regularization term based on a nonconvex manifold formulation. It is shown that the enhanced response envelope can yield better prediction risk than the original envelope estimator. The enhanced response envelope naturally handles high-dimensional data for which the original response envelope is not serviceable without necessary remedies. In an asymptotic high-dimensional regime where the ratio of the number of predictors over the number of samples converges to a non-zero constant, we characterize the risk function and reveal an interesting double descent phenomenon for the envelope model. A simulation study confirms our main theoretical findings. Simulations and real data applications demonstrate that the enhanced response envelope does have significantly improved prediction performance over the original envelope method, especially when the number of predictors is close to or moderately larger than the number of samples.

When: Thursday, October 26, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Joshua Loyal, Department of Statistics, Florida State University

Abstract: Latent space models (LSMs) are frequently used to model network data by embedding a network’s nodes into a low-dimensional latent space; however, choosing the dimension of this space remains a challenge. To this end, I will begin by formalizing a class of LSMs we call generalized linear network eigenmodels (GLNEMs) that can model various edge types (binary, ordinal, non-negative continuous) found in scientific applications. This model class subsumes the traditional eigenmodel by embedding it in a generalized linear model with an exponential dispersion family random component and fixes identifiability issues that hindered interpretability. Next, we propose a Bayesian approach to dimension selection for GLNEMs based on an ordered spike-and-slab prior that provides improved dimension estimation and satisfies several appealing theoretical properties. In particular, we show that the model’s posterior concentrates on low-dimensional models near the truth. I will demonstrate our approach’s consistent dimension selection on simulated networks with various edge types. Lastly, I will use GLNEMs to study the effect of covariates on the formation of networks from biology, ecology, and economics and the existence of residual latent structure.

When: Thursday, November 2, 2023 – 2:50 p.m.-3:50 p.m.
Where: Zoom (e-mail Ray Bai for the link)

Speakers: Dr. Yao Li, Department of Statistics and Operations Research, University of North Carolina-Chapel Hill

Abstract: Federated learning is a framework for training machine learning models from multiple local data sets without access to the data in its aggregate. Instead, a shared model is jointly learned through an interactive process between the server and clients that combines locally known model gradients or weights. However, the lack of data transparency naturally raises concerns about model security. Recently, several state-of-the-art backdoor attacks have been proposed, which achieve high attack success rates while simultaneously being difficult to detect, leading to compromised federated learning models. In this project, motivated by differences in the outputs of models trained with and without the presence of backdoor attacks, we propose a defense method, TAG, that can prevent backdoor attacks from influencing the model while maintaining the accuracy of the original classification task. TAG leverages a small validation data set to estimate the largest change a benign user's local training can make to the shared model, which can be used as a cutoff for returning user models. Experimental results on multiple data sets show that TAG defends against backdoor attacks even when 40 percent of the user submissions to update the shared model are malicious.

When: Thursday, November 9, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Bernhard Klingenberg, Division of Natural Sciences, New College of Florida

Abstract: By writing the variance of the Mantel-Haenszel estimator under the null of homogeneity, we arrive at an improved confidence interval for the common risk difference in stratified 2x2 tables. This interval outperforms a variety of other intervals currently recommended in the literature and implemented in software. We also discuss a score-type confidence interval that allows incorporating strata/study weights. Both of these intervals work very well under many scenarios common in stratified trials or in a meta-analysis, including situations with a mixture of both small and large strata sample sizes, unbalanced treatment allocation, or rare events. The new interval has the advantage that it is available in closed form with a simple formula.
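As background for the estimator named in this abstract, the following Python sketch computes the classical Mantel-Haenszel point estimate of a common risk difference across strata, using the textbook weights n1·n2 / (n1 + n2). It is only the point estimator: the improved variance and the confidence intervals discussed in the talk are not reproduced here, and the function name and the counts are hypothetical.

```python
def mh_risk_difference(strata):
    """Mantel-Haenszel point estimate of a common risk difference.

    `strata` is an iterable of (x1, n1, x2, n2) tuples: events and sample
    size in group 1, then events and sample size in group 2, per stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for x1, n1, x2, n2 in strata:
        weight = n1 * n2 / (n1 + n2)            # Mantel-Haenszel stratum weight
        numerator += weight * (x1 / n1 - x2 / n2)
        denominator += weight
    return numerator / denominator


# Hypothetical counts for two strata; both have a risk difference of 0.1,
# so the weighted estimate is 0.1 up to floating-point rounding.
example_strata = [(10, 50, 5, 50), (30, 100, 20, 100)]
print(mh_risk_difference(example_strata))
```

Because each stratum's weight depends only on its group sizes, small strata contribute little to the pooled estimate, which is one reason the estimator is commonly used when strata sizes are mixed.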

When: Thursday, November 16, 2023 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Chuan-Fa Tang, Department of Mathematical Sciences, University of Texas at Dallas

Abstract: In this talk, we propose nonparametric methods for comparing distributions under uniform stochastic ordering (USO) based on the ordinal dominance curve. We start with equality tests for multiple distributions ordered in USO. Then, we check if multiple distributions are collected in USO. Lastly, we distinguish distributions significantly different from the others. Asymptotic properties and supportive numeric evidence for the proposed methods will be provided. We apply our inferential methods to a biomarker, microfibrillar-associated protein 4, and assess its potential as a diagnostic for fibrosis stages in hepatitis C patients.

When: Thursday, November 30, 2023 – 2:50 p.m.-3:50 p.m.
Where: Zoom (e-mail Ray Bai for the link)

Speakers:

- Chong Ma (PhD '18), Senior Manager, Biomarker Biostatistics, Moderna

- Bereket Kindo (PhD '16), Lead Data Scientist, Humana

- Nick Tice (MS '22), Football Analytics Assistant, Arizona Cardinals

- Yawei Liang (PhD '19), Machine Learning Engineer, Meta

When: Thursday, February 22, 2024 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Jessie Jeng, Department of Statistics, North Carolina State University

Abstract: Polygenic risk score (PRS) estimates an individual's genetic predisposition for a trait by aggregating variant effects across the genome. However, mismatches in ancestral background between base and target data sets are common and can compromise the accuracy of PRS analysis. In response, we propose a transfer learning framework that capitalizes on state-of-the-art high-dimensional statistical inference methods. This framework consists of two key steps: (1) false negative control (FNC) marginal screening to extract useful knowledge from base data, and (2) joint model training to integrate knowledge with target data for accurate prediction. Our FNC screening method efficiently retains a high proportion of signal variants in base data with statistical guarantees under arbitrary covariance dependence between variants. This new transfer learning framework provides a novel solution for PRS analysis with mismatched ancestral backgrounds, improving prediction accuracy and facilitating efficient joint-model training.

When: Thursday, March 14, 2024 – 2:50 p.m.-3:50 p.m.
Where: Zoom (email Dr. Ray Bai for the link)

Speakers: Dr. Veronika Rockova, University of Chicago Booth School of Business

Abstract: Bayesian predictive inference provides a coherent description of entire predictive uncertainty through predictive distributions. We examine several widely used sparsity priors from the predictive (as opposed to estimation) inference viewpoint. Our context is estimating a predictive distribution of a high-dimensional Gaussian observation with a known variance but an unknown sparse mean under the Kullback-Leibler loss. First, we show that LASSO (Laplace) priors are incapable of achieving rate-optimal performance. This new result contributes to the literature on negative findings about Bayesian LASSO posteriors. However, by deploying the Laplace prior inside the Spike-and-Slab framework (for example with the Spike-and-Slab LASSO prior), rate-minimax performance can be attained with properly tuned parameters (depending on the sparsity level s_n). We highlight the discrepancy between prior calibration for the purpose of prediction and estimation. Going further, we investigate popular hierarchical priors which are known to attain adaptive rate-minimax performance for estimation. Whether or not they are rate-minimax also for predictive inference has, until now, been unclear. We answer affirmatively by showing that hierarchical Spike-and-Slab priors are adaptive and attain the minimax rate without the knowledge of s_n. This is the first rate-adaptive result in the literature on predictive density estimation in sparse setups. This finding celebrates the benefits of fully Bayesian inference.

When: Thursday, March 21, 2024 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Jeff Miller, Department of Biostatistics, Harvard University

Abstract: Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To improve reproducibility, an easy-to-use and widely applicable approach is to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, indicating that it is not internally coherent under misspecification, whereas the bagged posterior typically satisfies the bound. We demonstrate on simulated and real data.

When: Thursday, March 28, 2024 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Alexander McLain, Department of Epidemiology and Biostatistics, University of South Carolina

Abstract: Bayesian variable selection methods are powerful techniques for fitting and inferring on sparse high-dimensional linear regression models. However, many are computationally intensive or require restrictive prior distributions on model parameters. In this talk, I will discuss various recent methods that utilize a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression problems. Their overarching computational strategies can be classified as a quasi-Parameter-Expanded Expectation-Conditional-Maximization (PX-ECM) algorithm that results in maximum a posteriori (MAP) estimation of parameters. The PX-ECM results in a robust computationally efficient coordinate-wise optimization which – when updating the coefficient for a particular predictor – adjusts for the impact of other predictor variables. We use minimal prior assumptions on the parameters through plug-in empirical Bayes estimates of hyperparameters – an approach motivated by the popular two-group approach to multiple testing. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm, which we have applied to sparse high-dimensional linear regression and extended to heterogeneous error variances and linear mixed models. Simulation studies and real data analyses will demonstrate the properties and applications of the methods.

When: Thursday, April 18, 2024 – 2:50 p.m.-3:50 p.m.
Where: LeConte 224

Speakers: Dr. Regina Liu, Department of Statistics, Rutgers University

Abstract: We present a general framework for prediction in which a prediction is in the form of a distribution function, called ‘predictive distribution function’. This predictive distribution function is well suited for prescribing the notion of confidence under the frequentist interpretation and providing meaningful answers for prediction-related questions. Its very form of a distribution function also makes it a particularly useful tool for quantifying uncertainty in prediction. A general approach under this framework is formulated and illustrated using the so-called confidence distributions (CDs). This CD-based prediction approach inherits many desirable properties of CD, including its capacity to serve as a common platform for directly connecting the existing procedures of predictive inference in Bayesian, fiducial and frequentist paradigms. We discuss the theory underlying the CD-based predictive distribution and related efficiency and optimality. We also propose a simple yet broadly applicable Monte-Carlo algorithm for implementing the proposed approach. This concrete algorithm together with the proposed definition and associated theoretical development provide a comprehensive statistical inference framework for prediction. Finally, the approach is illustrated by simulation studies and a real project on predicting the application submissions to a government agency. The latter shows the applicability of the proposed approach to even dependent data settings.

This is joint work with Jieli Shen, Goldman Sachs, and Minge Xie, Rutgers University.

When: Friday, April 19, 2024 – 3:15 p.m.-4:15 p.m.
Where: Bates West Social Room, USC (1423 Whaley St, Columbia, SC 29201)

Speakers: Dr. Regina Liu, Department of Statistics, Rutgers University

Abstract: Advanced data acquisition technology nowadays has often made inferences from diverse data sources easily accessible. Fusion learning refers to combining inferences from multiple sources or studies to make more effective overall inference. We focus on the tasks: 1) Whether/When to combine inferences? 2) How to combine inferences efficiently? 3) How to combine inference to enhance the inference of a target study?

We present a general framework for nonparametric and efficient fusion learning for inference on multi-parameters, which may be correlated. The main tool underlying this framework is the new notion of depth confidence distribution (depth-CD), which is developed by combining data depth, bootstrap and confidence distributions. We show that a depth-CD is an omnibus form of confidence regions, whose contours of level sets shrink toward the true parameter value, and thus an all-encompassing inferential tool. The approach is shown to be efficient, general and robust. It readily applies to studies from a broad range of complex settings, including heterogeneous and irregular settings. This property also enables the approach to utilize indirect evidence from incomplete studies to gain efficiency for the overall inference. The approach is demonstrated with simulation studies and real applications in tracking aircraft landing performance in aviation safety analysis and in financial forecasting.

Department of Statistics

2023 – 2024 Department of Statistics Colloquium Speakers
