
Department of Statistics


The colloquia listed here are presented by visiting academic researchers, members of the business community, as well as by USC faculty and graduate students. The research topics introduced by the speakers delve into all areas of statistics.

Faculty, students, and off-campus visitors are invited to attend any of our colloquia and Palmetto Lecture Series.

2021 – 2022 Department of Statistics Colloquium Speakers 

When: Thursday, September 23, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. David Matteson, Department of Statistics and Data Science and Department of Social Statistics, Cornell University

Abstract: We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building on a global–local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence between the local scale parameters. The resulting processes inherit the desirable shrinkage behavior of popular global–local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modeling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya–gamma scale mixture representation of the process proposed. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than do competing methods, and we apply the model for irregular curve fitting of minute‐by‐minute Twitter central processor unit usage data. In addition, we develop an adaptive time varying parameter regression model to assess the efficacy of the Fama–French five‐factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that, with the exception of the market risk, no other risk factors are significant except for brief periods. If time permits, we will also highlight extensions to change point analysis and adaptive outlier detection. 

When: Thursday, September 30, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. Sayar Karmakar, Department of Statistics, University of Florida

Abstract: Since their inception, the ARCH model (Engle, 1982) and its generalization, the GARCH model (Bollerslev, 1986), have remained the preferred models for analyzing financial datasets. To econometricians, the log-returns are more important than the actual value of the stock; with mean-zero data in hand, all interesting dynamics of the datasets are transferred to the variance, or volatility, of the data conditional on past observations. In this talk, we first focus on formally introducing these conditional heteroscedastic (CH) models. Then we steer the talk towards three of my recent works on these models. In the first one, we briefly touch upon simultaneous inference for time-varying ARCH-GARCH models. Addressing the concern that a large sample size is needed, we next introduce a Bayesian estimation method for the same models that has a valid posterior contraction rate and is easy to compute. In the last part of this talk, we discuss a new model-free technique for time-aggregated forecasts for CH datasets. Time permitting, we discuss some ongoing work on the GARCHX model, where the inclusion of covariates can yield some forecasting gains.
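
For readers new to conditional heteroscedastic models, the following is a minimal sketch of the standard GARCH(1,1) variance recursion, in which the conditional variance equals omega + alpha * (previous squared return) + beta * (previous conditional variance). It is not taken from the talk; the function name and the parameter values are illustrative assumptions.

```python
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n mean-zero returns from a GARCH(1,1) process.

    Parameter values are illustrative assumptions; alpha + beta < 1
    keeps the unconditional variance finite.
    """
    rng = random.Random(seed)
    # Start at the unconditional variance omega / (1 - alpha - beta).
    var = omega / (1.0 - alpha - beta)
    returns, variances = [], []
    for _ in range(n):
        # Return = conditional standard deviation times a standard normal shock.
        eps = (var ** 0.5) * rng.gauss(0.0, 1.0)
        returns.append(eps)
        variances.append(var)
        # Conditional variance for the next period depends on the past
        # squared return (ARCH term) and the past variance (GARCH term).
        var = omega + alpha * eps ** 2 + beta * var
    return returns, variances


returns, variances = simulate_garch11(1000)
print(len(returns), min(variances) > 0)
```

Setting beta to zero reduces the recursion to an ARCH(1) model, which is the sense in which GARCH generalizes ARCH.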

When: Thursday, October 14, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Brian Habing, Department of Statistics, University of South Carolina

Abstract: The National Institute of Statistical Sciences (NISS) runs a wide variety of seminars, meetings, and workshops.  The office in Washington DC also provides services to a variety of government agencies.  One of these is the running of expert panels for the National Center for Education Statistics in the US Department of Education.  This past academic year the three panels were "Post COVID Surveys", "Innovative Graphics for NCES Online Reports", and "Setting Priorities for Federal Data Access to Expand The Context for Education Data".  This talk will go over some of the topics discussed in these panels - both the statistical and non-statistical - with some thoughts on statistics education at the graduate level.

When: Thursday, October 21, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. Sandra Safo, Division of Biostatistics, University of Minnesota 

Abstract: Classification methods that leverage the strengths of multiple molecular data (e.g., genomics, proteomics) and a clinical outcome (e.g., disease status) simultaneously have more potential to reveal molecular relationships and functions associated with a clinical outcome. Most existing methods for joint association and classification (i.e., one-step methods) have focused on linear relationships, but the relationships among multiple molecular data, and between molecular data and an outcome are too complicated to be solely understood by linear methods. Further, existing nonlinear one-step methods do not allow for feature selection, a key factor for having explainable models. We propose Deep IDA to address limitations in existing works. Deep IDA learns nonlinear projections of two or more molecular data (or views) that maximally associate the views and separate the classes in each view, and permits feature selection for interpretable models. Our framework for feature ranking is general and adaptable to many deep learning methods and has potential to yield explainable deep learning models for associating multiple views. We applied Deep IDA to RNA sequencing, proteomics, and metabolomics data with the goal of identifying biomolecules that separate patients with and without COVID-19 who were admitted or not admitted to the intensive care unit. Our results demonstrate that Deep IDA works well, compares favorably to other linear and nonlinear multi-view methods, and is able to identify important molecules that facilitate an understanding of the molecular architecture of COVID-19 disease severity.

When: Thursday, October 28, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Ray Bai, Department of Statistics, University of South Carolina

Abstract: Many studies have reported associations between later-life cognition and socioeconomic position throughout the life course. The majority of these studies, however, are unable to quantify how these associations vary over time or with respect to several demographic factors. Varying coefficient (VC) models, which treat the coefficients in a linear model as nonparametric functions of additional effect modifiers, offer an appealing way to overcome these limitations. Unfortunately, state-of-the-art VC modeling methods require computationally prohibitive hand-tuning or make restrictive assumptions about the functional form of the coefficients. In response, we propose VCBART, which estimates each varying coefficient using Bayesian Additive Regression Trees. On both simulated and real data, with simple default hyperparameters, VCBART outperforms existing methods in terms of coefficient estimation and prediction. Theoretically, we show that the VCBART posterior concentrates around the true varying coefficients at the near-minimax rate.

Using VCBART, we predict the cognitive trajectories of 3,739 subjects from the Health and Retirement Study using multiple measures of socioeconomic position throughout the life course. We find that the association between later-life cognition and educational attainment does not vary substantially with age, but there is evidence of a persistent benefit of education after adjusting for later-life health. In contrast, the association between later-life cognition and health conditions like diabetes is negative and time-varying.

When: Thursday, November 4, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Candace Berrett, Department of Statistics, Brigham Young University 

Abstract: Changes to the environment surrounding a temperature measuring station can cause local changes to the recorded temperature that deviate from regional temperature trends. This phenomenon -- often caused by construction or urbanization -- occurs at a local level. If these local changes are assumed to represent regional or global processes, it can have significant impacts on historical data analyses. These changes or deviations are generally gradual, but can be abrupt, and arise as construction or other environmental changes occur near a recording station. We propose a methodology to examine if changes in temperature trends at a point in time exist at a local level at various locations in a region. Specifically, we propose a Bayesian change point model for spatio-temporally dependent data where we select the number of change points at each location using a "forwards" selection process using deviance information criterion (DIC). We then fit the selected model and examine the linear slopes across time to quantify the local changes in long-term temperature behavior. We show the utility of this model and method using a synthetic data set and observed temperature measurements from eight stations in Utah consisting of daily temperature data for 60 years.

When: Thursday, November 11, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Jesse Frey, Department of Mathematics and Statistics, Villanova University 

Abstract: There are several existing methods for finding a confidence interval for a proportion using balanced ranked-set sampling (RSS).  We improve on these methods, working in two directions.  First, we develop improved exact confidence intervals for cases where perfect rankings can be assumed.  We do this by using the method of Blaker (2000) rather than the method of Clopper and Pearson (1934) and by working with the MLE for the unknown proportion rather than with the total number of successes.  The resulting intervals outperform the existing intervals and also come close to a new theoretical lower bound on the expected length for RSS-based exact confidence intervals.  Second, we develop approximate confidence intervals that are robust to imperfect rankings.  One of the methods we develop is a mid-P-adjusted interval based on the Clopper and Pearson (1934) method, and the other is based on the Wilson (1927) method.  We compare the intervals in terms of coverage and length, finding that both have their advantages.  Both of these robust methods rely on a new maximum-likelihood-based method for estimating in-stratum proportions when the overall proportion is known.  We also report on investigations into using adjusted Wald intervals in the imperfect rankings case. 
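
As background for the abstract above, here is a minimal sketch of the classical Wilson (1927) score interval for a proportion under simple random sampling. It is not the ranked-set-sampling version developed in the talk; the function name and the default critical value z = 1.96 (roughly 95% coverage) are illustrative assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumes simple random sampling, not ranked-set sampling.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z ** 2 / n
    # The center is pulled from p_hat toward 1/2.
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(50, 100)
print(round(low, 3), round(high, 3))
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] and remains informative when the observed count is 0 or n.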

When: Thursday, March 24, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Ching-Ti Liu, Department of Biostatistics, Boston University

Abstract: TBD

When: TBD
Where: TBD

Speaker: Dr. Karen Messer, Professor and Division Chief in the Department of Family Medicine and Public Health, Division of Biostatistics & Bioinformatics, UC San Diego

Abstract: TBD



2020 – 2021 Department of Statistics Colloquium Speakers 

When: Thursday, September 17, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Jonathan Schildcrout, Vanderbilt University

Abstract: Response-selective sampling (RSS) designs are efficient when response and covariate data are available, but exposure ascertainment costs limit sample size. In longitudinal studies with a continuous outcome, RSS designs can lead to efficiency gains by using low-dimensional summaries of individual longitudinal trajectories to identify the subjects in whom the expensive exposure variable will be collected. Analyses can be conducted using the outcome, confounder and target exposure data from sampled subjects, or they can combine complete data from the sampled subjects with the partial data (i.e., outcome and confounder only) from the unsampled subjects. In this talk, we will discuss designs for longitudinal and multivariate longitudinal data, and an imputation approach for outcome-dependent sampling (ODS) designs that can be more efficient than the complete case analysis, may be easier to implement than full-likelihood approaches, and is more broadly applicable than other imputation approaches. We will examine finite-sample operating characteristics of the imputation approach under several ODS designs and longitudinal data features.

When: Thursday, October 22, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Bereket Kindo, Humana Inc.

Abstract: We first, using examples, explore the types of business problems statistical concepts help solve in the insurance and healthcare industry. The purpose will be to showcase the usefulness of different statistical concepts such as design of experiments, development of statistical models for business insight generation, and development of statistical models for prediction in solving business problems.

Secondly, we discuss key topics that, if built into graduate program curricula, would better prepare students for a career in industry. Examples include teaching ethical and societal implications of statistical models implemented in the “real world”, and statistical communication and computing skills needed to collaboratively work with non-statistician colleagues. The purpose is to spur conversations of whether and how (more of) these topics should be embedded in statistics program curricula.

We will then leave time for questions either related to the topics/examples discussed or for general questions of how statistical concepts are applied in industry.

When: Thursday, November 05, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Xuan Cao, University of Cincinnati


When: Thursday, November 19, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Robert Lyles, Emory University

Abstract: When feasible, the application of serial principled sampling designs for testing provides an arguably ideal approach to monitoring prevalence and case counts for infectious diseases. Given the logistics involved and required resources, however, surveillance efforts can benefit from any available means to improve the precision of sampling-based estimates. In this regard, one option is to augment the analysis with data from other available surveillance streams that may be highly non-representative but that identify cases from the population of interest over the same timeframe. We consider the monitoring of a closed population (e.g., a long-term care facility, inpatient hospital setting, university, etc.), and encourage the use of capture-recapture (CRC) methodology to produce an alternative case total and/or prevalence estimate to the one obtained by principled sampling. With care in its implementation, the random or representative sample not only provides its own valid estimate, but justifies another based on classical CRC methods. We initially consider weighted averages of the two estimators as a way to enhance precision, and then propose a single CRC estimator that offers a unified and arguably preferable solution. We illustrate using artificial data from a hypothetical daily COVID-19 monitoring program applied in a closed community, along with simulation studies to explore the benefits of the proposed estimators and the properties of confidence and credible intervals for the case total. A setting in which some of the tests applied to the community are imperfect with known sensitivity and specificity is briefly considered.
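
To make the capture-recapture idea concrete, the sketch below computes the two classical two-source estimators of a closed population's case total: Lincoln-Petersen and Chapman's bias-corrected variant. These are textbook estimators, not the unified CRC estimator proposed in the talk, and the function names and counts are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Classical two-source estimate of the total number of cases.

    n1: cases identified by the first source (e.g., the random sample)
    n2: cases identified by the second source (e.g., another surveillance stream)
    m:  cases identified by both sources
    """
    if m <= 0:
        raise ValueError("Lincoln-Petersen is undefined when no cases overlap")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts for illustration only.
print(lincoln_petersen(50, 40, 10))
print(round(chapman(50, 40, 10), 2))
```

Both estimators rely on the two sources identifying cases independently, which is the condition the abstract says a random or representative sample helps justify.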

When: Thursday, March 11, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Joseph Antonelli, University of Florida 

Abstract: Communities often self-select into implementing a regulatory policy, and adopt the policy at different time points. In New York City, neighborhood policing was adopted at the police precinct level over the years 2015-2018, and it is of interest to both (1) evaluate the impact of the policy, and (2) understand what types of communities are most impacted by the policy, raising questions of heterogeneous treatment effects. We develop novel statistical approaches that are robust to unmeasured confounding bias to study the causal effect of policies implemented at the community level. Using techniques from high-dimensional Bayesian time-series modeling, we estimate treatment effects by predicting counterfactual values of what would have happened in the absence of neighborhood policing. We couple the posterior predictive distribution of the treatment effect with flexible modeling to identify how the impact of the policy varies across time and community characteristics. Using pre-treatment data from New York City, we show our approach produces unbiased estimates of treatment effects with valid measures of uncertainty. Lastly, we find that neighborhood policing decreases discretionary arrests, but has little effect on crime or racial disparities in arrest rates.

When: Thursday, March 25, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Dan Nordman, Iowa State University

Abstract: The talk overviews a prediction problem encountered in reliability engineering, where a need arises to predict the number of future events (e.g., failures) among a cohort of units associated with a time-to-event process.  Examples include the prediction of warranty returns or the prediction of the number of product failures that could cause serious harm.  Important decisions, such as a product recall, are often based on such predictions.  The data consist of a collection of units observed over time where, at some freeze point in time, inspection then ends, leaving some units with realized event times while other units have not yet experienced events.  In other words, data are right-censored, where (crudely expressed) some units have "died" by the freeze time while other units continue to "survive."  The problem becomes predicting how many of the "surviving" units will have events (i.e., "death") in a next future interval of time, as quantified in terms of a prediction bound or interval.  Because all units belong to the same data set, either by providing direct information (i.e., observed event times) or by becoming the subject of prediction (i.e., censored event times), such predictions are called within-sample predictions and differ from other prediction problems considered in most literature.  A standard plug-in (also known as estimative) prediction approach turns out to be invalid for this problem (i.e., for even large amounts of data, the method fails to have correct coverage probability).  However, several bootstrap-based methods can provide valid prediction intervals. To this end, a commonly used prediction calibration method is shown to be asymptotically correct for within-sample predictions, and two alternative predictive-distribution-based methods are presented that perform better than the calibration method.  

When: Thursday, April 1, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

When: Part I: Thursday, April 8, 2021—2:50 p.m. to 3:50 p.m.; Part II: Friday, April 9, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Michael Kosorok, Distinguished Professor of Biostatistics and Professor of Statistics and Operations Research at UNC-Chapel Hill

Abstract: Precision health is the science of data-driven decision support for improving health at the individual and population levels. This includes precision medicine and precision public health and circumscribes all health-related challenges which can benefit from a precision operations approach. This framework strives to develop study design, data collection and analysis tools to discover empirically valid solutions to optimize outcomes in both the short and long term. This includes incorporating and addressing heterogeneity across persons, time, location, institutions, communities, key stakeholders, and other contexts. In these two lectures, we review some recent developments in this area, imagine what the future of precision health can look like, and outline a possible path to achieving it.  Several applications in a number of disease areas will be examined.

When: Thursday, April 15, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Shili Lin, Ohio State University

Abstract: Advances in molecular technologies have revolutionized research in genomics in the past two decades. Among them is Hi-C, a molecular assay coupled with the Next Generation Sequencing (NGS) technology, which enables genome-wide detection of physical chromosomal contacts. The availability of data generated from Hi-C makes it possible to recapitulate the underlying three-dimensional (3D) spatial chromatin structure of a genome. However, several special features (including dependency and over-dispersion) of the NGS-based high-throughput data have posed challenges for statistical modeling and inference. As the technology is further developed from bulk to single cells, enabling the study of cell-to-cell variation, further complications arise due to extreme data sparsity — as a result of insufficient sequencing depth — leading to a great deal of dropouts (e.g. observed zeros) that are confounded with true structural zeros. In this talk, I will discuss tREX, a random effect modeling strategy that specifically addresses the special features of Hi-C data. I will further discuss SRS, a self-representation smoothing methodology for identifying structural zeros and imputing dropouts, and for overall data quality improvement. Simulation studies and real data applications will be described to illustrate the methods.


2019 – 2020 Department of Statistics Colloquium Speakers 

John Grego: Brief encounters with the tidyverse

Karl Gregory: How and why to make R packages

Yen-Yi Ho: Using high-performance computing clusters managed by research computing (RCI) at USC

Minsuk Shin: A natural combination of R and Python using the reticulate R package

Dewei Wang: Website construction (an easy-to-use html editor: KompoZer)

Check Dr. Grego's folder for all lightning talks

When: Thursday, September 19, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Brook Russell, Clemson University

Abstract: Hurricane Harvey brought extreme levels of rainfall to the Houston, TX area over a seven-day period in August 2017, resulting in catastrophic flooding that caused loss of human life and damage to personal property and public infrastructure. In the wake of this event, there has been interest in understanding the degree to which this event was unusual and estimating the probability of experiencing a similar event in other locations. Additionally, researchers have aimed to better understand the ways in which the sea surface temperature (SST) in the Gulf of Mexico (GoM) is associated with precipitation extremes in this region. This work addresses all of these issues through the development of a multivariate spatial extreme value model.

Our analysis indicates that warmer GoM SSTs are associated with higher precipitation extremes in the western Gulf Coast (GC) region during hurricane season, and that the precipitation totals observed during Hurricane Harvey are less unusual based on the warm GoM SST in 2017. As SSTs in the GoM are expected to steadily increase over the remainder of this century, this analysis suggests that western GC locations may experience more severe precipitation extremes during hurricane season.

When: Thursday, October 3, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ana-Maria Staicu, North Carolina State University

Abstract: We consider longitudinal functional regression, where for each subject the response consists of multiple curves observed at different time visits. We discuss tests of significance in two general settings. First we focus on the bivariate mean function and develop tests to formally assess that it is time-invariant. Second, in the presence of an additional covariate, we develop hypothesis testing to assess the significance of the covariate's time-varying effect. The procedures account for the complex dependence structure of the response and are computationally efficient. Numerical studies confirm that the testing approaches have the correct size and have superior power relative to available competitors. The methods are illustrated on a data application.

When: Thursday, October 31, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Robert Krafty, University of Pittsburgh

Abstract: Many studies collect functional data from multiple subjects that have both multilevel and multivariate structures. An example of such data comes from popular neuroscience experiments where participants' brain activity is recorded using modalities such as EEG or fMRI and summarized as power within multiple time-varying frequency bands at multiple brain regions. An important question is summarizing the joint variation across multiple frequency bands for both whole-brain variability between subjects, as well as location-variation within subjects. In this talk, we discuss a novel approach to conducting interpretable principal components analysis on multilevel multivariate functional data that decomposes total variation into subject-level and replicate-within-subject-level variation, and provides interpretable components that can be both sparse among variates and have localized support over time within each frequency band. The sparsity and localization of components is achieved by solving an innovative rank-one based convex optimization problem with block Frobenius and matrix L1-norm based penalties. The method is used to analyze data from a study to better understand blunted affect, revealing new neurophysiological insights into how subject- and electrode-level brain activity are connected to the phenomenon of trauma patients "shutting down" when presented with emotional information.

When: Thursday, November 14, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Paramita Sarah Chaudhuri, McGill University

Abstract: In this talk, I will introduce a pooling design and likelihood-based estimation technique for estimating parameters of a Cox Proportional Hazards model. To consistently estimate hazard ratios (HRs) with a continuous time-to-event outcome and a continuous exposure that is subject to pooling, I first focus on a riskset-matched nested case-control (NCC) subcohort of the original cohort. Refocusing on the NCC subcohort, in turn, allows one to use pooled exposure levels to estimate HRs. This is done by first recognizing that the likelihood for estimating HRs from an NCC design is similar to a likelihood for estimating odds ratios (ORs) from a matched case-control design. Secondly, pooled exposure levels from a matched case-control design can be used to consistently estimate ORs. Recasting the original problem within an NCC subcohort enables us to form riskset-matched strata, randomly form pooling groups within the matched strata and pool the exposures correspondingly, without explicitly considering event times in the estimation. Pooled exposure levels can then be used within a standard maximum likelihood framework for matched case-control design to consistently estimate HR and obtain the model-based standard error.  Confounder effects can also be estimated based on pooled confounder levels. I will share simulation results and analysis of a real dataset to demonstrate a practical application of this approach.

When: Thursday, November 21, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Fabrizio Ruggeri, the Istituto di Matematica Applicata e Tecnologie Informatiche "Enrico Magenes"

Abstract: Many complex real-world systems such as airport terminals, manufacturing processes and hospitals are modeled with networks of queues. To estimate parameters, restrictive assumptions are usually placed on these models. For instance, arrival and service distributions are assumed to be time-invariant. Violating this assumption are so-called dynamic queuing networks (DQNs) which are more realistic but do not allow for likelihood-based parameter estimation. We consider the problem of using data to estimate the parameters of a DQN.  This is the first example of Approximate Bayesian Computation (ABC) being used for parameter inference of DQNs.

We combine computationally efficient simulation of DQNs with ABC and an estimator for maximum mean discrepancy.  DQNs are simulated in a computationally efficient manner with Queue Departure Computation (a simulation technique we are proposing), without the need for time-invariance assumptions, and parameters are inferred from data without strict data-collection requirements. Forecasts are made which account for parameter uncertainty. We embed this queuing simulation within an ABC sampler to estimate parameters for DQNs in a straightforward manner. We motivate and demonstrate this work with the example of passengers arriving at the passport control in an international airport.

When: Thursday, December 05, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Amanda Hering, Baylor University

Abstract: Decentralized wastewater treatment (WWT) facilities monitor many features that are complexly related.  The ability to detect faults quickly and to accurately identify the variables that are affected by a fault is vital to maintaining proper system operation and high quality produced water. Various multivariate methods have been proposed to perform fault detection and isolation, but most methods require the data to be independent and identically distributed when the process is in control (IC) as well as a distributional assumption. We propose a distribution-free fault isolation method for nonstationary and dependent multivariate processes. We detrend the data using observations from an IC time period to account for expected changes due to external or user-controlled factors. Next, we perform fused lasso, which penalizes differences in consecutive observations, to detect faults and identify affected variables. To account for autocorrelation, the regularization parameter is chosen using an estimated effective sample size in the Extended Bayesian Information Criterion. We demonstrate the performance of our method compared to a state-of-the-art competitor in a simulation study. Finally, we apply our method to WWT facility data with a known fault, and the affected variables identified by our proposed method are consistent with the operators' diagnosis of the cause of the fault.


When: Tuesday, January 14, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ray Bai, University of Pennsylvania

Abstract: Nonparametric varying coefficient (NVC) models are widely used for modeling time-varying effects on responses that are measured repeatedly. In this talk, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian estimation and variable selection in NVC models. The NVC-SSL simultaneously selects and estimates the functionals of the significant time-varying covariates, while also accounting for temporal correlations. Our model can be implemented using a highly efficient expectation-maximization (EM) algorithm, thus avoiding the computational intensiveness of Markov chain Monte Carlo (MCMC) in high dimensions. In contrast to frequentist NVC models, hardly anything is known about the large-sample properties for Bayesian NVC models. In this talk, we take a step towards addressing this longstanding gap between methodology and theory by deriving posterior contraction rates under the NVC-SSL model when the number of covariates grows at nearly exponential rate with sample size. Finally, we introduce a simple method to make our method robust to misspecification of the temporal correlation structure. We illustrate our methodology through simulation studies and data analysis.

When: Wednesday, January 15, 2020—4:30 p.m. to 5:00 p.m.
Where: LeConte 210

Speaker: Jessica Gronsbell, Verily Life Sciences

Abstract: The widespread adoption of electronic health records (EHR) and their subsequent linkage to specimen biorepositories has generated massive amounts of routinely collected medical data for use in translational research.  These integrated data sets enable real-world predictive modeling of disease risk and progression.  However, data heterogeneity and quality issues impose unique analytical challenges to the development of EHR-based prediction models.  For example, ascertainment of validated outcome information, such as presence of a disease condition or treatment response, is particularly challenging as it requires manual chart review.  Outcome information is therefore only available for a small number of patients in the cohort of interest, unlike the standard setting where this information is available for all patients.  In this talk I will discuss semi-supervised and weakly-supervised learning methods for predictive modeling in such constrained settings where the proportion of labeled data is very small.  I demonstrate that leveraging unlabeled examples can improve the efficiency of model estimation and evaluation and in turn substantially reduce the amount of labeled data required for developing prediction models. 

When: Friday, January 17, 2020—4:30 p.m. to 5:00 p.m.
Where: LeConte 210

Speaker: Nicholas Syring, Washington University

Abstract: Bayesian methods provide a standard framework for statistical inference in which prior beliefs about a population under study are combined with evidence provided by data to produce revised posterior beliefs. As with all likelihood-based methods, Bayesian methods may present drawbacks stemming from model misspecification and over-parametrization. A generalization of Bayesian posteriors, called Gibbs posteriors, links the data and population parameters of interest via a loss function rather than a likelihood, thereby avoiding these potential difficulties. At the same time, Gibbs posteriors retain the prior-to-posterior updating of beliefs. We will illustrate the advantages of Gibbs methods in examples and highlight newly developed strategies to analyze the large-sample properties of Gibbs posterior distributions.


When: Tuesday, January 21, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Yixuan Qiu, Purdue University and Carnegie Mellon University

Abstract: Sparse principal component analysis (PCA) is an important technique for dimensionality reduction of high-dimensional data. However, most existing sparse PCA algorithms are based on non-convex optimization, which provides little guarantee on the global convergence. Sparse PCA algorithms based on a convex formulation, for example the Fantope projection and selection (FPS), overcome this difficulty, but are computationally expensive. In this work we study sparse PCA based on the convex FPS formulation, and propose a new gradient-based algorithm that is computationally efficient and applicable to large and high-dimensional data sets. Nonasymptotic and explicit bounds are derived for both the optimization error and the statistical accuracy, which can be used for testing and inference problems. We also extend our algorithm to online learning problems, where data are obtained in a streaming fashion. The proposed algorithm is applied to high-dimensional gene expression data for the detection of functional gene groups.


When: Thursday, March 05, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Jeff Gill, American University

Abstract: Unseen grouping, often called latent clustering, is a common feature in social science data.  Subjects may intentionally or unintentionally group themselves in ways that complicate the statistical analysis of substantively important relationships.

This work introduces a new model-based clustering design which incorporates two sources of heterogeneity. The first source is a random effect that introduces substantively unimportant grouping but must be accounted-for. The second source is more important and more difficult to handle since it is directly related to the relationships of interest in the data.  We develop a model to handle both of these challenges and apply it to data on terrorist groups, which are notoriously hard to model with conventional tools.


2018 – 2019 Department of Statistics Colloquium Speakers

When: Thursday, August 30, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Jun-Ying Tzeng, North Carolina State University

Abstract: Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g. biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the three-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9 associated with low-density lipoprotein in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data. [PDF]

When: Thursday, September 13, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Frank Konietschke, University of Texas at Dallas

Abstract: Existing tests for factorial designs in the nonparametric case are based on hypotheses formulated in terms of distribution functions. Typical null hypotheses, however, are formulated in terms of some parameters or effect measures, particularly in heteroscedastic settings. In this talk we extend this idea to nonparametric models by introducing a novel nonparametric ANOVA-type-statistic based on ranks which is suitable for testing hypotheses formulated in meaningful nonparametric treatment effects in general factorial designs. This is achieved by a careful in-depth study of the common distribution of rank-based estimators for the treatment effects. Since the statistic is asymptotically not a pivotal quantity we propose different approximation techniques, discuss their theoretic properties and compare them in extensive simulations together with two additional Wald-type tests. An extension of the presented idea to general repeated measures designs is briefly outlined. The proposed rank-based procedures maintain the pre-assigned type-I error rate quite accurately, also in unbalanced and heteroscedastic models. A real data example illustrates the application of the proposed methods.

When: Thursday, October 11, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aboozar Hadavand, Johns Hopkins University

Abstract: Massive Open Online Courses (MOOCs) exploded into popular consciousness in the early 2010s. Huge enrollment numbers set expectations that MOOCs might be a major disruption to the educational landscape. However, there is still uncertainty about the new types of credentials awarded upon completion of MOOCs. We surveyed close to 9,000 learners of the largest data science MOOC program to assess the economic impact of completing these programs on employment prospects. We find that completing the program that costs less than $500 led to, on average, an increase in salary of $8,230 and an increase in the likelihood of job mobility of 30 percentage points. This high return on investment suggests that MOOCs can have real economic benefits for participants.

When: Thursday, November 1, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Dulal Bhaumik, University of Illinois at Chicago

Abstract: The human brain is an amazingly complex network. Aberrant activities in this network can lead to various neurological disorders such as multiple sclerosis, Parkinson’s disease, Alzheimer’s disease and autism.  fMRI has emerged as an important tool to delineate the neural networks affected by such diseases, particularly autism.  In this talk, we will present a special type of mixed-effects model together with an appropriate procedure for controlling false discoveries to detect disrupted connectivities for developing a neural network in whole brain studies.  Results are illustrated with a large dataset known as Autism Brain Imaging Data Exchange (ABIDE) which includes 361 subjects from eight medical centers.

When: Thursday, November 15, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Yuan Wang, University of South Carolina

Abstract: Topological data analysis (TDA) extracts hidden topological features in signals that cannot be easily decoded by standard signal processing tools. A key TDA method is persistent homology (PH), which summarizes the changes of connected components in a signal through a multiscale descriptor such as persistent landscape (PL). Recent development indicates that statistical inference on PLs of scalp electroencephalographic (EEG) signals produces markers for localizing seizure foci. However, a key obstacle of applying PH to large-scale clinical EEGs is the ambiguity of performing statistical inference. To address this problem, we develop a unified permutation-based inference framework for testing statistical indifference in PLs of EEG signals before and during an epileptic seizure. Compared with the standard permutation test, the proposed framework is shown to have more robustness when signals undergo non-topological changes and more sensitivity when topological changes occur. Furthermore, the proposed new method improves the average computation time by 15,000 fold.

When: Thursday, December 6, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aiyi Liu, National Institutes of Health

Abstract: Group testing has been widely used as a cost-effective strategy to screen for and estimate the prevalence of a rare disease. While it is well-recognized that retesting is necessary for identifying infected subjects, it is not required for estimating the prevalence. However, one can expect gains in statistical efficiency from incorporating retesting results in the estimation. Research in this context is scarce, particularly for tests with classification errors. For an imperfect test we show that retesting subjects in either positive or negative groups can substantially improve the efficiency of the estimates, and retesting positive groups yields higher efficiency than retesting the same number or proportion of negative groups. Moreover, when the test is subject to no misclassification, performing retesting on positive groups still results in more efficient estimates.

When: Thursday, January 17, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Minsuk Shin, Harvard University

Abstract: In this talk, I introduce two novel classes of shrinkage priors for different purposes: functional HorseShoe (fHS) prior for nonparametric subspace shrinkage and neuronized priors for general sparsity inference. In function estimation problems, the fHS prior encourages shrinkage towards parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. I study some desirable theoretical properties including an optimal posterior concentration property on the function and the model selection consistency. I apply the fHS prior to nonparametric additive models for some simulated and real data sets, and the results show that the proposed procedure outperforms the state-of-the-art methods in terms of estimation and model selection. For general sparsity inference, I propose the neuronized priors to unify and extend existing shrinkage priors such as one-group continuous shrinkage priors, continuous spike-and-slab priors, and discrete spike-and-slab priors with point-mass mixtures. The new priors are formulated as the product of a weight variable and a transformed scale variable via an activation function. By altering the activation function, practitioners can easily implement a large class of Bayesian variable selection procedures. Compared with classic spike and slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variable, which results in more efficient MCMC algorithms and more effective posterior modal estimates. I also show that these new formulations can be applied to more general and complex sparsity inference problems, which are computationally challenging, such as structured sparsity and spatially correlated sparsity problems.

When: Thursday, January 24, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Andrés Felipe Barrientos, Duke University

Abstract: We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.

When: Tuesday, January 29, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ryan Sun, Harvard School of Public Health

Abstract: The increasing popularity of biobanks and other genetic compendiums has introduced exciting opportunities to extract knowledge using datasets combining information from a variety of genetic, genomic, environmental, and clinical sources. To manage the large number of association tests that may be performed with such data, set-based inference strategies have emerged as a popular alternative to testing individual features. However, existing methods are often challenged to provide adequate power due to three issues in particular: sparse signals, weak effect sizes, and features exhibiting a diverse variety of correlation structures. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) statistic, a set-based association test designed to detect rare and weak signals while explicitly accounting for arbitrary correlation patterns. We apply GBJ to perform inference on sets of genotypes and sets of phenotypes, and we also discuss strategies for situations where the global null is not the null hypothesis of interest.

When: Thursday, February 21, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ian Dryden, University of Nottingham

Abstract: Networks can be used to represent many systems such as text documents, social interactions and brain activity, and it is of interest to develop statistical techniques to compare networks. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and performing hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens. This is joint work with Katie Severn and Simon Preston.

When: Thursday, March 7, 2019—3:00 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Xihong Lin, Harvard University

Title: Analysis of Massive Data from Genome, Exposome and Phenome: Challenges and Opportunities

Abstract: Massive data from genome, exposome, and phenome are becoming available at an increasing rate with no apparent end in sight. Examples include Whole Genome Sequencing data, smartphone data, wearable devices, Electronic Medical Records and biobanks. The emerging field of Health Data Science presents statisticians and quantitative scientists with many exciting research and training opportunities and challenges. Success in health data science requires scalable statistical inference integrated with computational science, information science and domain science. In this talk, I discuss some of such challenges and opportunities, and emphasize the importance of incorporating domain knowledge in health data science method development and application. I illustrate the key points using several use cases, including analysis of data from Whole Genome Sequencing (WGS) association studies, integrative analysis of different types and sources of data using causal mediation analysis, analysis of multiple phenotypes (pleiotropy) using biobanks and Electronic Medical Records (EMRs), reproducible and replicable research, and cloud computing.

When: Friday, March 8, 2019—2:30 p.m. to 3:30 p.m.
Where: Capstone House Campus Room

Speaker: Xihong Lin, Harvard University

Title: Scalable Statistical Inference for Dense and Sparse Signals in Analysis of Massive Health Data

Abstract: Massive data in health science, such as whole genome sequencing data and electronic medical record (EMR) data, present many exciting opportunities as well as challenges in data analysis, e.g., how to develop effective strategies for signal detection using whole-genome data when signals are weak and sparse, and how to analyze high-dimensional phenome-wide (PheWAS) data in EMRs. In this talk, I will discuss scalable statistical inference for analysis of high-dimensional independent variables and high-dimensional outcomes in the presence of dense and sparse signals. Examples include testing for signal detection using the Generalized Higher Criticism (GHC) test, and the Generalized Berk-Jones test, the Cauchy-based omnibus for combining different correlated tests, and omnibus principal component analysis of multiple phenotypes. The results are illustrated using data from genome-wide association studies and sequencing studies.

When: Thursday, March 28, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Alessio Farcomeni, Sapienza Università di Roma

Abstract: One limitation of several longitudinal model-based clustering methods is that the number of groups is fixed over time, apart from (mostly) heuristic approaches. In this work we propose a latent Markov model admitting variation in the number of latent states at each time period. The consequence is that (i) subjects can switch from one group to another at each time period and (ii) the number of groups can change at each time period. Clusters can merge, split, or be re-arranged. For a fixed sequence of the number of groups, inference is carried out through maximum likelihood, using forward-backward recursions taken from the hidden Markov literature once an assumption of local independence is granted. A penalized likelihood form is introduced to simultaneously choose an optimal sequence for the number of groups and cluster subjects. The penalized likelihood is optimized through a novel Expectation-Maximization-Markov-Metropolis algorithm. The work is motivated by an analysis of the progress of wellbeing of nations, as measured by the three dimensions of the Human Development Index over the last 25 years. The main findings are that (i) transitions among nation clubs are scarce, and mostly linked to historical events (like dissolution of USSR or war in Syria) and (ii) there is mild evidence that the number of clubs has shrunk over time, where we have four clusters before 2005 and three afterwards. In a sense, nations are getting more and more polarized with respect to standards of well-being. Non-optimized R code for general implementation of the method, data used for the application, code and instructions for replicating the simulation studies and the real data analysis are available at

When: Thursday, April 11, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Scott Bruce, George Mason University

Abstract: The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived from manual inspection and are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this talk is to discuss a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative sum based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also dedicatedly developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum.

When: Thursday, April 25, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Xianyang Zhang, Texas A&M University

Abstract: We present new metrics to quantify and test for (i) the equality of distributions and (ii) the independence between two high-dimensional random vectors. In the first part of this work, we show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high dimensional distributions in the sense that it only detects the equality of means and a specific covariance structure in the high dimensional setup. To overcome such a limitation, we propose a new class of metrics which inherit some nice properties of the energy distance and maximum mean discrepancy in the low-dimensional setting and are capable of detecting the pairwise homogeneity of the low dimensional marginal distributions in the high dimensional setup. The key to our methodology is a new way of defining the distance between sample points in the high-dimensional Euclidean spaces. We construct a high-dimensional two sample t-test based on the U-statistic type estimator of the proposed metric. In the second part of this work, we propose new distance-based metrics to quantify the dependence between two high dimensional random vectors. In the low dimensional case, the new metrics completely characterize the dependence and inherit other desirable properties of the distance covariance. In the growing dimensional case, we show that both the population and sample versions of the new metrics behave as an aggregation of the group-wise population and sample distance covariances. Thus they can quantify group-wise non-linear and non-monotone dependence between two high-dimensional random vectors, going beyond the scope of the distance covariance based on the usual Euclidean distance and the Hilbert-Schmidt Independence Criterion, which have been recently shown only to capture the componentwise linear dependence in high dimension. We further propose a t-test based on the new metrics to perform high dimensional independence testing and study its asymptotic size and power behaviors under both the HDLSS and HDMSS setups.
