Department of Statistics

Colloquia

The colloquia listed here are presented by visiting academic researchers, members of the business community, as well as by USC faculty and graduate students. The research topics introduced by the speakers delve into all areas of statistics.

Faculty, students, and off-campus visitors are invited to attend any of our colloquia and Palmetto Lecture Series.

2022 – 2023 Department of Statistics Colloquium Speakers

When: Thursday, September 22, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. David Hitchcock, Department of Statistics, University of South Carolina, and Aimée Petitbon, Naval Information Warfare Center Atlantic, U.S. Navy

Abstract: Popular music genre preferences can be measured by consumer sales, listening habits, and critics' opinions. We analyze trends in genre preferences from 1974 through 2018 presented in annual Billboard Hot 100 charts and annual Village Voice Pazz & Jop critics' polls. We model yearly counts of appearances in these lists for eight music genres with two multinomial logit models, using various demographic, social, and industry variables as predictors. Since the counts are correlated over time, we use a partial likelihood approach to fit the models. Our models provide strong fits to the observed genre proportions and illuminate trends in the popularity of genres over the sampled years, such as the rise of country music and the decline of rock music in consumer preferences, and the rise of rap/hip-hop in popularity among both consumers and critics. We forecast the genre proportions (for consumers and critics) for 2019 using fitted multinomial probabilities constructed from forecasts of 2019 predictor values and compare our Hot 100 forecasts to observed 2019 Hot 100 proportions. We model over time the association between consumers' and critics' preferences using Cramér's measure of association between nominal variables and forecast how this association might trend in the future.
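
For readers unfamiliar with Cramér's measure of association mentioned in the abstract, the following is a minimal Python sketch of the standard formula, V = sqrt(chi-squared / (n * (min(rows, columns) - 1))). The function name `cramers_v` and the example counts are hypothetical and are not taken from the talk; the sketch assumes every row and column of the table has a positive total.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes every row and every column has a positive total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize so that V lies between 0 (no association) and 1 (perfect).
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Hypothetical counts: rows = list (consumers, critics), columns = genres.
    made_up_counts = [[40, 35, 25], [20, 30, 50]]
    print(cramers_v(made_up_counts))
```

A value near 0 would indicate that the two lists have similar genre mixes, and a value near 1 that they differ sharply.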

When: Thursday, September 29, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Drs. David Hitchcock, Xianzheng Huang, Brian Habing, Xiaoyan Lin, Yen-Yi Ho, Ray Bai, Lianming Wang, Ting Fung Ma, Joshua Tebbs, Karl Gregory, Dewei Wang, Department of Statistics, University of South Carolina

When: Thursday, October 20, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Harry Feng, Department of Population and Quantitative Health Sciences, Case Western Reserve University

Abstract: Over the past several years, both single-cell sequencing and bulk sequencing technologies have been widely adopted. In clinical settings, the input samples often contain mixtures of multiple pure cell types. The observed bulk sequencing data are thus the weighted average of signals from different sources. It is of substantial interest to use computational methods to deconvolute signals and/or infer cell types. In this talk, I will introduce my recent work in dissecting admixed transcriptome data:

1. A hybrid neural network model for single-cell RNA-seq cell type prediction.
2. A statistical method to recover individual-specific and cell-type-specific transcriptome reference panels in bulk data.

Our proposed methods will help clinical studies to identify individual-specific biomarkers and guide personalized prevention, diagnosis, and treatment of disease.
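
The statement in the abstract that bulk data are a weighted average of cell-type signals can be illustrated with a toy calculation. The Python sketch below is not the speaker's method: it assumes only two cell types with known reference profiles and recovers the mixing proportion by ordinary least squares. The function name `mixing_proportion` and all numbers are hypothetical.

```python
def mixing_proportion(bulk, ref_a, ref_b):
    """Least-squares estimate of w in bulk = w * ref_a + (1 - w) * ref_b.

    Each argument is a list of expression values, one entry per gene.
    Assumes the two reference profiles are not identical.
    """
    # Rewrite the model as (bulk - ref_b) = w * (ref_a - ref_b) and project.
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    resid = [y - b for y, b in zip(bulk, ref_b)]
    numerator = sum(r * d for r, d in zip(resid, diff))
    denominator = sum(d * d for d in diff)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up reference profiles for two cell types over three genes.
    type_a = [10.0, 0.0, 5.0]
    type_b = [0.0, 10.0, 5.0]
    # A bulk sample made of 30% type A and 70% type B.
    bulk_sample = [0.3 * a + 0.7 * b for a, b in zip(type_a, type_b)]
    print(mixing_proportion(bulk_sample, type_a, type_b))
```

Real deconvolution problems involve many cell types, noise, and unknown reference panels, which is where the methods described in the talk come in.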

When: Thursday, October 27, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Blakeley McShane, Department of Marketing, Northwestern University

Abstract: Over the last decade, large-scale replication projects across the biomedical and social sciences have reported relatively low replication rates. In these large-scale replication projects, replication has typically been evaluated based on a single replication study of some original study and dichotomously as successful or failed. However, evaluations of replicability that are based on a single study and are dichotomous are inadequate, and evaluations of replicability should instead be based on multiple studies, be continuous, and be multi-faceted. Further, such evaluations are in fact possible due to two characteristics shared by many large-scale replication projects. In this article, we provide such an evaluation for two prominent large-scale replication projects, one which replicated a phenomenon from cognitive psychology and another which replicated 13 phenomena from social psychology and behavioral economics. Our results indicate a very high degree of replicability in the former and a medium to low degree of replicability in the latter. They also suggest an unidentified covariate in each, namely ocular dominance in the former and political ideology in the latter, that is theoretically pertinent. We conclude by discussing evaluations of replicability at large, recommendations for future large-scale replication projects, and design-based model generalization. Supplementary materials for this article are available online.

When: Thursday, November 03, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Pingfu Fu, Department of Population and Quantitative Health Sciences, Case Western Reserve University

Abstract: Treatment effects on time-to-event endpoints in oncology studies are almost exclusively estimated using the Cox proportional hazards (PH) model. However, the PH assumption is quite restrictive, and we have witnessed many challenges and issues when the PH assumption is violated. There is increasing need for, and interest in, developing and applying alternative metrics and inference tools to assess the benefit-risk profile more efficiently. In recent years, the restricted mean survival time (RMST) has gained growing interest as an important alternative metric for evaluating survival data. In the presentation, RMST and the piece-wise Cox model will be reviewed and discussed as alternatives to the PH model in cancer clinical trials, followed by the application of those methods to a retrospective study in patients with resectable pancreatic cancer (RPC) treated with either upfront surgery (UFS) followed by adjuvant treatment or neoadjuvant chemotherapy (NAC) followed by surgery.
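
As background on the RMST named in the abstract: it is the area under the survival curve up to a chosen horizon tau, so it needs no proportional hazards assumption. The Python sketch below computes it from a Kaplan-Meier estimate. The function name `km_rmst` and the toy data are hypothetical and unrelated to the pancreatic cancer study; this is a minimal illustration, not the analysis presented in the talk.

```python
def km_rmst(times, events, tau):
    """Restricted mean survival time up to tau from a Kaplan-Meier curve.

    times  : observed follow-up times (event or censoring)
    events : 1 if the event was observed at that time, 0 if censored
    tau    : truncation horizon; the result is the area under the
             Kaplan-Meier step function on [0, tau]
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    last_time = 0.0
    area = 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        # Add the rectangle under the current step before it drops.
        area += survival * (t - last_time)
        last_time = t
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
        at_risk -= leaving
    # Final rectangle from the last observed time up to tau.
    area += survival * (tau - last_time)
    return area


if __name__ == "__main__":
    # Toy data: four subjects, the second one censored at time 2.
    print(km_rmst([1, 2, 3, 4], [1, 0, 1, 1], tau=4))
```

Comparing the RMST of two treatment arms at the same tau gives a difference measured in units of time, which is often easier to interpret than a hazard ratio.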

When: Tuesday, November 08, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Conor Kresin, Department of Statistics, University of California, Los Angeles

Abstract: Point process models offer concise mathematical representations of diverse data, ranging from disease spread and wildfire occurrences to non-physical phenomena such as financial asset price movements. Models for point process data are commonly fit using MLE or MCMC, but such methods are slow or computationally intractable for data with large n. In this talk, I will present a novel estimator based on the Stoyan-Grabarnik (sum of inverse intensity) statistic. Unlike MLE or MCMC approaches, the proposed estimator does not require approximation of a computationally expensive integral. I will show that under quite general conditions, this estimator is consistent for estimating parameters governing spatial-temporal point processes. Finally, I will present simulations demonstrating the performance of the estimator. In the second portion of the talk, I will motivate a new problem in point processes: nonparametric estimation of the complexity of thresholded and embedded point process data. I will examine this problem in the context of measuring fracturing in ballistic-struck materials, as captured by high-resolution tomographic imagery. This work is in collaboration with the Army Research Office.
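
The Stoyan-Grabarnik statistic mentioned in the abstract rests on a known identity: summing 1/intensity over the observed points gives a quantity whose expected value equals the size of the observation window, so no integral of the intensity is needed. The Python sketch below illustrates one way this identity could be turned into a fitting criterion, by comparing the statistic with the cell size over a partition of a time window. It is a hedged illustration only and not necessarily the estimator presented in the talk; the function names, the constant-rate model, and the data are all hypothetical.

```python
def sg_statistic(points, intensity):
    """Sum of inverse intensities over the observed points."""
    return sum(1.0 / intensity(t) for t in points)


def sg_discrepancy(points, intensity, cell_edges):
    """Squared gap between the statistic and the cell length, summed over cells.

    cell_edges is an increasing list of boundaries partitioning the window;
    each cell is the half-open interval [left, right).
    """
    total = 0.0
    for left, right in zip(cell_edges[:-1], cell_edges[1:]):
        in_cell = [t for t in points if left <= t < right]
        total += (sg_statistic(in_cell, intensity) - (right - left)) ** 2
    return total


def best_constant_rate(points, cell_edges, candidate_rates):
    """Grid search for the constant rate with the smallest discrepancy."""
    return min(
        candidate_rates,
        key=lambda rate: sg_discrepancy(points, lambda t: rate, cell_edges),
    )


if __name__ == "__main__":
    # Made-up event times on the window [0, 4), split into two cells.
    event_times = [0.5, 1.5, 2.5, 3.5]
    edges = [0.0, 2.0, 4.0]
    print(best_constant_rate(event_times, edges, [0.5, 1.0, 2.0]))
```

For a realistic model the constant rate would be replaced by a parametric conditional intensity and the grid search by a numerical optimizer.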

When: Thursday, November 10, 2022—1:30 p.m. to 2:30 p.m.
Where: LeConte 224

Speakers: Dr. Ben Hansen, Department of Statistics, University of Michigan

Abstract: To match closely on a propensity score is harder than it would appear. Nearest-neighbor matching on an estimate of the score may pair subjects quite closely, particularly if the procedure imposes a radius or caliper restriction. Even after matching within the strictest of such calipers, however, paired differences on the actual score can be surprisingly large. A first contribution of this research is to marshal evidence that univariate propensity score matching typically does leave large discrepancies on the underlying true score, regardless of how closely it matches on estimated scores. Even such luxuries as (a) covariates of moderate dimension ($p = o(n^{1/2})$), (b) optimal matching techniques and (c) significant overlap between the groups do not in themselves address the problem; nor does matching within the calipers suggested in extant literature. These conclusions are supported by insights from high-dimensional probability, and illustrated with a health policy example.

There's good news as well. With appropriately restrained ambitions, and appropriate methods, we can match in such a way as to force paired differences toward zero as the sample size grows without bound, while also limiting distortions of matching proximity. Under our proposed solution, a novel form of caliper matching, each of (a), (b) and (c) makes a contribution to the feasibility and quality of matching, but none of them emerges as necessary. The method can also be used to identify regions of common support. In this usage, it is expected to deliver broader regions than what common heuristics would suggest. The presentation will focus on propensity scores, but similar considerations apply to prognostic scores, principal scores and predictive mean matching.

When: Tuesday, November 15, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Annie Sauer, Department of Statistics, Virginia Polytechnic Institute and State University

Abstract: Deep Gaussian processes (DGPs) upgrade ordinary GPs through functional composition, in which intermediate GP layers warp the original inputs, providing flexibility to model non-stationary dynamics. Recent applications in machine learning favor approximate, optimization-based inference, but applications to computer surrogate modeling demand broader uncertainty quantification (UQ). We prioritize UQ through full posterior integration in a Bayesian scheme, hinging on elliptical slice sampling of latent layers. We demonstrate how our DGP's non-stationary flexibility, combined with appropriate UQ, allows for active learning: a virtuous cycle of data acquisition and model updating that departs from traditional space-filling design and yields more accurate surrogates for fixed simulation effort. But not all simulation campaigns can be developed sequentially, and many existing computer experiments are simply too big for full DGP posterior integration because of cubic scaling bottlenecks. For this case we introduce the Vecchia approximation, popular for ordinary GPs in spatial data settings. We show that Vecchia-induced sparsity of Cholesky factors allows for linear computational scaling without compromising DGP accuracy or UQ. We showcase implementation in the "deepgp" package for R on CRAN.

When: Thursday, November 17, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Fan Bu, Department of Biostatistics, University of California, Los Angeles

Abstract: Infectious disease transmission relies on interpersonal contact networks, but traditional epidemic models often assume a random-mixing population where all individuals are equally likely to get infected. We seek to develop a more realistic and generalized stochastic epidemic model that considers transmission over dynamic networks. We propose a joint epidemic-network model through a continuous-time Markov chain such that disease transmission is constrained by the contact network structure, and network evolution is in turn influenced by individual disease statuses. To accommodate partial epidemic observations commonly seen in real-world data, we design efficient inference algorithms through data augmentation that leverages dynamic network features and infection mechanisms. At the same time, our approach can account for individual heterogeneity and explore intervention effects on disease transmission. Experiments on both synthetic and real datasets demonstrate that our inference method can accurately and efficiently recover model parameters and provide valuable insight in the presence of unobserved disease episodes in epidemic data.

When: Tuesday, November 29, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Lu Xia, Department of Biostatistics, University of Washington

Abstract: Modern technologies have made it possible to collect a large amount of information in biomedical studies, where the number of covariates is comparable to or even larger than the sample size. Such high dimensionality challenges traditional parameter estimation and uncertainty quantification in regression models. Hence, statistical methods for reliable inference on regression parameters in the presence of high-dimensional covariates are warranted. While much of the existing literature focuses on linear regression, other widely used models for analysis of binary, count, time-to-event or correlated data in biomedical research pose additional challenges in both theoretical development and empirical performance. In this talk, I will first introduce a projection-based approach for inference on linear combinations of regression parameters in generalized estimating equations, under the “large p, small n” regime. The proposed approach is applied to a longitudinal study investigating the joint associations between gene expressions and riboflavin production rate in Bacillus subtilis. Secondly, I will present a de-biased lasso approach for drawing inference on stratified Cox models under the “large n, diverging p” regime. Its usefulness is demonstrated by analyzing a comprehensive list of risk factors for renal graft failure in the Scientific Registry of Transplant Recipients data. For both approaches, we provide theoretical guarantees for the asymptotic distributions of the resulting estimators, and show via simulations that the proposed methods enjoy more reliable empirical performance, particularly in estimation bias and confidence interval coverage, than their competitors.

When: Thursday, December 01, 2022—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Emily Diana, The Wharton School, University of Pennsylvania

Abstract: While data science enables rapid societal advancement, deferring decisions to machines does not automatically avoid egregious equity or privacy violations. Without safeguards in the scientific process --- from data collection to algorithm design to model deployment --- machine learning models can easily inherit or amplify existing biases and vulnerabilities present in society. My research focuses on explicitly encoding algorithms with ethical norms and constructing frameworks ensuring that statistics and machine learning methods are deployed in a socially responsible manner. In particular, I develop theoretically rigorous and empirically verified algorithms to mitigate automated bias and protect individual privacy. I will highlight this through two main contributions:

1. A new oracle-efficient and convergent algorithm to provably achieve minimax group fairness -- fairness measured by worst-case outcomes across groups -- in general settings (“Minimax Group Fairness: Algorithms and Experiments”).
2. A framework for producing a sensitive attribute proxy that allows one to train a fair model even when the original sensitive features are not available (“Multiaccurate Proxies for Downstream Fairness”).

When: Tuesday, January 10, 2023—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Joshua Agterberg, Department of Applied Mathematics and Statistics, Johns Hopkins University

Abstract: Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this talk we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\ell_{2,\infty}$ tensor perturbation bound for HOOI under independent, possibly heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. Whereas other leave-one-out analyses typically focus on sequences constructed by analyzing the output of a given algorithm with a small part of the noise removed, our leave-one-out analysis constructions use both the previous iterates and the additional tensor structure to eliminate a potential additional source of error. Finally, we apply our methodology to US flight data, showcasing the effect of COVID-19 on flights.

When: Thursday, January 12, 2023—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Tianying Wang, Center for Statistical Science, Tsinghua University

Abstract: Gene-trait associations can be complex due to underlying population heterogeneity, gene-environment interactions, and various other reasons. Existing gene-based tests are mean-based, and may miss or underestimate higher-order associations that could be scientifically interesting. In this talk, we will introduce a new family of gene-level association tests that integrate the quantile rank score process to better accommodate complex associations. The resulting test statistics have multiple advantages: (1) they are almost as efficient as the best existing tests when the associations are homogeneous across quantile levels and have improved efficiency for complex and heterogeneous associations, (2) they provide useful insights into risk stratification, (3) the test statistics are distribution-free and could hence accommodate a wide range of underlying distributions, and (4) they are computationally efficient. We established the asymptotic properties of the proposed tests under the null and alternative hypotheses and conducted large-scale simulation studies to investigate their finite sample performance. The performance of the proposed approach is compared with that of conventional mean-based tests through simulation studies and applications to a Metabochip dataset on lipid traits, and to the genotype-tissue expression data in GTEx to identify genes whose expression levels are associated with cis-eQTLs.

When: Tuesday, January 17, 2023—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Robert Trangucci, Department of Statistics, University of Michigan

Abstract: Despite the importance of vaccine efficacy against post-infection outcomes like transmission or severe illness, these estimands are unidentifiable without extra measurements, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run across geographically disparate sites, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

When: Thursday, January 19, 2023—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Charles Doss, University of Minnesota, School of Statistics

Abstract: Much of the literature on evaluating the significance of a causal treatment effect based on observational data has been confined to discrete treatments. These methods are not applicable to drawing inference for a continuous treatment, which arises in many important applications. To adjust for confounders when evaluating a continuous treatment, existing inference methods often rely on discretizing the treatment or using (possibly misspecified) parametric models for the effect curve. We develop a nonparametric and "doubly robust" (i.e., efficient) procedure for testing whether there is a treatment effect. We establish the test statistic's asymptotic distribution (by using empirical process techniques for local U- and V-processes). Furthermore, we propose a wild bootstrap procedure for implementing the test in practice and show it is asymptotically valid. We illustrate the new method via simulations and a study of a dataset relating the effect of nurse staffing hours on hospital performance.

When: Thursday, March 30, 2023—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speakers: Dr. Frank Berger, Research and Outreach Liaison, Colorectal Cancer Prevention Network, University of South Carolina

Abstract: Colorectal cancer is a major cause of cancer incidence and death in the United States and in South Carolina. Each year, the disease is diagnosed in ~2,300 South Carolinians, and kills ~800. While progress in preventive screening and treatment has led to reductions in the toll of colorectal cancer, challenges remain, particularly with regard to uninsured and medically underserved populations. The Colorectal Cancer Prevention Network within USC’s College of Arts & Sciences has developed a robust program that provides awareness, education, and access to screening throughout South Carolina, with a focus on underserved communities. I will provide a brief overview of the history, operation, and impact of this program. In addition, I will summarize future research opportunities involving biostatistics, bioinformatics, and computational biology that exist within the extensive database currently utilized to oversee and monitor the program and the patients it serves.

2021 – 2022 Department of Statistics Colloquium Speakers

When: Thursday, September 23, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. David Matteson, Department of Statistics and Data Science and Department of Social Statistics, Cornell University

Abstract: We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building on a global–local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence between the local scale parameters. The resulting processes inherit the desirable shrinkage behavior of popular global–local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modeling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya–gamma scale mixture representation of the process proposed. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than do competing methods, and we apply the model for irregular curve fitting of minute‐by‐minute Twitter central processor unit usage data. In addition, we develop an adaptive time varying parameter regression model to assess the efficacy of the Fama–French five‐factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that, with the exception of the market risk, no other risk factors are significant except for brief periods. If time permits, we will also highlight extensions to change point analysis and adaptive outlier detection.

When: Thursday, September 30, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. Sayar Karmakar, Department of Statistics, University of Florida

Abstract: Since their inception, the ARCH model (Engle, 1982) and its generalization, the GARCH model (Bollerslev, 1986), have remained the preferred models for analyzing financial datasets. To econometricians, the log-returns are more important than the actual value of the stock; with mean-zero data in hand, all interesting dynamics of the datasets are transferred to the variance, or volatility, of the datasets conditional on past observations. In this talk, we first focus on formally introducing these conditional heteroscedastic (CH) models. Then we steer the talk towards three of my recent works on these models. In the first one, we briefly touch upon simultaneous inference for time-varying ARCH-GARCH models. Addressing the concern that a large sample size is needed, we next introduce a Bayesian estimation method for the same models that has a valid posterior contraction rate and is computationally tractable. In the last part of this talk, we discuss a new model-free technique for time-aggregated forecasts for CH datasets. Time permitting, we discuss some ongoing work on the GARCHX model, where the inclusion of covariates can yield some forecasting gains.

When: Thursday, October 14, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Brian Habing, Department of Statistics, University of South Carolina

Abstract: The National Institute of Statistical Sciences (NISS) runs a wide variety of seminars, meetings, and workshops.  The office in Washington DC also provides services to a variety of government agencies.  One of these is the running of expert panels for the National Center for Education Statistics in the US Department of Education.  This past academic year the three panels were "Post COVID Surveys", "Innovative Graphics for NCES Online Reports", and "Setting Priorities for Federal Data Access to Expand The Context for Education Data".  This talk will go over some of the topics discussed in these panels - both the statistical and non-statistical - with some thoughts on statistics education at the graduate level.

When: Thursday, October 21, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Yen-Yi Ho for access)

Speaker: Dr. Sandra Safo, Division of Biostatistics, University of Minnesota

Abstract: Classification methods that leverage the strengths of multiple molecular data (e.g., genomics, proteomics) and a clinical outcome (e.g., disease status) simultaneously have more potential to reveal molecular relationships and functions associated with a clinical outcome. Most existing methods for joint association and classification (i.e., one-step methods) have focused on linear relationships, but the relationships among multiple molecular data, and between molecular data and an outcome are too complicated to be solely understood by linear methods. Further, existing nonlinear one-step methods do not allow for feature selection, a key factor for having explainable models. We propose Deep IDA to address limitations in existing works. Deep IDA learns nonlinear projections of two or more molecular data (or views) that maximally associate the views and separate the classes in each view, and permits feature selection for interpretable models. Our framework for feature ranking is general and adaptable to many deep learning methods and has potential to yield explainable deep learning models for associating multiple views. We applied Deep IDA to RNA sequencing, proteomics, and metabolomics data with the goal of identifying biomolecules that separate patients with and without COVID-19 who were admitted or not admitted to the intensive care unit. Our results demonstrate that Deep IDA works well, compares favorably to other linear and nonlinear multi-view methods, and is able to identify important molecules that facilitate an understanding of the molecular architecture of COVID-19 disease severity.

When: Thursday, October 28, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Ray Bai, Department of Statistics, University of South Carolina

Abstract: Many studies have reported associations between later-life cognition and socioeconomic position throughout the life course. The majority of these studies, however, are unable to quantify how these associations vary over time or with respect to several demographic factors. Varying coefficient (VC) models, which treat the coefficients in a linear model as nonparametric functions of additional effect modifiers, offer an appealing way to overcome these limitations. Unfortunately, state-of-the-art VC modeling methods require computationally prohibitive hand-tuning or make restrictive assumptions about the functional form of the coefficients. In response, we propose VCBART, which estimates each varying coefficient using Bayesian Additive Regression Trees. On both simulated and real data, with simple default hyperparameters, VCBART outperforms existing methods in terms of coefficient estimation and prediction. Theoretically, we show that the VCBART posterior concentrates around the true varying coefficients at the near-minimax rate.

Using VCBART, we predict the cognitive trajectories of 3,739 subjects from the Health and Retirement Study using multiple measures of socioeconomic position throughout the life course. We find that the association between later-life cognition and educational attainment does not vary substantially with age, but there is evidence of a persistent benefit of education after adjusting for later-life health. In contrast, the association between later-life cognition and health conditions like diabetes is negative and time-varying.

When: Thursday, November 4, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Candace Berrett, Department of Statistics, Brigham Young University

Abstract: Changes to the environment surrounding a temperature measuring station can cause local changes to the recorded temperature that deviate from regional temperature trends. This phenomenon -- often caused by construction or urbanization -- occurs at a local level. If these local changes are assumed to represent regional or global processes it can have significant impacts on historical data analysis. These changes or deviations are generally gradual, but can be abrupt, and arise as construction or other environment changes occur near a recording station. We propose a methodology to examine if changes in temperature trends at a point in time exist at a local level at various locations in a region. Specifically, we propose a Bayesian change point model for spatio-temporally dependent data where we select the number of change points at each location using a "forwards" selection process using deviance information criterion (DIC). We then fit the selected model and examine the linear slopes across time to quantify the local changes in long-term temperature behavior. We show the utility of this model and method using a synthetic data set and observed temperature measurements from eight stations in Utah consisting of daily temperature data for 60 years.

When: Thursday, November 11, 2021—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Jesse Frey, Department of Mathematics and Statistics, Villanova University

Abstract: There are several existing methods for finding a confidence interval for a proportion using balanced ranked-set sampling (RSS).  We improve on these methods, working in two directions.  First, we develop improved exact confidence intervals for cases where perfect rankings can be assumed.  We do this by using the method of Blaker (2000) rather than the method of Clopper and Pearson (1934) and by working with the MLE for the unknown proportion rather than with the total number of successes.  The resulting intervals outperform the existing intervals and also come close to a new theoretical lower bound on the expected length for RSS-based exact confidence intervals.  Second, we develop approximate confidence intervals that are robust to imperfect rankings.  One of the methods we develop is a mid-P-adjusted interval based on the Clopper and Pearson (1934) method, and the other is based on the Wilson (1927) method.  We compare the intervals in terms of coverage and length, finding that both have their advantages.  Both of these robust methods rely on a new maximum-likelihood-based method for estimating in-stratum proportions when the overall proportion is known.  We also report on investigations into using adjusted Wald intervals in the imperfect rankings case.
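
For orientation, the Wilson (1927) interval cited in the abstract has a simple closed form in the ordinary setting of a simple random sample. The Python sketch below implements that textbook version only; it does not implement the ranked-set sampling intervals developed in the talk. The function name `wilson_interval` is hypothetical, and the default z = 1.96 is the usual approximate 95% normal quantile.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Textbook version for a simple random sample of size n;
    z is the standard normal quantile for the desired confidence level.
    """
    p_hat = successes / n
    denominator = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    ) / denominator
    return center - half_width, center + half_width


if __name__ == "__main__":
    lower, upper = wilson_interval(successes=5, n=10)
    print(lower, upper)
```

Unlike the simple Wald interval, the Wilson interval never extends outside [0, 1] and remains non-degenerate when the observed proportion is 0 or 1.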

When: Thursday, January 13, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Likun Zhang, Lawrence Berkeley National Laboratory

Abstract: The detection of changes over time in the distribution of precipitation extremes is significantly complicated by noise at the spatial scale of daily weather systems. This so-called "storm dependence" is non-negligible for extreme precipitation and makes detecting changes over time very difficult. To appropriately separate spatial signals from spatial noise due to storm dependence, we first utilize a well-developed Gaussian scale mixture model that directly incorporates extremal dependence. Our method uses a data-driven approach to determine the dependence strength of the observed process (either asymptotic independence or dependence) and is generalized to analyze changes over time and increase the scalability of computations. We apply the model to daily measurements of precipitation over the central United States and compare our results with single-station and conditional independence methods. Our main finding is that properly accounting for storm dependence leads to increased detection of statistically significant trends in the climatology of extreme daily precipitation. Next, in order to extend our analysis to much larger spatial domains, we propose a mixture component model that achieves flexible dependence properties and allows truly high-dimensional inference for extremes of spatial processes. We modify the popular random scale construction via adding non-stationarity to the Gaussian process while allowing the radial variable to vary smoothly across space. As the level of extremeness increases, this single model exhibits both long-range asymptotic independence and short-range weakening dependence strength that leads to either asymptotic dependence or independence. To make inference on the model parameters, we construct global Bayesian hierarchical models and run adaptive Metropolis algorithms concurrently via parallelization. For future work to allow efficient computation, we plan to explore local likelihood and dimension reduction approaches.

When: Friday, January 14, 2022—3:30 p.m. to 4:20 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Yuexia Zhang, University of Toronto

Abstract: Mediation analysis is an important tool to study causal associations in biomedical and other scientific areas and has recently gained attention in microbiome studies. With a microbiome study of acute myeloid leukemia (AML) patients, we investigate whether the effect of induction chemotherapy intensity levels on the infection status is mediated by the microbial taxa abundance. The unique characteristics of the microbial mediators (high dimensionality, zero-inflation, and dependence) call for new methodological developments in mediation analysis. The presence of an exposure-induced mediator-outcome confounder, antibiotics usage, further requires a delicate treatment in the analysis. To address these unique challenges brought by our motivating microbiome study, we propose a novel nonparametric identification formula for the interventional indirect effect (IIE), a measure recently developed for studying mediation effects. We develop the corresponding estimation algorithm and test the presence of mediation effects via constructing the nonparametric bias-corrected and accelerated bootstrap confidence intervals. Simulation studies show that the proposed method has good finite sample performance in terms of the IIE estimation, and type-I error rate and power of the corresponding test. In the AML microbiome study, our findings suggest that the effect of induction chemotherapy intensity levels on infection is mainly mediated by patients' gut microbiome.

When: Tuesday, January 18, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Mr. Yabo Niu, Texas A&M University

Abstract: In traditional Gaussian graphical models, data homogeneity is routinely assumed with no extra variables affecting the conditional independence. In modern genomic datasets, there is an abundance of auxiliary information, which often gets under-utilized in determining the joint dependency structure. In this talk, I will present a new Bayesian approach to model undirected graphs underlying heterogeneous multivariate observations with additional assistance from covariates. Building on product partition models, we propose a novel covariate-dependent Gaussian graphical model that allows graphs to vary with covariates so that observations whose covariates are similar share a similar undirected graph. To efficiently embed Gaussian graphical models into the proposed framework, we explore both Gaussian likelihood and pseudo-likelihood functions. For Gaussian likelihood, a G-Wishart prior is used as a natural conjugate prior, and for the pseudo-likelihood, a product of Gaussian-conditionals is used. Moreover, the proposed model induced by the prior has large support and is flexible to approximate any piece-wise constant conditional variance-covariance matrices. Furthermore, based on the theory of fractional likelihood, the rate of posterior contraction is minimax optimal assuming the true density to be a Gaussian mixture with a known number of components. I will demonstrate the efficacy of the approach via numerical studies and analysis of protein networks for a breast cancer dataset assisted by genetic covariates.

When: Thursday, January 20, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Mr. Ting Fung Ma, University of Wisconsin-Madison

Abstract: This talk concerns the development of a unified and computationally efficient method for change-point inference in non-stationary spatio-temporal processes. By modeling a non-stationary spatio-temporal process as a piecewise stationary spatio-temporal process, I consider simultaneous estimation of the number and locations of change-points, and model parameters in each segment. A composite likelihood-based criterion is developed for change-point and parameter estimation. Under the framework of increasing domain asymptotics and mild regularity conditions, theoretical results including consistency and distribution of the estimators are derived. In contrast to the classical results in fixed dimensional time series where the localization error of change-point estimator is Op(1), exact recovery of true change-points can be achieved in the spatio-temporal setting. More surprisingly, the consistency of change-point estimation can be achieved without any penalty term in the criterion function. In addition, we establish consistency of the number and locations of the change-point estimator under the infill asymptotics framework where the time domain is increasing while the spatial sampling domain is fixed. A computationally efficient pruned dynamic programming algorithm is developed for the challenging criterion optimization problem. Extensive simulation studies and an application to U.S. precipitation data are provided to demonstrate the effectiveness and practicality of the proposed method. I will also briefly describe my other on-going projects in spatio-temporal statistics.

When: Monday, January 24, 2022—3:30 p.m. to 4:20 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Qi Zheng, University of Louisville

Abstract: With the availability of high dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients' survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. However, to our knowledge, few works are available to draw inference on the effects of high dimensional predictors for censored quantile regression. We propose a novel fused procedure to draw inference on all predictors within the framework of "global" censored quantile regression, where the quantile level is over an interval, instead of several discrete values. The proposed estimator combines a sequence of low dimensional model fitting based on multi-sample splitting and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level. Simulation studies indicate that our procedure properly quantifies the uncertainty of effect estimates in high-dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in the lung cancer pathways on patients' survival, using the Boston Lung Cancer Survivor Cohort, a cancer epidemiology study investigating the molecular mechanism of lung cancer.

When: Friday, January 28, 2022—3:30 p.m. to 4:20 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Yunpeng Zhao, Arizona State University

Abstract: Consider a Boolean function f: {0, 1, ..., N} --> {0,1}, where x is called a good element if f(x)=1 and a bad element otherwise. Estimating the proportion of good elements, or equivalently, approximate counting, is one of the most fundamental problems in statistics and computer science. The error bound of the confidence interval based on the classical sample proportion scales as O(1/√n) according to the central limit theorem, where n is the number of times the function f is queried. Quantum algorithms can break this fundamental limit. We propose an adaptive quantum algorithm for proportion estimation and prove that the error bound achieves O(1/n) up to a double-logarithmic factor. The algorithm is based on Grover's algorithm, without the use of quantum phase estimation. We show with an empirical study that our algorithm uses a similar number of queries to achieve the same level of precision compared to other algorithms, but the classical part of our algorithm is significantly more efficient.
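
The classical O(1/√n) rate mentioned in the abstract can be illustrated with the usual normal-approximation half-width for a sample proportion. This is a minimal sketch of the classical baseline only, not of the quantum algorithm; the function name `classical_half_width` and the example values are illustrative assumptions.

```python
import math


def classical_half_width(p, n, z=1.96):
    """Normal-approximation half-width z * sqrt(p(1-p)/n) for a sample proportion.

    p is the true proportion of good elements and n the number of queries.
    """
    if n <= 0 or not 0 <= p <= 1:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Quadrupling the number of queries only halves the error bound.
    for n in (100, 400, 1600):
        print(n, classical_half_width(0.3, n))
```

Under this rate, cutting the error by a factor of 10 takes roughly 100 times as many queries, whereas an O(1/n) bound would need only about 10 times as many, ignoring logarithmic factors.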

When: Monday, January 31, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. John Stufken, University of North Carolina-Greensboro

Abstract: Whether due to the size of datasets or computational challenges, there is a vast amount of literature on using only some of the data (subdata) for estimation or prediction. This raises the question of how subdata should be selected from the entire dataset (full data). One possibility is to select the subdata completely at random from the full data, but this is typically not the best method. The literature contains various suggestions for better alternatives. After introducing leverage-based methods and the idea of information-based optimal subdata selection, we will discuss extensions, weaknesses, and open problems.

When: Thursday, April 14, 2022—2:50 p.m. to 3:50 p.m.
Where: Coliseum Room 3001

Speaker: Dr. Karen Messer, Professor and Division Chief in the Department of Family Medicine and Public Health, Division of Biostatistics & Bioinformatics, UC San Diego

Abstract: Motivated by the high levels of informative dropout that are common in clinical trials of Alzheimer’s disease, we develop an imputation approach to doubly-robust estimation in longitudinal studies with monotone dropout. In this approach, the missing data are first imputed using a doubly robust algorithm, and then the completed data are substituted into a standard full-data estimating equation. An advantage is that, once the missing data are imputed, any estimand defined by a wide class of estimating equations will inherit the doubly robust properties of the imputation method. Theoretical properties are developed. We illustrate the approach using data from a large randomized trial from the Alzheimer’s Disease Cooperative Study (ADCS) group.

When: Friday, April 15, 2022—3:00 p.m. to 4:00 p.m.
Where: Discovery Building Room 3001

Speaker: Dr. Karen Messer, Professor and Division Chief in the Department of Family Medicine and Public Health, Division of Biostatistics & Bioinformatics, UC San Diego

Abstract: Matching on the propensity score (PS) is a popular approach to causal inference in observational studies, because of interpretability and robustness. A doubly-matched estimator matches simultaneously on the PS, which models the probability of selecting treatment (e.g. of using an e-cigarette to help quit smoking), and the prognostic score (PGS), which models the potential outcome (e.g. successful smoking cessation). In this talk we discuss two issues:  how to make such matching estimators doubly robust, and how to make proper inference from matching estimators which use complex survey data. We define and study several doubly-matched estimators, and show that they are doubly robust under some assumptions; that is, they are consistent when either the PS or the PGS is correctly estimated. We compare other doubly robust approaches (PS matching followed by regression adjustment, and augmented inverse probability weighting). Approaches to inference for matching estimators are investigated, namely, the full bootstrap, the conditional bootstrap, the jackknife (Fay’s balanced repeated replication) and a conditional parametric approach. The problem of inference from complex survey data is discussed. Use of the proposed estimators is demonstrated for a case study investigating the causal effect of e-cigarette use for smoking cessation in the US, using data from a large nationally representative longitudinal survey.

2020 – 2021 Department of Statistics Colloquium Speakers

When: Thursday, September 17, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Jonathan Schildcrout, Vanderbilt University

Abstract: Response-selective sampling (RSS) designs are efficient when response and covariate data are available, but exposure ascertainment costs limit sample size. In longitudinal studies with a continuous outcome, RSS designs can lead to efficiency gains by using low-dimensional summaries of individual longitudinal trajectories to identify the subjects in whom the expensive exposure variable will be collected. Analyses can be conducted using the outcome, confounder and target exposure data from sampled subjects, or they can combine complete data from the sampled subjects with the partial data (i.e., outcome and confounder only) from the unsampled subjects. In this talk, we will discuss designs for longitudinal and multivariate longitudinal data, and an imputation approach for ODS designs that can be more efficient than the complete case analysis, may be easier to implement than full-likelihood approaches, and is more broadly applicable than other imputation approaches. We will examine finite sampling operating characteristics of the imputation approach under several ODS designs and longitudinal data features.

When: Thursday, October 22, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Bereket Kindo, Humana Inc.

Abstract: We first, using examples, explore the types of business problems statistical concepts help solve in the insurance and healthcare industry. The purpose will be to showcase the usefulness of different statistical concepts such as design of experiments, development of statistical models for business insight generation, and development of statistical models for prediction in solving business problems.

Secondly, we discuss key topics that, if built into graduate program curricula, would better prepare students for a career in industry. Examples include teaching ethical and societal implications of statistical models implemented in the “real world”, and statistical communication and computing skills needed to collaboratively work with non-statistician colleagues. The purpose is to spur conversations of whether and how (more of) these topics should be embedded in statistics program curricula.

We will then leave time for questions either related to the topics/examples discussed or for general questions of how statistical concepts are applied in industry.

When: Thursday, November 05, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Xuan Cao, University of Cincinnati

When: Thursday, November 19, 2020—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Robert Lyles, Emory University

Abstract: When feasible, the application of serial principled sampling designs for testing provides an arguably ideal approach to monitoring prevalence and case counts for infectious diseases. Given the logistics involved and required resources, however, surveillance efforts can benefit from any available means to improve the precision of sampling-based estimates. In this regard, one option is to augment the analysis with data from other available surveillance streams that may be highly non-representative but that identify cases from the population of interest over the same timeframe. We consider the monitoring of a closed population (e.g., a long-term care facility, inpatient hospital setting, university, etc.), and encourage the use of capture-recapture (CRC) methodology to produce an alternative case total and/or prevalence estimate to the one obtained by principled sampling. With care in its implementation, the random or representative sample not only provides its own valid estimate, but justifies another based on classical CRC methods. We initially consider weighted averages of the two estimators as a way to enhance precision, and then propose a single CRC estimator that offers a unified and arguably preferable solution. We illustrate using artificial data from a hypothetical daily COVID-19 monitoring program applied in a closed community, along with simulation studies to explore the benefits of the proposed estimators and the properties of confidence and credible intervals for the case total. A setting in which some of the tests applied to the community are imperfect with known sensitivity and specificity is briefly considered.
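
The classical two-stream capture-recapture estimators that the abstract builds on can be sketched as follows. These are the textbook Lincoln-Petersen and Chapman estimators, not the speaker's proposed single CRC estimator; the function names and the example counts are illustrative assumptions.

```python
def lincoln_petersen(n1, n2, m):
    """Lincoln-Petersen estimate of the case total.

    n1: cases identified by stream 1 (e.g., a non-representative stream)
    n2: cases identified by stream 2 (e.g., a random sample)
    m:  cases identified by both streams
    """
    if m <= 0:
        raise ValueError("Lincoln-Petersen needs at least one overlapping case")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's bias-corrected variant, defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts: 50 and 40 cases per stream, 10 found by both.
    print(lincoln_petersen(50, 40, 10))
    print(chapman(50, 40, 10))
```

Both estimators rely on the two streams identifying cases independently, which is the assumption a properly drawn random sample is meant to justify.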

When: Thursday, March 11, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Joseph Antonelli, University of Florida

Abstract: Communities often self-select into implementing a regulatory policy, and adopt the policy at different time points. In New York City, neighborhood policing was adopted at the police precinct level over the years 2015-2018, and it is of interest to both (1) evaluate the impact of the policy, and (2) understand what types of communities are most impacted by the policy, raising questions of heterogeneous treatment effects. We develop novel statistical approaches that are robust to unmeasured confounding bias to study the causal effect of policies implemented at the community level. Using techniques from high-dimensional Bayesian time-series modeling, we estimate treatment effects by predicting counterfactual values of what would have happened in the absence of neighborhood policing. We couple the posterior predictive distribution of the treatment effect with flexible modeling to identify how the impact of the policy varies across time and community characteristics. Using pre-treatment data from New York City, we show our approach produces unbiased estimates of treatment effects with valid measures of uncertainty. Lastly, we find that neighborhood policing decreases discretionary arrests, but has little effect on crime or racial disparities in arrest rates.

When: Thursday, March 25, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Dan Nordman, Iowa State University

Abstract: The talk overviews a prediction problem encountered in reliability engineering, where a need arises to predict the number of future events (e.g., failures) among a cohort of units associated with a time-to-event process.  Examples include the prediction of warranty returns or the prediction of the number of product failures that could cause serious harm.  Important decisions, such as a product recall, are often based on such predictions.  The data consist of a collection of units observed over time where, at some freeze point in time, inspection then ends, leaving some units with realized event times while other units have not yet experienced events.  In other words, data are right-censored, where (crudely expressed) some units have "died" by the freeze time while other units continue to "survive."  The problem becomes predicting how many of the "surviving" units will have events (i.e., "death") in a next future interval of time, as quantified in terms of a prediction bound or interval.  Because all units belong to the same data set, either by providing direct information (i.e., observed event times) or by becoming the subject of prediction (i.e., censored event times), such predictions are called within-sample predictions and differ from other prediction problems considered in most literature.  A standard plug-in (also known as estimative) prediction approach turns out to be invalid for this problem (i.e., for even large amounts of data, the method fails to have correct coverage probability).  However, several bootstrap-based methods can provide valid prediction intervals. To this end, a commonly used prediction calibration method is shown to be asymptotically correct for within-sample predictions, and two alternative predictive-distribution-based methods are presented that perform better than the calibration method.

When: Thursday, April 1, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

When: Part I: Thursday, April 8, 2021—2:50 p.m. to 3:50 p.m.; Part II: Friday, April 9, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Michael Kosorok, Distinguished Professor of Biostatistics and Professor of Statistics and Operations Research at UNC-Chapel Hill

Abstract: Precision health is the science of data-driven decision support for improving health at the individual and population levels. This includes precision medicine and precision public health and circumscribes all health-related challenges which can benefit from a precision operations approach. This framework strives to develop study design, data collection and analysis tools to discover empirically valid solutions to optimize outcomes in both the short and long term. This includes incorporating and addressing heterogeneity across persons, time, location, institutions, communities, key stakeholders, and other contexts. In these two lectures, we review some recent developments in this area, imagine what the future of precision health can look like, and outline a possible path to achieving it.  Several applications in a number of disease areas will be examined.

When: Thursday, April 15, 2021—2:50 p.m. to 3:50 p.m.
Where: Zoom (contact Dr. Karl Gregory for access)

Speaker: Dr. Shili Lin, Ohio State University

Abstract: Advances in molecular technologies have revolutionized research in genomics in the past two decades. Among them is Hi-C, a molecular assay coupled with the Next Generation Sequencing (NGS) technology, which enables genome-wide detection of physical chromosomal contacts. The availability of data generated from Hi-C makes it possible to recapitulate the underlying three-dimensional (3D) spatial chromatin structure of a genome. However, several special features (including dependency and over-dispersion) of the NGS-based high-throughput data have posed challenges for statistical modeling and inference. As the technology is further developed from bulk to single cells, enabling the study of cell-to-cell variation, further complications arise due to extreme data sparsity — as a result of insufficient sequencing depth — leading to a great deal of dropouts (e.g. observed zeros) that are confounded with true structural zeros. In this talk, I will discuss tREX, a random effect modeling strategy that specifically addresses the special features of Hi-C data. I will further discuss SRS, a self-representation smoothing methodology for identifying structural zeros and imputing dropouts, and for overall data quality improvement. Simulation studies and real data applications will be described to illustrate the methods.

2019 – 2020 Department of Statistics Colloquium Speakers

John Grego: Brief encounters with the tidyverse

Karl Gregory: How and why to make R packages

Yen-Yi Ho: Using high-performance computing clusters managed by research computing (RCI) at USC

Minsuk Shin: A natural combination of R and Python using the reticulate R package

Dewei Wang: Website construction (an easy-to-use html editor: KompoZer)

Check Dr. Grego's folder for all lightning talks.

When: Thursday, September 19, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Brook Russell, Clemson University

Abstract: Hurricane Harvey brought extreme levels of rainfall to the Houston, TX area over a seven-day period in August 2017, resulting in catastrophic flooding that caused loss of human life and damage to personal property and public infrastructure. In the wake of this event, there has been interest in understanding the degree to which this event was unusual and estimating the probability of experiencing a similar event in other locations. Additionally, researchers have aimed to better understand the ways in which the sea surface temperature (SST) in the Gulf of Mexico (GoM) is associated with precipitation extremes in this region. This work addresses all of these issues through the development of a multivariate spatial extreme value model.

Our analysis indicates that warmer GoM SSTs are associated with higher precipitation extremes in the western Gulf Coast (GC) region during hurricane season, and that the precipitation totals observed during Hurricane Harvey are less unusual based on the warm GoM SST in 2017. As SSTs in the GoM are expected to steadily increase over the remainder of this century, this analysis suggests that western GC locations may experience more severe precipitation extremes during hurricane season.

When: Thursday, October 3, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ana-Maria Staicu, North Carolina State University

Abstract: We consider longitudinal functional regression, where for each subject the response consists of multiple curves observed at different time visits. We discuss tests of significance in two general settings. First, we focus on the bivariate mean function and develop tests to formally assess that it is time-invariant. Second, in the presence of an additional covariate, we develop hypothesis testing to assess the significance of the covariate's time-varying effect. The procedures account for the complex dependent structure of the response and are computationally efficient. Numerical studies confirm that the testing approaches have the correct size and have superior power relative to available competitors. The methods are illustrated on a data application.

When: Thursday, October 31, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Robert Krafty, University of Pittsburgh

Abstract: Many studies collect functional data from multiple subjects that have both multilevel and multivariate structures. An example of such data comes from popular neuroscience experiments where participants' brain activity is recorded using modalities such as EEG or fMRI and summarized as power within multiple time-varying frequency bands at multiple brain regions. An important question is summarizing the joint variation across multiple frequency bands for both whole-brain variability between subjects, as well as location-variation within subjects. In this talk, we discuss a novel approach to conducting interpretable principal components analysis on multilevel multivariate functional data that decomposes total variation into subject-level and replicate-within-subject-level variation, and provides interpretable components that can be both sparse among variates and have localized support over time within each frequency band. The sparsity and localization of components is achieved by solving an innovative rank-one based convex optimization problem with block Frobenius and matrix L1-norm based penalties. The method is used to analyze data from a study to better understand blunted affect, revealing new neurophysiological insights into how subject- and electrode-level brain activity are connected to the phenomenon of trauma patients "shutting down" when presented with emotional information.

When: Thursday, November 14, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Paramita Sarah Chaudhuri, McGill University

Abstract: In this talk, I will introduce a pooling design and likelihood-based estimation technique for estimating parameters of a Cox Proportional Hazards model. To consistently estimate HRs with a continuous time-to-event outcome and a continuous exposure that is subject to pooling, I first focus on a riskset-matched nested case-control (NCC) subcohort of the original cohort. Refocusing on the NCC subcohort, in turn, allows one to use pooled exposure levels to estimate HRs. This is done by first recognizing that the likelihood for estimating HRs from an NCC design is similar to a likelihood for estimating odds ratios (ORs) from a matched case-control design. Secondly, pooled exposure levels from a matched case-control design can be used to consistently estimate ORs. Recasting the original problem within an NCC subcohort enables us to form riskset-matched strata, randomly form pooling groups with the matched strata and pool the exposures correspondingly, without explicitly considering event times in the estimation. Pooled exposure levels can then be used within a standard maximum likelihood framework for matched case-control design to consistently estimate HR and obtain the model-based standard error.  Confounder effects can also be estimated based on pooled confounder levels. I will share simulation results and analysis of a real dataset to demonstrate a practical application of this approach.

When: Thursday, November 21, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Fabrizio Ruggeri, the Istituto di Matematica Applicata e Tecnologie Informatiche "Enrico Magenes"

Abstract: Many complex real-world systems such as airport terminals, manufacturing processes and hospitals are modeled with networks of queues. To estimate parameters, restrictive assumptions are usually placed on these models. For instance, arrival and service distributions are assumed to be time-invariant. Violating this assumption are so-called dynamic queuing networks (DQNs) which are more realistic but do not allow for likelihood-based parameter estimation. We consider the problem of using data to estimate the parameters of a DQN.  This is the first example of Approximate Bayesian Computation (ABC) being used for parameter inference of DQNs.

We combine computationally efficient simulation of DQNs with ABC and an estimator for maximum mean discrepancy.  DQNs are simulated in a computationally efficient manner with Queue Departure Computation (a simulation technique we are proposing), without the need for time-invariance assumptions, and parameters are inferred from data without strict data-collection requirements. Forecasts are made which account for parameter uncertainty. We embed this queuing simulation within an ABC sampler to estimate parameters for DQNs in a straightforward manner. We motivate and demonstrate this work with the example of passengers arriving at the passport control in an international airport.

When: Thursday, December 05, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Amanda Hering, Baylor University

Abstract: Decentralized wastewater treatment (WWT) facilities monitor many features that are complexly related. The ability to detect faults quickly and accurately identify the variables that are affected by a fault is vital to maintaining proper system operation and high-quality produced water. Various multivariate methods have been proposed to perform fault detection and isolation, but most methods require the data to be independent and identically distributed when the process is in control (IC) as well as a distributional assumption. We propose a distribution-free fault isolation method for nonstationary and dependent multivariate processes. We detrend the data using observations from an IC time period to account for expected changes due to external or user-controlled factors. Next, we perform fused lasso, which penalizes differences in consecutive observations, to detect faults and identify affected variables. To account for autocorrelation, the regularization parameter is chosen using an estimated effective sample size in the Extended Bayesian Information Criterion. We demonstrate the performance of our method compared to a state-of-the-art competitor in a simulation study. Finally, we apply our method to WWT facility data with a known fault, and the affected variables identified by our proposed method are consistent with the operators' diagnosis of the cause of the fault.

When: Tuesday, January 14, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ray Bai, University of Pennsylvania

Abstract: Nonparametric varying coefficient (NVC) models are widely used for modeling time-varying effects on responses that are measured repeatedly. In this talk, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian estimation and variable selection in NVC models. The NVC-SSL simultaneously selects and estimates the functionals of the significant time-varying covariates, while also accounting for temporal correlations. Our model can be implemented using a highly efficient expectation-maximization (EM) algorithm, thus avoiding the computational intensiveness of Markov chain Monte Carlo (MCMC) in high dimensions. In contrast to frequentist NVC models, hardly anything is known about the large-sample properties of Bayesian NVC models. In this talk, we take a step towards addressing this longstanding gap between methodology and theory by deriving posterior contraction rates under the NVC-SSL model when the number of covariates grows at a nearly exponential rate with sample size. Finally, we introduce a simple way to make our method robust to misspecification of the temporal correlation structure. We illustrate our methodology through simulation studies and data analysis.

When: Wednesday, January 15, 2020—4:30 p.m. to 5:00 p.m.
Where: LeConte 210

Speaker: Jessica Gronsbell, Verily Life Sciences

Abstract: The widespread adoption of electronic health records (EHR) and their subsequent linkage to specimen biorepositories has generated massive amounts of routinely collected medical data for use in translational research.  These integrated data sets enable real-world predictive modeling of disease risk and progression.  However, data heterogeneity and quality issues impose unique analytical challenges to the development of EHR-based prediction models.  For example, ascertainment of validated outcome information, such as presence of a disease condition or treatment response, is particularly challenging as it requires manual chart review.  Outcome information is therefore only available for a small number of patients in the cohort of interest, unlike the standard setting where this information is available for all patients.  In this talk I will discuss semi-supervised and weakly-supervised learning methods for predictive modeling in such constrained settings where the proportion of labeled data is very small.  I demonstrate that leveraging unlabeled examples can improve the efficiency of model estimation and evaluation and in turn substantially reduce the amount of labeled data required for developing prediction models.

When: Friday, January 17, 2020—4:30 p.m. to 5:00 p.m.
Where: LeConte 210

Speaker: Nicholas Syring, Washington University

Abstract: Bayesian methods provide a standard framework for statistical inference in which prior beliefs about a population under study are combined with evidence provided by data to produce revised posterior beliefs. As with all likelihood-based methods, Bayesian methods may present drawbacks stemming from model misspecification and over-parametrization. A generalization of Bayesian posteriors, called Gibbs posteriors, links the data and population parameters of interest via a loss function rather than a likelihood, thereby avoiding these potential difficulties. At the same time, Gibbs posteriors retain the prior-to-posterior updating of beliefs. We will illustrate the advantages of Gibbs methods in examples and highlight newly developed strategies to analyze the large-sample properties of Gibbs posterior distributions.

When: Tuesday, January 21, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Yixuan Qiu, Purdue University and Carnegie Mellon University

Abstract: Sparse principal component analysis (PCA) is an important technique for dimensionality reduction of high-dimensional data. However, most existing sparse PCA algorithms are based on non-convex optimization, which provides little guarantee of global convergence. Sparse PCA algorithms based on a convex formulation, for example the Fantope projection and selection (FPS), overcome this difficulty, but are computationally expensive. In this work we study sparse PCA based on the convex FPS formulation, and propose a new gradient-based algorithm that is computationally efficient and applicable to large and high-dimensional data sets. Nonasymptotic and explicit bounds are derived for both the optimization error and the statistical accuracy, which can be used for testing and inference problems. We also extend our algorithm to online learning problems, where data are obtained in a streaming fashion. The proposed algorithm is applied to high-dimensional gene expression data for the detection of functional gene groups.

When: Thursday, March 05, 2020—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Jeff Gill, American University

Abstract: Unseen grouping, often called latent clustering, is a common feature in social science data.  Subjects may intentionally or unintentionally group themselves in ways that complicate the statistical analysis of substantively important relationships.

This work introduces a new model-based clustering design which incorporates two sources of heterogeneity. The first source is a random effect that introduces substantively unimportant grouping but must be accounted for. The second source is more important and more difficult to handle since it is directly related to the relationships of interest in the data. We develop a model to handle both of these challenges and apply it to data on terrorist groups, which are notoriously hard to model with conventional tools.

2018 – 2019 Department of Statistics Colloquium Speakers

When: Thursday, August 30, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Jun-Ying Tzeng, North Carolina State University

Abstract: Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g. biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the three-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9 associated with low-density lipoprotein in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.

When: Thursday, September 13, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Frank Konietschke, University of Texas at Dallas

Abstract: Existing tests for factorial designs in the nonparametric case are based on hypotheses formulated in terms of distribution functions. Typical null hypotheses, however, are formulated in terms of some parameters or effect measures, particularly in heteroscedastic settings. In this talk we extend this idea to nonparametric models by introducing a novel nonparametric ANOVA-type-statistic based on ranks which is suitable for testing hypotheses formulated in meaningful nonparametric treatment effects in general factorial designs. This is achieved by a careful in-depth study of the common distribution of rank-based estimators for the treatment effects. Since the statistic is asymptotically not a pivotal quantity we propose different approximation techniques, discuss their theoretic properties and compare them in extensive simulations together with two additional Wald-type tests. An extension of the presented idea to general repeated measures designs is briefly outlined. The proposed rank-based procedures maintain the pre-assigned type-I error rate quite accurately, also in unbalanced and heteroscedastic models. A real data example illustrates the application of the proposed methods.

When: Thursday, October 11, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aboozar Hadavand, Johns Hopkins University

Abstract: Massive Open Online Courses (MOOCs) exploded into popular consciousness in the early 2010s. Huge enrollment numbers set expectations that MOOCs might be a major disruption to the educational landscape. However, there is still uncertainty about the new types of credentials awarded upon completion of MOOCs. We surveyed close to 9,000 learners of the largest data science MOOC program to assess the economic impact of completing these programs on employment prospects. We find that completing the program, which costs less than $500, led to, on average, an increase in salary of $8,230 and an increase in the likelihood of job mobility of 30 percentage points. This high return on investment suggests that MOOCs can have real economic benefits for participants.

When: Thursday, November 1, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Dulal Bhaumik, University of Illinois at Chicago

Abstract: The human brain is an amazingly complex network. Aberrant activities in this network can lead to various neurological disorders such as multiple sclerosis, Parkinson’s disease, Alzheimer’s disease and autism.  fMRI has emerged as an important tool to delineate the neural networks affected by such diseases, particularly autism.  In this talk, we will present a special type of mixed-effects model together with an appropriate procedure for controlling false discoveries to detect disrupted connectivities for developing a neural network in whole brain studies.  Results are illustrated with a large dataset known as Autism Brain Imaging Data Exchange (ABIDE) which includes 361 subjects from eight medical centers.

When: Thursday, November 15, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Yuan Wang, University of South Carolina

Abstract: Topological data analysis (TDA) extracts hidden topological features in signals that cannot be easily decoded by standard signal processing tools. A key TDA method is persistent homology (PH), which summarizes the changes of connected components in a signal through a multiscale descriptor such as persistent landscape (PL). Recent developments indicate that statistical inference on PLs of scalp electroencephalographic (EEG) signals produces markers for localizing seizure foci. However, a key obstacle to applying PH to large-scale clinical EEGs is the ambiguity of performing statistical inference. To address this problem, we develop a unified permutation-based inference framework for testing statistical indifference in PLs of EEG signals before and during an epileptic seizure. Compared with the standard permutation test, the proposed framework is shown to have more robustness when signals undergo non-topological changes and more sensitivity when topological changes occur. Furthermore, the proposed new method improves the average computation time by 15,000-fold.

When: Thursday, December 6, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aiyi Liu, National Institutes of Health

Abstract: Group testing has been widely used as a cost-effective strategy to screen for and estimate the prevalence of a rare disease. While it is well-recognized that retesting is necessary for identifying infected subjects, it is not required for estimating the prevalence. However, one can expect gains in statistical efficiency from incorporating retesting results in the estimation. Research in this context is scarce, particularly for tests with classification errors. For an imperfect test we show that retesting subjects in either positive or negative groups can substantially improve the efficiency of the estimates, and retesting positive groups yields higher efficiency than retesting the same number or proportion of negative groups. Moreover, when the test is subject to no misclassification, performing retesting on positive groups still results in more efficient estimates.
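For readers unfamiliar with prevalence estimation from pooled tests, the following minimal sketch shows the standard estimator based on initial group results only, without the retesting information studied in the talk. It assumes equal pool sizes and known sensitivity and specificity; the function names are hypothetical. A pool tests positive with probability q = Se(1 - (1 - p)^k) + (1 - Sp)(1 - p)^k, which can be inverted for the prevalence p.

```python
def pool_positive_probability(prevalence, pool_size, sensitivity=1.0, specificity=1.0):
    """Probability that a pool of `pool_size` subjects tests positive."""
    all_negative = (1.0 - prevalence) ** pool_size
    return sensitivity * (1.0 - all_negative) + (1.0 - specificity) * all_negative


def estimate_prevalence(positive_pools, total_pools, pool_size,
                        sensitivity=1.0, specificity=1.0):
    """Invert the pool-positive probability to estimate individual-level prevalence."""
    observed_rate = positive_pools / total_pools
    # Estimated probability that every member of a pool is truly negative.
    all_negative = (sensitivity - observed_rate) / (sensitivity + specificity - 1.0)
    # Clamp to [0, 1] so sampling noise cannot produce an invalid estimate.
    all_negative = min(1.0, max(0.0, all_negative))
    return 1.0 - all_negative ** (1.0 / pool_size)


if __name__ == "__main__":
    # 10 positive pools out of 100, pools of size 5, perfect test (illustrative numbers).
    print(estimate_prevalence(10, 100, 5))
```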

When: Thursday, January 17, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Minsuk Shin, Harvard University

Abstract: In this talk, I introduce two novel classes of shrinkage priors for different purposes: functional HorseShoe (fHS) prior for nonparametric subspace shrinkage and neuronized priors for general sparsity inference. In function estimation problems, the fHS prior encourages shrinkage towards parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. I study some desirable theoretical properties including an optimal posterior concentration property on the function and the model selection consistency. I apply the fHS prior to nonparametric additive models for some simulated and real data sets, and the results show that the proposed procedure outperforms the state-of-the-art methods in terms of estimation and model selection. For general sparsity inference, I propose the neuronized priors to unify and extend existing shrinkage priors such as one-group continuous shrinkage priors, continuous spike-and-slab priors, and discrete spike-and-slab priors with point-mass mixtures. The new priors are formulated as the product of a weight variable and a transformed scale variable via an activation function. By altering the activation function, practitioners can easily implement a large class of Bayesian variable selection procedures. Compared with classic spike and slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variable, which results in more efficient MCMC algorithms and more effective posterior modal estimates. I also show that these new formulations can be applied to more general and complex sparsity inference problems, which are computationally challenging, such as structured sparsity and spatially correlated sparsity problems.

When: Thursday, January 24, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Andrés Felipe Barrientos, Duke University

Abstract: We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.

When: Tuesday, January 29, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ryan Sun, Harvard School of Public Health

Abstract: The increasing popularity of biobanks and other genetic compendiums has introduced exciting opportunities to extract knowledge using datasets combining information from a variety of genetic, genomic, environmental, and clinical sources. To manage the large number of association tests that may be performed with such data, set-based inference strategies have emerged as a popular alternative to testing individual features. However, existing methods are often challenged to provide adequate power due to three issues in particular: sparse signals, weak effect sizes, and features exhibiting a diverse variety of correlation structures. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) statistic, a set-based association test designed to detect rare and weak signals while explicitly accounting for arbitrary correlation patterns. We apply GBJ to perform inference on sets of genotypes and sets of phenotypes, and we also discuss strategies for situations where the global null is not the null hypothesis of interest.

When: Thursday, February 21, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ian Dryden, University of Nottingham

Abstract: Networks can be used to represent many systems such as text documents, social interactions and brain activity, and it is of interest to develop statistical techniques to compare networks. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and performing hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens. This is joint work with Katie Severn and Simon Preston.
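To make the identification of networks with graph Laplacians concrete, here is a minimal sketch assuming undirected, weighted networks on a common node set, given as adjacency matrices with zero diagonals. It computes Laplacians L = D - A, a Frobenius distance between two networks, and an entrywise sample mean. This covers only the plain Euclidean case; the embeddings, tangent spaces, and projection mentioned in the abstract are not shown, and the function names are hypothetical.

```python
import math


def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def frobenius_distance(a, b):
    """Euclidean (Frobenius) distance between two matrices of equal shape."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def mean_matrix(matrices):
    """Entrywise mean; the mean of graph Laplacians is again a graph Laplacian."""
    count = len(matrices)
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # path graph on 3 nodes
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # complete graph on 3 nodes
    laplacians = [laplacian(path), laplacian(triangle)]
    print(frobenius_distance(*laplacians))
    print(mean_matrix(laplacians))
```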

When: Thursday, March 7, 2019—3:00 p.m. to 3:50 p.m.

Where: LeConte 210

Speaker: Xihong Lin, Harvard University

Title: Analysis of Massive Data from Genome, Exposome and Phenome: Challenges and Opportunities

Abstract: Massive data from genome, exposome, and phenome are becoming available at an increasing rate with no apparent end in sight. Examples include Whole Genome Sequencing data, smartphone data, wearable devices, Electronic Medical Records and biobanks. The emerging field of Health Data Science presents statisticians and quantitative scientists with many exciting research and training opportunities and challenges. Success in health data science requires scalable statistical inference integrated with computational science, information science and domain science. In this talk, I discuss some of these challenges and opportunities, and emphasize the importance of incorporating domain knowledge in health data science method development and application. I illustrate the key points using several use cases, including analysis of data from Whole Genome Sequencing (WGS) association studies, integrative analysis of different types and sources of data using causal mediation analysis, analysis of multiple phenotypes (pleiotropy) using biobanks and Electronic Medical Records (EMRs), reproducible and replicable research, and cloud computing.

When: Friday, March 8, 2019—2:30 p.m. to 3:30 p.m.

Speaker: Xihong Lin, Harvard University

Title: Scalable Statistical Inference for Dense and Sparse Signals in Analysis of Massive Health Data

Abstract: Massive data in health science, such as whole genome sequencing data and electronic medical record (EMR) data, present many exciting opportunities as well as challenges in data analysis, e.g., how to develop effective strategies for signal detection using whole-genome data when signals are weak and sparse, and how to analyze high-dimensional phenome-wide (PheWAS) data in EMRs. In this talk, I will discuss scalable statistical inference for analysis of high-dimensional independent variables and high-dimensional outcomes in the presence of dense and sparse signals. Examples include testing for signal detection using the Generalized Higher Criticism (GHC) test, and the Generalized Berk-Jones test, the Cauchy-based omnibus for combining different correlated tests, and omnibus principal component analysis of multiple phenotypes. The results are illustrated using data from genome-wide association studies and sequencing studies.

When: Thursday, March 28, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Alessio Farcomeni, Sapienza Università di Roma

Abstract: One limitation of several longitudinal model-based clustering methods is that the number of groups is fixed over time, apart from (mostly) heuristic approaches. In this work we propose a latent Markov model admitting variation in the number of latent states at each time period. The consequence is that (i) subjects can switch from one group to another at each time period and (ii) the number of groups can change at each time period. Clusters can merge, split, or be re-arranged. For a fixed sequence of the number of groups, inference is carried out through maximum likelihood, using forward-backward recursions taken from the hidden Markov literature once an assumption of local independence is granted. A penalized likelihood form is introduced to simultaneously choose an optimal sequence for the number of groups and cluster subjects. The penalized likelihood is optimized through a novel Expectation-Maximization-Markov-Metropolis algorithm. The work is motivated by an analysis of the progress of wellbeing of nations, as measured by the three dimensions of the Human Development Index over the last 25 years. The main findings are that (i) transitions among nation clubs are scarce, and mostly linked to historical events (like the dissolution of the USSR or the war in Syria) and (ii) there is mild evidence that the number of clubs has shrunk over time, where we have four clusters before 2005 and three afterwards. In a sense, nations are getting more and more polarized with respect to standards of well-being. Non-optimized R code for general implementation of the method, data used for the application, code and instructions for replicating the simulation studies and the real data analysis are available at https://github.com/afarcome/LMrectangular.

When: Thursday, April 11, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Scott Bruce, George Mason University

Abstract: The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived from manual inspection and are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this talk is to discuss a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative sum based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum.

When: Thursday, April 25, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Xianyang Zhang, Texas A&M University

Abstract: We present new metrics to quantify and test for (i) the equality of distributions and (ii) the independence between two high-dimensional random vectors. In the first part of this work, we show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high dimensional distributions in the sense that it only detects the equality of means and a specific covariance structure in the high dimensional setup. To overcome such a limitation, we propose a new class of metrics which inherit some nice properties of the energy distance and maximum mean discrepancy in the low-dimensional setting and are capable of detecting the pairwise homogeneity of the low dimensional marginal distributions in the high dimensional setup. The key to our methodology is a new way of defining the distance between sample points in the high-dimensional Euclidean spaces. We construct a high-dimensional two sample t-test based on the U-statistic type estimator of the proposed metric. In the second part of this work, we propose new distance-based metrics to quantify the dependence between two high dimensional random vectors. In the low dimensional case, the new metrics completely characterize the dependence and inherit other desirable properties of the distance covariance. In the growing dimensional case, we show that both the population and sample versions of the new metrics behave as an aggregation of the group-wise population and sample distance covariances. Thus they can quantify group-wise non-linear and non-monotone dependence between two high-dimensional random vectors, going beyond the scope of the distance covariance based on the usual Euclidean distance and the Hilbert-Schmidt Independence Criterion, which have been recently shown only to capture the componentwise linear dependence in high dimension.
We further propose a t-test based on the new metrics to perform high dimensional independence testing and study its asymptotic size and power behaviors under both the HDLSS and HDMSS setups.
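For reference, the classical energy distance that the abstract takes as its starting point can be sketched as follows. This is the usual Euclidean version, computed as a V-statistic (all pairs, including each point with itself), and not the new high-dimensional metric proposed in the talk; the function names are hypothetical and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (p, q) with p in a and q in b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def energy_distance(x, y):
    """Sample energy distance 2 E|X - Y| - E|X - X'| - E|Y - Y'| between two samples."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


if __name__ == "__main__":
    sample_x = [(0.0, 0.0), (0.0, 0.0)]
    sample_y = [(3.0, 4.0), (3.0, 4.0)]
    print(energy_distance(sample_x, sample_y))
```

Each sample is a list of points of equal dimension; the statistic is zero when the two samples coincide and grows as they separate.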