Department of Statistics


The colloquia listed here are presented by visiting academic researchers and members of the business community, as well as by USC faculty and graduate students. The research topics introduced by the speakers delve into all areas of statistics.

Faculty, students, and off-campus visitors are invited to attend any of our colloquia.

2018 – 2019 Department of Statistics Colloquium Speakers

When: Thursday, August 30, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Jun-Ying Tzeng, North Carolina State University

Abstract: Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the three-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9 associated with low-density lipoprotein in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.

When: Thursday, September 13, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Frank Konietschke, University of Texas at Dallas

Abstract: Existing tests for factorial designs in the nonparametric case are based on hypotheses formulated in terms of distribution functions. Typical null hypotheses, however, are formulated in terms of some parameters or effect measures, particularly in heteroscedastic settings. In this talk we extend this idea to nonparametric models by introducing a novel nonparametric ANOVA-type statistic based on ranks which is suitable for testing hypotheses formulated in terms of meaningful nonparametric treatment effects in general factorial designs. This is achieved by a careful in-depth study of the joint distribution of rank-based estimators for the treatment effects. Since the statistic is asymptotically not a pivotal quantity, we propose different approximation techniques, discuss their theoretical properties, and compare them in extensive simulations together with two additional Wald-type tests. An extension of the presented idea to general repeated measures designs is briefly outlined. The proposed rank-based procedures maintain the pre-assigned type-I error rate quite accurately, even in unbalanced and heteroscedastic models. A real data example illustrates the application of the proposed methods.
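As an illustration of the kind of rank-based treatment effects the abstract refers to, here is a minimal sketch (not the speaker's procedure) of the standard nonparametric relative effect estimator p̂_i = (R̄_i − 1/2)/N, computed from midranks of the pooled sample; the function name and interface are our own.

```python
import numpy as np
from scipy.stats import rankdata

def relative_effects(samples):
    """Nonparametric relative effects p_i = P(Z < X_i) + 0.5 * P(Z = X_i),
    with Z drawn from the pooled mixture, estimated via midranks:
    p_hat_i = (mean pooled rank of sample i - 1/2) / N."""
    pooled = np.concatenate(samples)
    N = pooled.size
    ranks = rankdata(pooled)  # midranks: ties receive averaged ranks
    effects, start = [], 0
    for s in samples:
        r_bar = ranks[start:start + len(s)].mean()
        effects.append((r_bar - 0.5) / N)
        start += len(s)
    return np.array(effects)
```

When all groups have the same distribution, every estimated effect is near 1/2, which is the natural "no treatment effect" benchmark for these hypotheses.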

When: Thursday, September 27, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Tim Hanson, Medtronic

Abstract: To be determined

When: Thursday, October 11, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aboozar Hadavand, Johns Hopkins University

Abstract: To be determined

When: Thursday, November 1, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Dulal Bhaumik, University of Illinois at Chicago

Abstract: To be determined

When: Thursday, November 15, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Yuan Wang, University of South Carolina

Abstract: To be determined

When: Thursday, December 6, 2018—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Aiyi Liu, National Institutes of Health

Abstract: Group testing has been widely used as a cost-effective strategy to screen for and estimate the prevalence of a rare disease. While it is well-recognized that retesting is necessary for identifying infected subjects, it is not required for estimating the prevalence. However, one can expect gains in statistical efficiency from incorporating retesting results in the estimation. Research in this context is scarce, particularly for tests with classification errors. For an imperfect test we show that retesting subjects in either positive or negative groups can substantially improve the efficiency of the estimates, and retesting positive groups yields higher efficiency than retesting a same number or proportion of negative groups. Moreover, when the test is subject to no misclassification, performing retesting on positive groups still results in more efficient estimates.
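The efficiency gains discussed above are relative to the basic group-testing prevalence estimator, a generic version of which (pools only, no retesting, so not the speaker's retesting estimators) can be sketched as follows; `sens` and `spec` are the assumed test sensitivity and specificity, and the function name is ours.

```python
import numpy as np

def prevalence_from_pools(n_positive, n_pools, k, sens=1.0, spec=1.0):
    """MLE of prevalence p from group testing with pool size k.
    Each pool tests positive with probability
    sens * (1 - (1-p)^k) + (1 - spec) * (1-p)^k."""
    pi_hat = n_positive / n_pools
    # invert the pool-positivity probability for q = (1-p)^k
    q = (sens - pi_hat) / (sens + spec - 1.0)
    q = min(max(q, 0.0), 1.0)  # clip to the feasible range
    return 1.0 - q ** (1.0 / k)
```

With a perfect test and pools of size 1 this reduces to the sample proportion, which is a quick sanity check on the inversion.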

When: Thursday, January 17, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Minsuk Shin, Harvard University

Abstract: In this talk, I introduce two novel classes of shrinkage priors for different purposes: functional HorseShoe (fHS) prior for nonparametric subspace shrinkage and neuronized priors for general sparsity inference. In function estimation problems, the fHS prior encourages shrinkage towards parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. I study some desirable theoretical properties including an optimal posterior concentration property on the function and the model selection consistency. I apply the fHS prior to nonparametric additive models for some simulated and real data sets, and the results show that the proposed procedure outperforms the state-of-the-art methods in terms of estimation and model selection. For general sparsity inference, I propose the neuronized priors to unify and extend existing shrinkage priors such as one-group continuous shrinkage priors, continuous spike-and-slab priors, and discrete spike-and-slab priors with point-mass mixtures. The new priors are formulated as the product of a weight variable and a transformed scale variable via an activation function. By altering the activation function, practitioners can easily implement a large class of Bayesian variable selection procedures. Compared with classic spike and slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variable, which results in more efficient MCMC algorithms and more effective posterior modal estimates. I also show that these new formulations can be applied to more general and complex sparsity inference problems, which are computationally challenging, such as structured sparsity and spatially correlated sparsity problems.

When: Thursday, January 24, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Andrés Felipe Barrientos, Duke University

Abstract: We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.

When: Tuesday, January 29, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ryan Sun, Harvard School of Public Health

Abstract: The increasing popularity of biobanks and other genetic compendiums has introduced exciting opportunities to extract knowledge using datasets combining information from a variety of genetic, genomic, environmental, and clinical sources. To manage the large number of association tests that may be performed with such data, set-based inference strategies have emerged as a popular alternative to testing individual features. However, existing methods are often challenged to provide adequate power due to three issues in particular: sparse signals, weak effect sizes, and features exhibiting a diverse variety of correlation structures. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) statistic, a set-based association test designed to detect rare and weak signals while explicitly accounting for arbitrary correlation patterns. We apply GBJ to perform inference on sets of genotypes and sets of phenotypes, and we also discuss strategies for situations where the global null is not the null hypothesis of interest.

When: Thursday, February 21, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Ian Dryden, University of Nottingham

Abstract: Networks can be used to represent many systems such as text documents, social interactions and brain activity, and it is of interest to develop statistical techniques to compare networks. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and performing hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens. This is joint work with Katie Severn and Simon Preston.
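A minimal sketch of the extrinsic building blocks the abstract describes, identifying each network with its graph Laplacian L = D − A and using the Frobenius (Euclidean) metric; the function names are ours, and the framework's embeddings, tangent spaces, and projection step are not shown.

```python
import numpy as np

def graph_laplacian(A):
    """Graph Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A

def laplacian_mean(adjacencies):
    """Euclidean (Frobenius) sample mean of a set of networks,
    computed in the space of their graph Laplacians."""
    return np.mean([graph_laplacian(A) for A in adjacencies], axis=0)

def laplacian_distance(A1, A2):
    """Frobenius distance between two networks via their Laplacians."""
    return float(np.linalg.norm(graph_laplacian(A1) - graph_laplacian(A2)))
```

Under this embedding, standard multivariate tools (means, principal components, two-sample tests) apply directly to the vectorized Laplacians, which is the spirit of the extrinsic approach.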

When: Thursday, March 7, 2019—3:00 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Xihong Lin, Harvard University

Title: Analysis of Massive Data from Genome, Exposome and Phenome: Challenges and Opportunities

Abstract: Massive data from genome, exposome, and phenome are becoming available at an increasing rate with no apparent end in sight. Examples include Whole Genome Sequencing data, smartphone data, wearable devices, Electronic Medical Records and biobanks. The emerging field of Health Data Science presents statisticians and quantitative scientists with many exciting research and training opportunities and challenges. Success in health data science requires scalable statistical inference integrated with computational science, information science and domain science. In this talk, I discuss some of such challenges and opportunities, and emphasize the importance of incorporating domain knowledge in health data science method development and application. I illustrate the key points using several use cases, including analysis of data from Whole Genome Sequencing (WGS) association studies, integrative analysis of different types and sources of data using causal mediation analysis, analysis of multiple phenotypes (pleiotropy) using biobanks and Electronic Medical Records (EMRs), reproducible and replicable research, and cloud computing.

When: Friday, March 8, 2019—2:30 p.m. to 3:30 p.m.
Where: Capstone House Campus Room

Speaker: Xihong Lin, Harvard University

Title: Scalable Statistical Inference for Dense and Sparse Signals in Analysis of Massive Health Data

Abstract: Massive data in health science, such as whole genome sequencing data and electronic medical record (EMR) data, present many exciting opportunities as well as challenges in data analysis, e.g., how to develop effective strategies for signal detection using whole-genome data when signals are weak and sparse, and how to analyze high-dimensional phenome-wide (PheWAS) data in EMRs. In this talk, I will discuss scalable statistical inference for analysis of high-dimensional independent variables and high-dimensional outcomes in the presence of dense and sparse signals. Examples include the Generalized Higher Criticism (GHC) test and the Generalized Berk-Jones test for signal detection, the Cauchy-based omnibus test for combining different correlated tests, and omnibus principal component analysis of multiple phenotypes. The results are illustrated using data from genome-wide association studies and sequencing studies.
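Of the tests mentioned, the Cauchy-based omnibus combination is simple enough to sketch directly: each p-value is transformed to a Cauchy variate and the variates are averaged, exploiting the fact that the weighted sum is approximately standard Cauchy under the null even when the component tests are correlated. The function name and interface are our own.

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Cauchy combination of (possibly correlated) p-values:
    T = sum_i w_i * tan((0.5 - p_i) * pi) is approximately standard
    Cauchy under the null, giving combined p = 0.5 - arctan(T)/pi."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    T = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(T) / np.pi
```

A single very small p-value dominates the sum, which is what makes the combination powerful for sparse signals.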

When: Thursday, March 28, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Alessio Farcomeni, Sapienza Università di Roma

Abstract: One limitation of several longitudinal model-based clustering methods is that the number of groups is fixed over time, apart from (mostly) heuristic approaches. In this work we propose a latent Markov model admitting variation in the number of latent states at each time period. The consequence is that (i) subjects can switch from one group to another at each time period and (ii) the number of groups can change at each time period. Clusters can merge, split, or be re-arranged. For a fixed sequence of the number of groups, inference is carried out through maximum likelihood, using forward-backward recursions taken from the hidden Markov literature once an assumption of local independence is granted. A penalized likelihood form is introduced to simultaneously choose an optimal sequence for the number of groups and cluster subjects. The penalized likelihood is optimized through a novel Expectation-Maximization-Markov-Metropolis algorithm. The work is motivated by an analysis of the progress of well-being of nations, as measured by the three dimensions of the Human Development Index over the last 25 years. The main findings are that (i) transitions among nation clubs are scarce and mostly linked to historical events (like the dissolution of the USSR or the war in Syria) and (ii) there is mild evidence that the number of clubs has shrunk over time, with four clusters before 2005 and three afterwards. In a sense, nations are getting more and more polarized with respect to standards of well-being. Non-optimized R code for a general implementation of the method, the data used for the application, and code and instructions for replicating the simulation studies and the real data analysis are available at
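The forward recursions the abstract borrows from the hidden Markov literature can be sketched as a standard scaled forward pass computing the sequence log-likelihood (a generic textbook version, not the authors' EM-Markov-Metropolis algorithm); names and interface are ours.

```python
import numpy as np

def forward_loglik(pi0, P, emis):
    """Scaled HMM forward recursion: log-likelihood of an observed sequence.
    pi0: initial state probabilities, shape (S,)
    P:   state transition matrix, shape (S, S)
    emis: emission likelihoods per time point and state, shape (T, S)."""
    alpha = pi0 * emis[0]          # unnormalized filtering weights at t = 0
    loglik = 0.0
    for t in range(1, emis.shape[0]):
        c = alpha.sum()            # scaling constant avoids underflow
        loglik += np.log(c)
        alpha = (alpha / c) @ P * emis[t]
    loglik += np.log(alpha.sum())
    return loglik
```

Scaling at each step keeps the recursion numerically stable for long sequences while the accumulated log scaling constants recover the exact log-likelihood.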

When: Thursday, April 11, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Scott Bruce, George Mason University

Abstract: The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived from manual inspection and are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this talk is to discuss a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative-sum-based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum.
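The cumulative-sum scanning idea can be illustrated with a generic mean-change CUSUM over an ordered sequence (here, think of a per-frequency summary of nonstationarity); this is a simplified building block rather than the speaker's full iterative procedure, and the names are ours.

```python
import numpy as np

def cusum_split(x):
    """Mean-change CUSUM scan over an ordered sequence x of length n:
    returns the split index k maximizing |S_k - (k/n) * S_n| / sqrt(n),
    the candidate boundary between the bands x[:k] and x[k:]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    k = np.arange(1, n)
    csum = np.cumsum(x)[:-1]                       # S_1, ..., S_{n-1}
    stat = np.abs(csum - k * x.sum() / n) / np.sqrt(n)
    split = int(np.argmax(stat)) + 1
    return split, float(stat.max())
```

Applied iteratively within each resulting segment, and combined with a stopping test, this style of scan recovers a data-driven partition of the frequency axis.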

When: Thursday, April 25, 2019—2:50 p.m. to 3:50 p.m.
Where: LeConte 210

Speaker: Xianyang Zhang, Texas A&M University

Abstract: We present new metrics to quantify and test for (i) the equality of distributions and (ii) the independence between two high-dimensional random vectors. In the first part of this work, we show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high-dimensional distributions, in the sense that it only detects the equality of means and a specific covariance structure in the high-dimensional setup. To overcome this limitation, we propose a new class of metrics which inherit some nice properties of the energy distance and maximum mean discrepancy in the low-dimensional setting and are capable of detecting the pairwise homogeneity of the low-dimensional marginal distributions in the high-dimensional setup. The key to our methodology is a new way of defining the distance between sample points in high-dimensional Euclidean spaces. We construct a high-dimensional two-sample t-test based on the U-statistic type estimator of the proposed metric. In the second part of this work, we propose new distance-based metrics to quantify the dependence between two high-dimensional random vectors. In the low-dimensional case, the new metrics completely characterize the dependence and inherit other desirable properties of the distance covariance. In the growing-dimensional case, we show that both the population and sample versions of the new metrics behave as an aggregation of the group-wise population and sample distance covariances. Thus they can quantify group-wise non-linear and non-monotone dependence between two high-dimensional random vectors, going beyond the scope of the distance covariance based on the usual Euclidean distance and the Hilbert-Schmidt Independence Criterion, which have recently been shown to capture only the componentwise linear dependence in high dimension. We further propose a t-test based on the new metrics to perform high-dimensional independence testing and study its asymptotic size and power behaviors under both the HDLSS and HDMSS setups.
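For reference, the classical sample energy distance that the first part of the talk takes as its starting point (based on the usual Euclidean norm, i.e., the version argued to lose power in high dimensions) can be computed as follows; the function name is ours.

```python
import numpy as np

def energy_distance(X, Y):
    """Sample energy distance between samples X (n x d) and Y (m x d):
    2 * E||X - Y|| - E||X - X'|| - E||Y - Y'|| with Euclidean norms."""
    def mean_pdist(A, B):
        # mean pairwise Euclidean distance between rows of A and rows of B
        diffs = A[:, None, :] - B[None, :, :]
        return np.sqrt((diffs ** 2).sum(axis=-1)).mean()
    return 2 * mean_pdist(X, Y) - mean_pdist(X, X) - mean_pdist(Y, Y)
```

The statistic is zero when the two samples coincide and positive when they are separated, which is the property the proposed high-dimensional metrics generalize at the level of low-dimensional marginals.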