Beyond Gaussian : copula graphical models for non-Gaussian data
Date of Issue: 2015
School of Electrical and Electronic Engineering
Shell Research, MIT
Graphical models, which can be viewed as a marriage of graph theory and probability theory, provide a powerful formalism for multivariate statistical modeling of complex systems. Graphical models harness the complexity of large-scale systems by representing the statistical relations among a large number of variables in a compact manner. This compact structure can in turn be leveraged to derive highly efficient techniques for data analysis. However, research on graphical models for continuous variables has so far focused mostly on Gaussian statistics. This limitation severely handicaps the utility of graphical models in real-world applications, which often involve non-Gaussian variables. Quantities in physics and the earth sciences, for instance, are often strictly positive (e.g., amplitude, energy and magnitude), and thus cannot be described accurately by Gaussian distributions. In addition, the behavior of extreme events, such as hurricanes and floods, is theoretically governed by extreme-value distributions with fat tails rather than by Gaussian distributions. In this thesis, we move beyond Gaussian graphical models, and propose a portfolio of novel graphical models for non-Gaussian data. Such graphical models are powerful tools for solving real-life inference problems while avoiding the restrictive assumptions of Gaussian statistics, hence yielding more reliable solutions. The first part of the thesis deals with "nominal" (non-extremal) data. This type of data follows neither Gaussian nor fat-tailed distributions. Gaussian copulas are employed here to tie marginal distributions of any kind (Gaussian, non-Gaussian and even non-parametric) together to form a joint distribution. Through the language of graphical models, we further impose a sparse dependence structure on the resulting non-Gaussian distribution, leading to sparse copula Gaussian graphical models (CGGM).
Such models retain the mathematical convenience of Gaussian graphical models, yet are applicable to marginally non-Gaussian data. Along this line, we proceed to construct hidden variable copula Gaussian graphical models (HVCGGM) and discrete copula Gaussian graphical models (DCGGM). The two models are applicable to different practical scenarios. Specifically, the HVCGGM yields sparse graphical models when data are unavailable for some relevant variables; the DCGGM extends the CGGM to discrete data in a straightforward manner. Since real data are often non-stationary and statistical models designed for stationary data may not yield accurate results, we further consider learning graphical models for piecewise-stationary data. In other words, we first detect change points in the time series, and then infer a graphical model within each stationary segment. Besides modeling nominal data, we also build graphical models to handle extreme events. Extreme events are often modeled in two stages: first the extreme-value marginal distributions are estimated, and then the joint distribution of extreme values is constructed based on the marginals. The second part of the thesis aims to exploit graphical models to estimate the marginal distributions accurately. The main difficulty of fitting extreme-value marginal distributions to measurements is the lack of data, as extreme events are by definition rare. To improve the accuracy of the estimated parameter values, we utilize graphical models to couple marginal parameters whose covariates take similar values. For instance, extreme wave heights vary smoothly across an ocean, and such prior knowledge can be expressed by graphical models; in this example, the spatial coordinates are the covariates. We also propose using graphical models to capture the spatio-temporal dependence among marginal parameters, and to forecast the future trend of extreme events.
The last part of the thesis is devoted to the second stage of extreme-event modeling, that is, coupling all the marginals to form a joint distribution. We first investigate the performance of CGGMs on extreme-value data and propose an efficient interpolation algorithm to predict extreme events at unmonitored locations in a spatial domain. The CGGM, however, does not possess tail dependence, and it is therefore theoretically unfavorable for quantifying extreme events. This motivates us to build a graphical model based on extreme-value copulas. As multivariate extreme-value copulas are intractable, we aim to construct pairwise copula graphical models. In particular, the ensemble-of-trees of pairwise copulas (ETPC) is introduced. Concretely, extreme-value marginal distributions are stitched together by means of pairwise copulas, which in turn are the building blocks of the ensemble of trees. We further prove that, under mild conditions, the ETPC model exhibits the favorable property of tail dependence between an arbitrary pair of sites (variables); consequently, the model is able to reliably capture statistical dependence between extreme values at different sites. The proposed models have been successfully applied in areas as diverse as computational biology and neuroscience, geophysics and earth science, and sociology, indicating that the portfolio of graphical models provides a powerful toolbox for tackling real-world problems with non-Gaussian variables.
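As an illustration of the Gaussian copula construction behind the CGGM in the first part of the thesis, the following is a minimal sketch, not the thesis's own implementation. It assumes non-parametric marginals estimated by ranks and no tied observations; the function names (`normal_scores`, `correlation`, `latent_correlation`) are hypothetical. Each variable is mapped to Gaussian scores through its empirical distribution, and the correlation matrix of those scores describes the dependence in the latent Gaussian domain.

```python
from statistics import NormalDist, mean


def normal_scores(column):
    """Map one variable to Gaussian scores through its empirical CDF.

    Assumption: no ties; tied values would simply receive consecutive ranks.
    """
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the empirical CDF strictly inside (0, 1),
        # so the inverse normal CDF is always finite.
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def latent_correlation(data):
    """Correlation matrix of the Gaussian scores.

    `data` is a list of columns, one list of observations per variable.
    """
    z = [normal_scores(col) for col in data]
    p = len(z)
    return [[correlation(z[i], z[j]) for j in range(p)] for i in range(p)]


# Strictly positive, heavily skewed toy data (illustrative values only).
x = [0.1, 0.5, 2.0, 9.0, 40.0]
y = [v ** 3 for v in x]        # a monotone transform of x
w = [3.0, 0.2, 7.5, 1.1, 0.9]  # an unrelated variable
R = latent_correlation([x, y, w])
```

Because the scores depend only on ranks, any monotone transform of a variable leaves the latent correlation unchanged, which is why the marginals can be arbitrary. A sparse graphical model would then be obtained by estimating a sparse inverse of this matrix, for example with an l1-penalized estimator; that step is not shown here.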
DRNTU::Engineering::Computer science and engineering::Computing methodologies