Title: Unsupervised Bayesian generative methods
Authors: Li, Shaohua
Keywords: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Issue Date: 2016
Abstract: Unsupervised learning is a class of machine learning methods that learn hidden structures from unlabeled data alone. Probabilistic generative models are a recent development in unsupervised learning. The generative modeling framework can model rich structures and learn diverse types of relations within the data, which is usually difficult with traditional methods. The Bayesian method is a principled paradigm of statistical inference for combining prior knowledge with the given data using Bayes' theorem. It aims to obtain robust estimates of various model hypotheses (model parameters, model structures or values of latent variables), even if the data are scarce, noisy or biased. Once prior distributions are placed on different parts of a generative model, e.g. the parameters, the latent variables, or even the model structure, model inference becomes Bayesian inference, which is usually more robust. In this thesis, Bayesian modeling and inference techniques are applied to three different learning tasks involving unsupervised generative models: 1) inference of a single latent variable, 2) inference of the model structure, and 3) inference of model parameters. First, in the scenario of disambiguating different academic authors with the same name, a central problem is to determine the probability that two sets of publications are authored by the same person, based on categorical features of the publications, such as coauthor names and publication venues. Previous methods usually use heuristic similarity measures, such as the Jaccard coefficient or cosine similarity. Such measures perform poorly when the two sets are small, which is typical in author name disambiguation. 
To better address this problem, a Bayesian generative model of a publication set is proposed, in which an author's preference is an unknown latent parameter, and a likelihood ratio is derived to estimate the probability that a person authors a publication set. This likelihood ratio, as a novel similarity measure, is mathematically principled and verified to perform well even when the two compared sets are small. In a conventional hierarchical agglomerative clustering framework, this likelihood ratio is used to decide whether to treat two sets of papers as being authored by the same person and merge them. The whole system outperforms previous methods by a significant margin. Second, in a mixture model, it is often difficult to choose an appropriate number of hidden mixing components that avoids both overfitting and underfitting. This problem is referred to as model selection. Conventionally, the component numbers are tuned by grid search against certain criteria on training data. Recent work applies nonparametric Bayesian methods, such as the Dirichlet Process (DP) and the Indian Buffet Process (IBP), to assign different prior probabilities to models with different hidden states and to determine the most likely model structure. However, nonparametric methods have a few limitations. A principled automatic model selection method is desirable. In a multi-layer Markov model, the Factorial Hidden Markov Model (FHMM), the need for model selection is more pressing, as several component numbers must be specified. The recently developed Factorized Information Criterion (FIC) and Factorized Asymptotic Bayesian (FAB) inference have proven to be an effective model selection framework for mixture models. FIC computes an approximation of the marginal log-likelihood of the data, given the hidden state configurations. FAB finds the best component numbers and model parameters. The contributions are: 1) FAB inference is extended to FHMM, yielding FAB-FHMM. 
Experimental results show that FAB-FHMM significantly outperforms the state-of-the-art nonparametric Bayesian iFHMM and the Variational FHMM in terms of model selection accuracy and held-out perplexity. 2) The model selection process of FAB is explained by the proof of a novel “winner-take-all” theorem. 3) It is proved that, under certain conditions, the FIC regularization is equivalent to the Chinese Restaurant Process (CRP) prior, a popular Bayesian nonparametric prior. This finding bridges the two very different lines of model selection methods. The third task concerns word embedding, a technique that maps words in a natural language to continuous vectors which encode the semantic/syntactic regularities among the words. Most existing word embedding methods can be categorized into neural embedding models and Matrix Factorization (MF)-based methods. However, some models are opaque to probabilistic interpretation, and existing MF-based word embedding methods may incur loss of corpus information. In addition, it is desirable to incorporate global latent factors, such as topics, sentiments or writing styles, into the word embedding model. Since generative models provide a principled way to incorporate latent factors, a Bayesian generative word embedding model is proposed, which is easy to interpret and can serve as a basis for more sophisticated latent factor models. The model inference is reduced to a low-rank weighted positive semidefinite approximation problem. This problem is approached by eigendecomposition on a submatrix, followed by online blockwise regression, which is scalable and avoids the information loss in Singular Value Decomposition (SVD). In experiments on 7 common benchmark datasets, our vectors are competitive with word2vec, and better than those of other MF-based methods. Finally, as future work, three extensions to the above three works are proposed. 
In particular, the Bayesian generative word embedding model is extended to incorporate document topics, which are represented in the same form as the word vectors and referred to as topic embeddings. The topic embeddings modify the word distributions in a manner similar to “latent words”. This model can be viewed as a continuous counterpart of Latent Dirichlet Allocation (LDA), with a few advantages.
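Illustrative note on the first task: the abstract states that heuristic measures such as the Jaccard coefficient and cosine similarity perform poorly on small sets. The sketch below is not taken from the thesis; it only computes these two standard measures on small, hypothetical coauthor sets (all names and helper functions are made up for illustration) to show how coarse the scores become when each set holds one or two items. The thesis's likelihood-ratio measure is not reproduced here.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_binary(a: set, b: set) -> float:
    """Cosine similarity of the binary indicator vectors of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical coauthor-name sets for three small publication clusters.
cluster_1 = {"A. Tan", "B. Lee"}
cluster_2 = {"B. Lee", "C. Wong"}
cluster_3 = {"D. Lim"}

# One shared coauthor out of two: the score can only take a few values.
print(jaccard(cluster_1, cluster_2))        # 1/3
print(cosine_binary(cluster_1, cluster_2))  # 0.5

# No overlap gives exactly 0, even if both clusters belong to one person.
print(jaccard(cluster_1, cluster_3))        # 0.0
```

With sets this small, a single shared or missing coauthor moves the score between a handful of discrete values, which is one plausible reading of why the abstract calls such measures unreliable in this setting.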
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Theses

Files in This Item:
File: Thesis - Shaohua Li v3.pdf (Restricted Access)
Description: main article
Size: 3.41 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.