Collaborative learning from multiple data sources
Date of Issue: 2013
School of Computer Engineering
Centre for Advanced Information Systems
A machine learning classifier can be trained on a labeled input data set, which comprises samples and their corresponding labels, to predict the labels of samples that it has never seen before. The problem of combining several machine learning classifiers to achieve a result that is greater than the sum of its parts has been studied under the guise of collaborative learning. The vast majority of collaborative learning methods simply train several base classifiers and take a majority vote on the predictions made by each base classifier. In this thesis, we study how machine learning classifiers can work collaboratively in either a data-centric or a task-centric fashion. Unlike typical methods, we formulate collaborative learning as a coherent optimization framework, in which different classifiers are learned from different representations or partitions of the data. We first propose a data-centric collaborative method that is related to semi-supervised learning, which aims to improve the performance of a classifier trained on a limited amount of labeled data by utilizing unlabeled data. Specifically, we improve the transductive support vector machine (SVM), an existing semi-supervised learning algorithm, by employing a multi-view learning paradigm. Multi-view learning makes use of multiple perspectives, so-called views, of each data sample. For example, in text classification, the standard view typically contains a large number of raw content features such as term frequency, while a secondary view may contain a small number of highly informative domain-specific features. We propose a novel two-view transductive SVM that takes advantage of both the abundance of unlabeled data and its multiple representations to improve overall classifier performance.
The idea is fairly simple: train a classifier on each of the two views of both labeled and unlabeled data while imposing a global constraint that both classifiers assign the same class label to the same data sample, labeled or unlabeled. We also incorporate manifold regularization, which is a type of graph-based semi-supervised learning, into our learning framework. The proposed two-view transductive SVM was evaluated on both synthetic and real-life datasets. Experimental results show that our algorithm performs up to 5% better than a single-view learning approach, especially when the amount of labeled data is small. Following this, we consider the situation in which learning tasks are equipped with plain data (only labeled data with a conventional representation), but several similar learning problems are presented at the same time. Algorithms working in this scenario are referred to as task-centric collaborative learning. We first study the problem of online learning to classify instances belonging to several different but related tasks. Despite their uniqueness, individual classification problems still share certain characteristics with others in the group. Conventional methods treat each task independently, without considering the latent commonality among tasks. We consider the case where the information learned by one task can be used to enhance the learning of other tasks, and propose a collaborative online multitask learning method that learns the classification models of the related tasks in parallel. The basic idea is to first build a generic global model from the pooled task data, and subsequently leverage the global model to build the personalized classification model of each individual task in a collaborative learning manner. Building upon the success of collaborative learning, we then tailor the collaborative online classification method to cope with ranking problems.
Ranking resembles classification in that both assign one of several possible labels to a new instance, but differs in that there is an order relation between the labels. Our proposed collaborative online ranking algorithm is able to rank data generated by a group of tasks by combining group and individual characteristics. To the best of our knowledge, our work is the first learning-to-rank attempt in an online multitask learning setting. We illustrate the efficacy of the proposed task-centric collaborative online learning on a synthetic dataset and several real-life problems: spam email filtering, bioinformatics classification, and user-generated content analysis. Experimental results show that the proposed algorithms are able to take advantage of both the individual and global models to collaboratively learn multiple correlated tasks, achieving an overall improvement in classification and ranking performance. Moreover, our online collaborative algorithms are highly efficient and scalable, and are especially suitable for streaming data or massive datasets that are too large to fit in memory. Due to the fundamental differences between data-centric and task-centric collaborative learning, it may be difficult, if not impossible, to combine both into a coherent scheme. Thus, one major limitation in the scope of our research is that we do not seek to combine them. Instead, we study the two models independently of one another. Practitioners are free to use either our data-centric or our task-centric collaborative algorithms based on their real-life needs.
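The task-centric idea in the abstract (a global model built from pooled task data, then personalized per-task models that lean on it) can be illustrated with a small sketch. The abstract does not state the thesis's update rule, so this is an assumption-laden illustration and not the author's algorithm: it uses a standard passive-aggressive (PA-I) step, and the function names and the mixing weight `mix` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # margin already satisfied: leave the model unchanged
    tau = min(C, loss / norm_sq)  # step size, capped by aggressiveness C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def collaborative_online_learning(stream, n_tasks, dim, mix=0.5, C=1.0):
    """Process (task_id, x, y) triples one at a time.

    Returns (global_model, task_models). `mix` is a hypothetical knob that
    controls how strongly each task model is pulled toward the global model.
    """
    global_w = [0.0] * dim
    task_w = [[0.0] * dim for _ in range(n_tasks)]
    for task_id, x, y in stream:
        # Global model: updated on the pooled data of all tasks.
        global_w = pa_update(global_w, x, y, C)
        # Personalized model: blend with the global model, then fit the example.
        blended = [mix * g + (1.0 - mix) * w
                   for g, w in zip(global_w, task_w[task_id])]
        task_w[task_id] = pa_update(blended, x, y, C)
    return global_w, task_w


def predict(w, x):
    """Binary prediction in {-1, +1}."""
    return 1 if dot(w, x) >= 0 else -1
```

Because each example is processed once and only the weight vectors are kept, a loop of this shape needs memory proportional to the number of tasks times the feature dimension, which is the property that makes online methods attractive for streaming data.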
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition