Enhanced prediction of several protein structural attributes with machine learning algorithms
Date of Issue: 2012
School of Physical and Mathematical Sciences
The classical sequence-structure-function paradigm holds that the amino acid sequence of a protein determines its three-dimensional (3D) structure and function. With the great success of genome sequencing projects, the gap between the number of proteins with known sequences and the number with known structures is widening rapidly. In-silico prediction of protein structure from amino acid sequence has the potential to bridge this gap. This thesis presents the machine learning-based computational methods that we developed to predict four protein structural attributes: (1) protein structural class, (2) protein fold, (3) G-protein-coupled receptors (GPCRs), and (4) protein contact map. First, for protein structural class prediction, we propose using the chaos game representation and recurrence quantification analysis to extract a set of features directly from the amino acid sequences. Fisher's discriminant algorithm is adopted as the classifier, and an overall accuracy of about 65% is achieved for proteins from low-similarity datasets. Comparisons with other methods that use the same kind of input information show that the proposed method achieves higher or comparable accuracy, depending on the dataset tested. When a similar idea is applied to the predicted protein secondary structure, the resulting prediction accuracy can exceed 80%. Second, for taxonomy-based protein fold recognition, a new method named TAXFOLD is proposed, which extracts a comprehensive set of global and local features from the PSI-BLAST and PSIPRED profiles. These features are then fed into a support vector machine to perform fold recognition. Experimental tests on seven datasets demonstrate that TAXFOLD achieves an average improvement of 6.9% over the best available taxonomic method and performs comparably to the best conventional template-based fold recognition methods.
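As a rough illustration of the chaos game representation mentioned above, the following minimal sketch maps a sequence to a trajectory of points in the unit square. The four-class amino acid grouping, the corner assignment, and the function name `chaos_game_representation` are illustrative assumptions, not the encoding used in the thesis; features such as those from recurrence quantification analysis would then be computed from the resulting point series.

```python
# Assumption: the 20 amino acids are grouped into four classes so that the
# classical four-corner chaos game can be applied. This grouping is a
# hypothetical choice for illustration only.
GROUPS = {
    "nonpolar": "AVLIPFWM",
    "polar": "GSTCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}

# Assumption: each class is assigned to one corner of the unit square.
CORNERS = {
    "nonpolar": (0.0, 0.0),
    "polar": (0.0, 1.0),
    "acidic": (1.0, 1.0),
    "basic": (1.0, 0.0),
}

AA_TO_CORNER = {
    aa: CORNERS[group] for group, residues in GROUPS.items() for aa in residues
}


def chaos_game_representation(sequence):
    """Return the list of (x, y) points visited by the chaos game.

    Starting from the centre of the unit square, each residue moves the
    current point halfway towards the corner assigned to its class.
    Residues outside the 20 standard amino acids are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for aa in sequence.upper():
        corner = AA_TO_CORNER.get(aa)
        if corner is None:
            continue  # skip non-standard or unknown residue codes
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points
```

Calling `chaos_game_representation("MKVLA")` yields one point per residue; the x and y coordinate series can each be treated as a time series for downstream feature extraction.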
Third, for hierarchical classification of GPCRs, we develop a new method named PCA-GPCR that can classify GPCRs at all five levels of the GPCR classification hierarchy. It relies on a comprehensive set of 1497 sequence-derived features. Because the dimensionality of this feature space is very high, principal component analysis is employed to reduce it to 32. Jackknife tests on a large dataset show that the overall accuracies of PCA-GPCR at the five levels (from the first to the fifth) are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. Experimental comparisons show that PCA-GPCR consistently outperforms BLAST-based classification and other competing predictors. Finally, for protein contact map prediction, a consensus approach named LRcon is proposed to improve on the performance of existing predictors. This approach combines the prediction results from several complementary predictors using a logistic regression model. Tests on targets from the recent CASP9 experiment and on a large dataset of 856 protein chains show that LRcon outperforms not only its component predictors but also simple averaging and voting schemes.
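The consensus idea behind combining several contact predictors with logistic regression can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the LRcon implementation: the toy scores, the three hypothetical component predictors, the learning rate, and the function names `fit_logistic` and `consensus_probability` are all invented for the example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent.

    X: list of score vectors, one score per component predictor.
    y: list of 0/1 labels (1 = residue pair is in contact).
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def consensus_probability(scores, w, b):
    """Combine component predictor scores into one contact probability."""
    return sigmoid(b + sum(wj * sj for wj, sj in zip(w, scores)))


# Toy data (assumption): each row holds the contact scores that three
# hypothetical predictors assign to one residue pair.
TRAIN_X = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.2, 0.1, 0.3],
    [0.1, 0.3, 0.2],
    [0.7, 0.2, 0.8],
    [0.3, 0.4, 0.1],
]
TRAIN_Y = [1, 1, 0, 0, 1, 0]

weights, bias = fit_logistic(TRAIN_X, TRAIN_Y)
```

Unlike simple averaging, the fitted weights let the combination trust some component predictors more than others; `consensus_probability([0.6, 0.3, 0.7], weights, bias)` then returns a single combined score for a new residue pair.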
DRNTU::Science::Mathematics::Applied mathematics::Information theory