Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/105937
Full metadata record
DC Field | Value | Language
dc.contributor.author | Zhou, Xinrui | en
dc.contributor.author | Yin, Rui | en
dc.contributor.author | Zheng, Jie | en
dc.contributor.author | Kwoh, Chee-Keong | en
dc.date.accessioned | 2019-06-19T06:46:18Z | en
dc.date.accessioned | 2019-12-06T22:01:06Z | -
dc.date.available | 2019-06-19T06:46:18Z | en
dc.date.available | 2019-12-06T22:01:06Z | -
dc.date.issued | 2018 | en
dc.identifier.citation | Zhou, X., Yin, R., Zheng, J., & Kwoh, C.-K. (2019). An encoding scheme capturing generic priors and properties of amino acids improves protein classification. IEEE Access, 7, 7348-7356. doi:10.1109/ACCESS.2018.2890096 | en
dc.identifier.uri | https://hdl.handle.net/10356/105937 | -
dc.description.abstract | Feature engineering aims at representing non-numeric data with numeric features that keep the essential information of the underlying problem, and it is a non-trivial process in building a predictive model. In bioinformatics, there is a profound scale of DNA and protein sequences available, but far from being fully utilized. Computational models can facilitate the analyses of large-scale data. However, most computational models require a numeric representation as input. Expert knowledge can help design features to cast the raw symbolic data effectively. But generally, the features vary from case to case and have to be redesigned for a problem. Automated feature engineering, i.e., an encoding scheme automating the construction of features, saves the redesigning process and allows the researchers to try different representations with minimal effort. This is more in line with the explosion of data and the goal of building an intelligent system. In this paper, we introduce an encoding scheme for protein sequences, which encodes the representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, Context-Free Encoding Scheme (CFreeEnS), was proposed for a dataset with labels for pairwise sequences. Here, we improve the method by making it applicable to a batch of protein sequences, requiring no sequence alignment beforehand. The improved method is applied to protein classification at the functional level, including identifying antimicrobial peptides, screening tumor homing peptides, and detecting hemolytic peptides and phage virion proteins. Compared with the traditional methods using task-specific designed features, CFreeEnS improves the predicting accuracy, with an increase ranging from 5.54% to 14.14%. The results indicate that the improved CFreeEnS, free from dependence on carefully designed features, is promising in capturing generic priors and essential properties of amino acids, thereby serving as an automated feature engineering method for protein sequences. | en
dc.description.sponsorship | MOE (Min. of Education, S’pore) | en
dc.format.extent | 9 p. | en
dc.language.iso | en | en
dc.relation.ispartofseries | IEEE Access | en
dc.relation.uri | https://doi.org/10.21979/N9/4YDZED | en_US
dc.rights | © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. | en
dc.subject | DRNTU::Engineering::Computer science and engineering | en
dc.subject | Encoding Scheme | en
dc.subject | Feature Engineering | en
dc.title | An encoding scheme capturing generic priors and properties of amino acids improves protein classification | en
dc.type | Journal Article | en
dc.contributor.school | School of Computer Science and Engineering | en
dc.identifier.doi | 10.1109/ACCESS.2018.2890096 | en
dc.description.version | Published version | en
item.fulltext | With Fulltext | -
item.grantfulltext | open | -
Appears in Collections: SCSE Journal Articles

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.