dc.contributor.authorNguyen, Thuy Diem
dc.date.accessioned2015-11-12T07:18:42Z
dc.date.accessioned2017-07-23T08:30:33Z
dc.date.available2015-11-12T07:18:42Z
dc.date.available2017-07-23T08:30:33Z
dc.date.copyright2015en_US
dc.date.issued2015
dc.identifier.citationNguyen, T. D. (2015). Efficient agglomerative hierarchical clustering for biological sequence analysis. Doctoral thesis, Nanyang Technological University, Singapore.
dc.identifier.urihttp://hdl.handle.net/10356/65568
dc.description.abstractCluster analysis or clustering is an important data mining technique widely used for pattern recognition and information retrieval. In the context of computational biology and bioinformatics, clustering has been successfully applied in many subfields such as in evolutionary biology to group homologous genetic sequences into gene families, in transcriptomics to group genes with related expression patterns, or in ecology to describe communities of organisms in heterogeneous environments. Cluster analysis originated from an anthropology study by Driver and Kroeber in 1932. Since then, over a hundred clustering algorithms have been developed to target input datasets with different characteristics. Out of these algorithms, two most prevalent methods are hierarchical clustering and k-means clustering. The former algorithm is particularly useful for analysing genetic datasets in evolutionary biology studies because there are inherent hierarchical relationships amongst the genetic sequences extracted from related organisms in these datasets. The hierarchical clustering analysis of biological sequences is however computational expensive in terms of both execution time and memory usage. Consequently, this analysis was rarely applied to input datasets with more than ten thousand sequences. In recent years, new high-throughput sequencing technologies can produce more data in a shorter time and for a cheaper cost. As a result, more and more raw sequence data is efficiently produced from the genetic materials of live organisms, many of which are examined for the first time. Hence, there is a pressing need for more effective computational techniques to study the relationships among the newly-discovered species. Motivated by the necessity for more effective clustering algorithms to study biological sequence data, I explore the use of parallel computing technologies with new algorithms to perform agglomerative hierarchical sequence clustering in a more effective way without compromising the accuracy of the results. Specifically, I develop new parallel algorithms using general-purpose computing on graphics processing units (or GPGPU) to speed-up the most compute intensive part of the hierarchical clustering process: the pairwise distance matrix computation of the input sequences. Graphic cards from NVIDIA are becoming a commodity in recent years and are available in many personal computers nowadays. With an NVIDIA GPU card, any laptop or desktop running these parallel algorithms can speed up the matrix computation process by ten times to a hundred times faster compared to the computation on a CPU using a traditional sequential (single-threaded) algorithm. Besides reducing execution time, I have built a more memory-efficient and robust agglomerative hierarchical clustering algorithm. This new clustering method reduces memory usage by applying a data summarization technique to maintain a compact version of only a part of the distance matrix instead of loading the whole matrix into the main memory. An important feature of this algorithm is the capability to produce the same hierarchical structure as the standard method. The new algorithm supports all three popular linkage schemes including: average-linkage, single-linkage, and complete-linkage. Among them, average-linkage clustering is widely used in many research and real-world applications, making a memory-efficient average-linkage clustering algorithm in great demand. Amongst various types of biological sequence cluster analyses, this thesis focuses on a particular type of sequence clustering application called operational taxonomic unit (OTU) clustering to demonstrate the usefulness of the afore-mentioned efficient algorithms for processing large genomic datasets. I have developed two OTU clustering pipelines for 454 pyrosequencing datasets called CRiSPy-Embed and CRiSPy-CUDA. A comprehensive evaluation benchmark using randomly simulated datasets and popular mock datasets has been designed to test the performance of these pipelines against existing tools. The benchmark results show that these tools can produce similar or more accurate OTU groupings than most existing OTU hierarchical clustering tools in a much more efficient manner.en_US
dc.format.extent168 p.en_US
dc.language.isoenen_US
dc.subjectDRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognitionen_US
dc.subjectDRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processingen_US
dc.subjectDRNTU::Engineering::Computer science and engineering::Hardware::Performance and reliabilityen_US
dc.subjectDRNTU::Engineering::Computer science and engineering::Computer applications::Life and medical sciencesen_US
dc.titleEfficient agglomerative hierarchical clustering for biological sequence analysisen_US
dc.typeThesis
dc.contributor.researchParallel and Distributed Computing Centreen_US
dc.contributor.schoolSchool of Computer Engineeringen_US
dc.contributor.supervisorBertil Schmidt
dc.contributor.supervisorKwoh Chee Keongen_US
dc.description.degreeDOCTOR OF PHILOSOPHY (SCE)en_US


Files in this item

FilesSizeFormatView
main.pdf5.686Mbapplication/pdfView/Open

This item appears in the following Collection(s)

Show simple item record