Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/183943
Title: | Parallel processing framework for analyzing taxi mobility data |
Authors: | Ho, Guo Liang Ken |
Keywords: | Computer and Information Science |
Issue Date: | 2025 |
Publisher: | Nanyang Technological University |
Source: | Ho, G. L. K. (2025). Parallel processing framework for analyzing taxi mobility data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183943 |
Project: | CCDS24-0722 |
Abstract: | The unprecedented growth of data generation has created a need for efficient data processing frameworks that can handle large volumes of raw, structured and unstructured data. These data then undergo transformation and preprocessing into a format suitable for data analysis and machine learning. This report examines the limitations of using pandas with multiprocessing for large-scale data processing and explores parallel processing frameworks such as Hadoop and Apache Spark. A 253 GB mobility dataset, consisting of taxi trip data in Singapore from January to December 2022, was provided by the Land Transport Authority (LTA) and used to study the performance of different data processing frameworks. A Spark cluster was set up to process the raw mobility data, utilizing profiling techniques to identify and resolve logical and hardware bottlenecks, ensuring optimal utilization of the compute resources Spark ran on. A performance comparison revealed that Apache Spark consistently outperformed pandas with multiprocessing when processing large mobility datasets across multiple months. Subsequently, the processed mobility data was structured using different data models, namely One Big Table and Fact-Dimension, and their query performance was evaluated. Finally, these data models were hosted on Google BigQuery, where curated multi-layered data models were implemented to optimize data retrieval, improve accessibility and support different analytics and machine learning applications. |
URI: | https://hdl.handle.net/10356/183943 |
Schools: | College of Computing and Data Science |
Research Centres: | Singapore-ETH Centre |
Fulltext Permission: | restricted |
Fulltext Availability: | With Fulltext |
Appears in Collections: | CCDS Student Reports (FYP/IA/PA/PI) |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
CCDS24-0722-FinalSubmission.pdf (Restricted Access) | | 1.67 MB | Adobe PDF | View/Open |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.