Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/146888
Title: In-memory analytical query processing on GPUs
Authors: Paul, Johns
Keywords: Engineering::Computer science and engineering::Computer systems organization::Performance of systems
Engineering::Computer science and engineering::Computer systems organization::Computer system implementation
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Paul, J. (2021). In-memory analytical query processing on GPUs. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/146888
Abstract: The high global memory bandwidth and the large number of parallel cores available in modern Graphics Processing Units (GPUs) make them ideal for high-performance Online Analytical Processing (OLAP) systems. However, it is challenging to design efficient, high-performance query processing systems for GPUs for the following reasons: 1) the rapid evolution of GPU hardware in the last decade, 2) the significant differences in hardware architecture between GPUs and CPUs, 3) the high overhead of moving data between the CPU and GPU, and 4) the small global memory of a single GPU, which necessitates accessing remote data over PCIe or NVLink interconnects when processing large data sets. In this thesis, we study existing query processing systems for GPUs and propose techniques to improve query execution efficiency on GPUs. We begin by studying the performance of hash join, one of the most compute- and memory-intensive relational operators in OLAP systems. Specifically, we first revisit the major GPU hash join implementations of the past decade that were designed to execute on a single GPU. We then detail how these implementations take advantage of different GPU architecture features and conduct a comprehensive evaluation of their performance and cost-efficiency on different generations of GPUs. This sheds light on the impact of different GPU architecture features on the performance of the hash join operation and identifies the factors guiding the choice of these features. We further study how data characteristics such as skew and match rate affect the performance of GPU hash join implementations, and propose novel techniques to improve the performance of the hash join operation when joining input relations with severe skew or a high match rate.
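The build/probe structure of the hash join operation studied above can be sketched as follows. This is a minimal single-threaded illustration of the classic algorithm, not the thesis's GPU implementations (which parallelize both phases across thousands of threads); the function and variable names are purely illustrative.

```python
def hash_join(build_rel, probe_rel):
    """Join two lists of (key, payload) tuples on their keys."""
    # Build phase: insert the (typically smaller) relation into a hash table.
    table = {}
    for key, payload in build_rel:
        table.setdefault(key, []).append(payload)
    # Probe phase: look up each probe tuple. A high match rate means each
    # probe key finds many build-side matches, and skew means some keys
    # occur far more often than others -- the data characteristics whose
    # performance impact the thesis studies.
    out = []
    for key, probe_payload in probe_rel:
        for build_payload in table.get(key, []):
            out.append((key, build_payload, probe_payload))
    return out

r = [(1, "a"), (2, "b"), (2, "c")]
s = [(2, "x"), (3, "y")]
print(hash_join(r, s))  # [(2, 'b', 'x'), (2, 'c', 'x')]
```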
Our evaluation finds that the proposed techniques avoid performance degradation when joining skewed input relations and achieve up to 2.5x better performance when joining input relations with a high match rate. Next, we extend our study of the hash join operation to modern multi-GPU architectures. The recent scale-up of GPU hardware through the integration of multiple GPUs into a single machine and the introduction of higher-bandwidth interconnects like NVLink 2.0 has opened new opportunities for relational query processing on multiple GPUs. However, due to the unique characteristics of GPUs and their interconnects, existing hash join implementations spend up to 66% of their execution time moving data between the GPUs and achieve lower than 50% utilization of the newer high-bandwidth interconnects. This leads to extremely poor scalability of hash join performance on multiple GPUs, which can even be slower than on a single GPU. In this thesis, we propose MG-Join, a scalable partitioned hash join implementation for the multiple GPUs of a single machine. To improve bandwidth utilization, we develop a novel multi-hop routing scheme for cross-GPU communication that adaptively chooses an efficient route for each data flow to minimize congestion. Our experiments on a DGX-1 machine show that MG-Join significantly reduces communication overhead and achieves up to 97% utilization of the bisection bandwidth of the interconnects, resulting in significantly better scalability. Overall, MG-Join outperforms state-of-the-art hash join implementations by up to 2.5x. MG-Join further improves the overall performance of TPC-H queries by up to 4.5x over the multi-GPU version of OmniSci, an open-source GPU database. Finally, we focus on optimizing the execution of entire queries on GPUs. Specifically, we look at improving the execution efficiency of just-in-time (JIT) compilation-based query processing systems on GPUs.
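The partitioning step underlying a partitioned hash join such as MG-Join can be illustrated as follows: tuples are routed to a device by a hash of their key so that matching keys land on the same GPU and each GPU can then join its partition locally. This sketch only models the routing decision; the adaptive multi-hop routing that MG-Join adds to minimize interconnect congestion, and the actual cross-GPU transfers, are not modeled. The constant and function names are illustrative assumptions.

```python
NUM_GPUS = 4  # illustrative device count, not tied to the DGX-1 topology

def partition(rel, num_parts=NUM_GPUS):
    """Split a list of (key, payload) tuples into per-device partitions.

    Tuples with equal keys always hash to the same partition, so each
    device can build and probe its own local hash table independently.
    """
    parts = [[] for _ in range(num_parts)]
    for key, payload in rel:
        parts[hash(key) % num_parts].append((key, payload))
    return parts

# Integer keys 0 and 4 land on device 0; key 1 lands on device 1.
print(partition([(0, "a"), (1, "b"), (4, "c")]))
```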
We observe that JIT compilation-based query processing suffers severe under-utilization of GPU hardware due to divergent execution and degraded parallelism arising from resource contention. To address these issues, we propose a JIT compilation-based query engine named Pyper that reduces divergent execution and improves hardware utilization during query execution on GPUs. Specifically, Pyper introduces two new operators, Shuffle and Segment, for query plan transformation, which can be plugged into a physical query plan to reduce divergent execution and resolve resource contention, respectively. To determine the insertion points for these two operators, we present an analytical model that inserts Shuffle and Segment operators into a query plan in a cost-based manner. Our experiments show that 1) the analytical modeling of divergent execution and resource contention improves the accuracy of the cost model, and 2) Pyper improves the performance of TPC-H and SSB queries by up to 5.01x and 3.06x when compared to OmniSci and Hawk, respectively.
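The intuition behind a Shuffle-style compaction can be sketched as follows: after a selective filter, surviving tuples are scattered, so a warp of GPU threads would be partly idle (divergent); compacting the survivors into dense batches lets downstream operators run with fully occupied warps. This is only the underlying idea under assumed names and a toy warp size; the real Pyper operators and their cost model are far more involved.

```python
WARP = 4  # toy warp size; real NVIDIA warps have 32 threads

def dense_batches(tuples, pred, warp=WARP):
    """Filter tuples, then re-batch survivors into warp-sized groups.

    Without compaction, survivors stay scattered across their original
    positions and many threads per warp would have no work (divergence).
    After compaction, every batch except possibly the last is full.
    """
    survivors = [t for t in tuples if pred(t)]  # Shuffle-style compaction
    return [survivors[i:i + warp] for i in range(0, len(survivors), warp)]

# 8 even values survive out of 16 inputs -> two fully occupied batches.
print(dense_batches(range(16), lambda x: x % 2 == 0))
# [[0, 2, 4, 6], [8, 10, 12, 14]]
```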
URI: https://hdl.handle.net/10356/146888
DOI: 10.32657/10356/146888
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Theses

Files in This Item:
File: Thesis_final.pdf (3.25 MB, Adobe PDF)


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.