Please use this identifier to cite or link to this item:
Title: Efficient and scalable techniques for pagerank-based graph analytics
Authors: Yang, Renchi
Keywords: Engineering::Computer science and engineering::Information systems::Database management
Issue Date: 2020
Publisher: Nanyang Technological University
Source: Yang, R. (2020). Efficient and scalable techniques for pagerank-based graph analytics. Doctoral thesis, Nanyang Technological University, Singapore.
Abstract: Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great efficacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also find numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more efficient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a fixed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited effectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very different purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into effective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an effective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings.
DOI: 10.32657/10356/145185
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Theses

Files in This Item:
File Description SizeFormat 
Thesis-YANG RENCHI.pdf1.92 MBAdobe PDFView/Open

Page view(s)

Updated on Jan 14, 2022

Download(s) 20

Updated on Jan 14, 2022

Google ScholarTM




Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.