Please use this identifier to cite or link to this item:
|Title:||Efficient and scalable techniques for pagerank-based graph analytics||Authors:||Yang, Renchi||Keywords:||Engineering::Computer science and engineering::Information systems::Database management||Issue Date:||2020||Publisher:||Nanyang Technological University||Source:||Yang, R. (2020). Efficient and scalable techniques for pagerank-based graph analytics. Doctoral thesis, Nanyang Technological University, Singapore.||Abstract:||Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great eﬃcacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also ﬁnd numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more eﬃcient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to reﬁne the results in an optimized and non-trivial way. In particular, TEA+ oﬀers practical eﬃciency and eﬀectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a ﬁxed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited eﬀectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very diﬀerent purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into eﬀective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an eﬀective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and eﬀectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly eﬃcient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings.||URI:||https://hdl.handle.net/10356/145185||DOI:||10.32657/10356/145185||Rights:||This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).||Fulltext Permission:||open||Fulltext Availability:||With Fulltext|
|Appears in Collections:||SCSE Theses|
Updated on Mar 7, 2021
Updated on Mar 7, 2021
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.