Partition Aware Connected Component Computation in Distributed Systems

Ha-Myung Park, Namyong Park, Sung-Hyon Myaeng, and U Kang


How can we find all connected components in an enormous graph with billions of nodes and edges? Finding connected components is a fundamental operation for various graph computation tasks such as pattern recognition, reachability, graph compression, etc. Many algorithms have been proposed for decades, but most of them are not scalable enough to process recent web scale graphs. Recently, a MapReduce algorithm was proposed to handle such large graphs. However, the algorithm repeatedly reads and writes numerous intermediate data that cause network overload and prolong the running time. In this paper, we propose PACC (Partition-Aware Connected Components), a new distributed algorithm based on graph partitioning for load-balancing and edge-filtering. Experimental results show that PACC significantly reduces the intermediate data, and provides up to 8.6× faster performance than the current state-of-the-art MapReduce algorithm on real world graphs.


PACC is described in the following paper.


The binary code used in this paper is available. [Download]

Real World Datasets

Skitter169641511095298 Internet topology graph, from traceroutes run daily in 2005 SNAP
Patent377476816518948 Citation network among US Patents SNAP
LiveJournal484757168993773 LiveJournal online social network SNAP
Friendster656083661806067135 Friendster online social network SNAP
Twitter416522301468365182 Twitter follower-followee network Advanced Networking Lab at KAIST
SubDomain892477392043203933 Domain level hyperlink network on the Web Yahoo Webscope
YahooWeb7202421736636600779 Page level hyperlink network on the Web Yahoo Webscope

Synthetic Datasets

RMAT 21111533131457280
RMAT 234118887125829120
RMAT 2515215025503316480
RMAT 27561013822013265920
RMAT 292070159158053063680

* RMAT graphs are generated using parameters (a, b, c, d) = (0.57, 0.19, 0.19, 0.05).