How can we find all connected components in an enormous graph with billions of nodes and edges? Finding
connected components is a fundamental operation for various graph computation tasks such as pattern recognition, reachability, graph compression, etc. Many algorithms have been proposed for decades, but most of them are not scalable enough
to process recent web scale graphs. Recently, a MapReduce algorithm was proposed to handle such large graphs.
However, the algorithm repeatedly reads and writes numerous intermediate data that cause network overload and prolong the running time.
In this paper, we propose PACC (Partition-Aware Connected Components), a new distributed algorithm based on graph partitioning for load-balancing and edge-filtering. Experimental results show that PACC significantly reduces the intermediate data,
and provides up to 8.6× faster performance than the current state-of-the-art MapReduce algorithm on real world graphs.
PACC is described in the following paper.
- Partition Aware Connected Component Computation in Distributed Systems
Ha-Myung Park, Namyong Park, Sung-Hyon Myaeng, U Kang.
The binary code used in this paper is available.
Real World Datasets