PegasusN

an open-source graph-mining system with massive scalability

Watch the introduction video

Overview

PageRank
Connected Components
Triangle
Running time of various systems on three tasks: PageRank, connected component computation, and triangle enumeration. Missing methods for some datasets mean they failed to run on the datasets. PegasusN is the only one that succeeds in processing CW12 with 6 billion nodes and 72 billion edges, the largest graph used in this experiment.

PegasusN is a Peta-scale graph mining system, fully written in Java. It runs in parallel, distributed manner on top of Hadoop and Spark.

PegasusN provides large scale algorithms for important graph mining tasks:

  • Graph Structure Analyses
    • PageRank
    • Random Walk with Restart (RWR)
    • Radius/Diameter
    • Degree
    • Single Source Shortest Path (SSSP)
    • Label Propagation
    • Connected Component
  • Subgraph Enumeration
    • Triangle
    • Cliques
    • Other pattern graphs
  • Graph Generation
    • R-MAT
    • Kronecker
    • Random
  • Graph Visualization

The details of PegasusN can be found in the following paper:

Ha-Myung Park, Chiwan Park, U Kang.
PegasusN: A Scalable and Versatile Graph Mining System.
AAAI 2018, New Orleans, Lousiana, USA. (Demo Paper)

Performance of PegasusN

Following figures show the running time of various systems on three tasks: PageRank, connected component computation, and triangle enumeration. Missing methods for some datasets mean they failed to run on the datasets. PegasusN is the only one that succeeds in processing CW12 with 6 billion nodes and 72 billion edges, the largest graph used in this experiment.
PageRank
Connected Components
Triangle

Graph Mining with PegasusN

Graph Mining is an area of data mining to find patterns, rules, and anomalies of graphs.

Why Should We Care?

Graphs or networks are everywhere, ranging from the Internet Web graph, social networks(FaceBook, Twitter), biological networks, and many more. Finding patterns, rules, and anomalies have numerous applications including, but not limited to, the followings:
  • Ranking web pages by search engine
  • 'viral' or 'word-of-mouth' marketing
  • Patterns of disease with potential impact for drug discovery
  • Computer network security: email/IP traffic and anomaly detection

Why PegasusN?

Existing works on graph mining has limited scalability: usually, the maximum graph size is order of millions. PegasusN breaks the limit by scaling up the algorithms to billion-scale graphs. The breakthrough was possible by the careful algorithm design and implementation for Hadoop, a massive cloud computing platform. To summarize, PegasusN has three major advantages.
  1. Large Graph Mining Package
    Graphs with billions of nodes and edges
  2. Parallel Algorithms on Hadoop and Spark
    Massive cloud computing platform
  3. Open Source
Thanks to PegasusN, we could analyze the largest publicly available Web Graphs (ClueWeb12) with 72 billion edges.