Fast and Scalable Distributed Boolean Tensor Factorization

Namyong Park, Sejoon Oh, and U Kang

Overview

How can we analyze tensors that are composed of 0’s and 1’s? How can we efficiently analyze such Boolean tensors with millions or even billions of entries? Boolean tensors often represent relationship, membership, or occurrences of events such as subject-relation-object tuples in knowledge base data (e.g., ‘Seoul’-‘is the capital of’-‘South Korea’). Boolean tensor factorization (BTF) is a useful tool for analyzing binary tensors to discover latent factors from them. Furthermore, BTF is known to produce more interpretable and sparser results than normal factorization methods. Although several BTF algorithms exist, they do not scale up for large-scale Boolean tensors.
In this paper, we propose DBTF, a distributed algorithm for Boolean tensor factorization running on the Spark framework. By caching computation results, exploiting the characteristics of Boolean operations, and with careful partitioning, DBTF successfully tackles the high computational costs and minimizes the intermediate data. Experimental results show that DBTF decomposes up to 163–323× larger tensors than existing methods in 70–400× less time, and exhibits near-linear scalability in terms of tensor size, rank, density, and machines.


Paper

Our proposed method DBTF is described in the following paper.

Code

The binary code used in the paper is available here. [Download]


Datasets

Name Dimensionality Non-zeros Description Source Download
Facebook 64K × 64K × 870 1.5M Temporal relationship data between users
(User-User-Date)
Max Planck Institute
for Software Systems
Download
DBLP 418K × 3.5K × 50 1.3M DBLP publication data
(Author-Conference-Year)
DBLP Download
CAIDA-DDoS-S 9.3K × 9.3K × 3.9K 22.8M Network DDoS attack traffic (small)
(Source IP-Destination IP-Time)
Center for Applied
Internet Data Analysis
Download
CAIDA-DDoS-L 9.3K × 9.3K × 393K 331.8M Network DDoS attack traffic (large)
(Source IP-Destination IP-Time)
Center for Applied
Internet Data Analysis
Download
NELL-S 14.5K × 14.5K × 28.8K 76.9M Knowledge base data (small)
(Subject-Object-Predicate)
NELL Web
NELL-L 112K × 112K × 213K 17.9M Knowledge base data (large)
(Subject-Object-Predicate)
NELL Web
Synthetic-scalability 26~213 × 26~213 × 26~213 2.6K~5.5B Synthetic tensors
Synthetic-error 100 × 100 × 100 6.5K~240K Synthetic tensors

People