Fast and Scalable Distributed Boolean Tensor Factorization

Overview

How can we analyze tensors that are composed of 0’s and 1’s? How can we efficiently analyze such Boolean tensors with millions or even billions of entries? Boolean tensors often represent relationship, membership, or occurrences of events such as subject-relation-object tuples in knowledge base data (e.g., ‘Seoul’-‘is the capital of’-‘South Korea’). Boolean tensor factorization (BTF) is a useful tool for analyzing binary tensors to discover latent factors from them. Furthermore, BTF is known to produce more interpretable and sparser results than normal factorization methods. Although several BTF algorithms exist, they do not scale up for large-scale Boolean tensors.
In this paper, we propose DBTF, a distributed algorithm for Boolean tensor factorization running on the Spark framework. By caching computation results, exploiting the characteristics of Boolean operations, and with careful partitioning, DBTF successfully tackles the high computational costs and minimizes the intermediate data. Experimental results show that DBTF decomposes up to 16³–32³× larger tensors than existing methods in 70–400× less time, and exhibits near-linear scalability in terms of tensor size, rank, density, and machines.

Paper

Our proposed method DBTF is described in the following paper.

Fast and Scalable Distributed Boolean Tensor Factorization
Namyong Park, Sejoon Oh, and U Kang.
IEEE International Conference on Data Engineering (ICDE) 2017, San Diego, CA, USA.
[PDF] [BIBTEX]

Code

The binary code used in the paper is available here. [Download]

Datasets

Name	Dimensionality	Non-zeros	Description	Source	Download
Facebook	64K × 64K × 870	1.5M	Temporal relationship data between users (User-User-Date)	Max Planck Institute for Software Systems	Download
DBLP	418K × 3.5K × 50	1.3M	DBLP publication data (Author-Conference-Year)	DBLP	Download
CAIDA-DDoS-S	9.3K × 9.3K × 3.9K	22.8M	Network DDoS attack traffic (small) (Source IP-Destination IP-Time)	Center for Applied Internet Data Analysis	Download
CAIDA-DDoS-L	9.3K × 9.3K × 393K	331.8M	Network DDoS attack traffic (large) (Source IP-Destination IP-Time)	Center for Applied Internet Data Analysis	Download
NELL-S	14.5K × 14.5K × 28.8K	76.9M	Knowledge base data (small) (Subject-Object-Predicate)	NELL Web	–
NELL-L	112K × 112K × 213K	17.9M	Knowledge base data (large) (Subject-Object-Predicate)	NELL Web	–
Synthetic-scalability	2⁶~2¹³ × 2⁶~2¹³ × 2⁶~2¹³	2.6K~5.5B	Synthetic tensors	–	–
Synthetic-error	100 × 100 × 100	6.5K~240K	Synthetic tensors	–	–

People

Namyong Park (Seoul National University, Korea)
Sejoon Oh (Seoul National University, Korea)
U Kang (Seoul National University, Korea)