Many real-world datasets are naturally represented as tensors, or multi-dimensional arrays.
Tensor decomposition is an important tool to analyze tensors for various applications
such as latent concept discovery, trend analysis, clustering, and anomaly detection.
However, existing tools for tensor analysis either do not scale well to billion-scale tensors or offer limited functionality.
In this paper, we propose BIGtensor, a large-scale tensor mining library that tackles both of these problems. Carefully designed for scalability, BIGtensor decomposes tensors at least 100× larger than the current state of the art can handle. We demonstrate how BIGtensor can help users discover hidden concepts and analyze trends in large-scale tensors that are hard to process with existing tools.
Comparison of functionalities provided by BIGtensor and other state-of-the-art tensor tools.
(P: PARAFAC, T: Tucker, PN: PARAFAC-Nonnegative, TN: Tucker-Nonnegative, C: CMTF)

| Functionality | BIGtensor | Tensor Toolbox | FlexiFaCT |
|---|---|---|---|
| Tensor Decomposition | P, PN, T, TN, C | P, PN, T | P, PN, C |
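For readers unfamiliar with the "P" entry in the comparison, PARAFAC approximates a 3-way tensor as a sum of rank-one terms, one factor matrix per mode. The following is a minimal NumPy sketch of the textbook alternating least squares (ALS) procedure on a small dense array. It is not BIGtensor's distributed implementation, and the names `khatri_rao`, `unfold`, `parafac_als`, and `reconstruct`, as well as the toy input, are assumptions made for illustration only.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def unfold(x, mode):
    """Matricize x along `mode`: that axis becomes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def parafac_als(x, rank, n_iter=200, seed=0):
    """Fit a rank-`rank` PARAFAC model to a dense 3-way array by ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Hold the other two factor matrices fixed and solve a
            # linear least-squares problem for the current one.
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = unfold(x, mode) @ kr @ np.linalg.pinv(gram)
    return factors


def reconstruct(factors):
    """Rebuild the dense tensor from its three factor matrices."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


# Hypothetical toy input: an exactly rank-2 tensor of shape 6 x 5 x 4.
rng = np.random.default_rng(42)
truth = [rng.standard_normal((dim, 2)) for dim in (6, 5, 4)]
x = reconstruct(truth)

estimate = parafac_als(x, rank=2)
rel_err = np.linalg.norm(x - reconstruct(estimate)) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.2e}")
```

A dense sketch like this materializes the full unfolded tensor and the Khatri-Rao product in memory, which is exactly what stops working at the billion-scale mode lengths discussed here. Scalable tools have to avoid those intermediates.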
Scalability of BIGtensor and other tools. We report the mode length and the density of the largest data each tool can process using two representative tensor decomposition algorithms, PARAFAC and Tucker. Their nonnegative versions and CMTF show similar performance. BIGtensor decomposes data with 100× larger mode length than either of the other tools, and also decomposes 100× denser data than the Tensor Toolbox.

| Scalability | Method | BIGtensor | Tensor Toolbox | FlexiFaCT |
|---|---|---|---|---|
| Mode Length & Nonzeros | PARAFAC | ≥ 10^9 | ≤ 10^7 | ≤ 10^7 |
| Mode Length & Nonzeros | Tucker | ≥ 10^9 | ≤ 10^7 | - |
| Density | PARAFAC | ≥ 10^-5 | ≤ 10^-7 | ≥ 10^-5 |
| Density | Tucker | ≥ 10^-5 | ≤ 10^-7 | - |
The binary code of BIGtensor is available here.
| Name | Modes | Dimensions | Nonzeros | Download | Description |
|---|---|---|---|---|---|
| Microsoft Academic Graph | Paper - Author - Affiliation | 123M × 123M × 2.7M | 325M | DOWN | Papers and their metadata |
| NELL | NounPhrase1 - NounPhrase2 - Context | 26M × 26M × 48M | 144M | - | "Read the Web" Project |
| MovieLens | User - Movie - YearMonth | 72K × 11K × 157 | 10M | DOWN | Movie rating data |
| YELP | User - Business - YearMonth | 71K × 16K × 108 | 334K | DOWN | Business rating data |
| PhoneCall | Source - Destination - Date | 30M × 30M × 62 | 1B | - | Phone call traffic data |
| Random | I - J - K | 1K~1B × 1K~1B × 1K~1B | 10K~10B | - | Synthetic random data |
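The density figures in the scalability comparison can be related to the datasets listed above. The sketch below assumes the usual definition of density, the number of nonzeros divided by the product of the mode lengths, and uses the rounded sizes from the dataset table, so the results are only approximate. The `density` helper and the `datasets` dictionary are hypothetical names introduced for this example.

```python
import math


def density(dims, nnz):
    """Fraction of cells that are nonzero: nnz / (I * J * K)."""
    return nnz / math.prod(dims)


# Rounded mode lengths and nonzero counts taken from the dataset table.
datasets = {
    "MovieLens": ((72_000, 11_000, 157), 10_000_000),
    "YELP": ((71_000, 16_000, 108), 334_000),
    "NELL": ((26_000_000, 26_000_000, 48_000_000), 144_000_000),
}

for name, (dims, nnz) in datasets.items():
    print(f"{name}: density ~ {density(dims, nnz):.2e}")
```

Under this definition, tensors with very long modes have a tiny density even when they hold hundreds of millions of nonzeros, which is why the comparison reports mode length and density separately.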