High-performance tensor-train primitives on graphics processors

This technology is a GPU tensor-core-accelerated software toolkit for tensor-train operations, data transfer, and multi-GPU scheduling that improves the performance of tensor-train decomposition, tensor-train neural-network layers, and density matrix renormalization group (DMRG) calculations at scale.

Unmet Need: Scalable, fast tensor-train computation for large tensors

Many modern problems in machine learning, scientific computing, and data analysis rely on very large, high-dimensional tensors. Working with these tensors directly quickly becomes impractical because the computation and memory required for a full tensor grow exponentially with its order (the number of dimensions). As a result, useful tensor methods can be too slow, too memory-hungry, or too difficult to scale across modern hardware for real-world datasets and models. There is a need for implementations that make these tensor operations efficient, hardware-optimized, and scalable without sacrificing usable accuracy.
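
To make that scaling concrete, here is a short back-of-the-envelope comparison in Python (the mode size n, order d, and uniform TT rank r below are illustrative assumptions, not figures from this listing): a full tensor stores n^d entries, while a tensor-train representation needs only about d·n·r² entries.

    # Illustrative scaling only; n, d, and r are assumed values.
    n, d, r = 10, 20, 8                       # mode size, order, TT rank
    full_entries = n ** d                     # exponential in the order d
    tt_entries = d * n * r * r                # roughly linear in d
    print(f"full tensor: {full_entries:.2e} entries")  # 1.00e+20
    print(f"TT format:   {tt_entries:,} entries")      # 12,800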

The Technology: Tensor-core acceleration for tensor-train algorithms and layers

This technology is a GPU-optimized software approach that speeds up tensor-train (TT) computations by improving how the core math routines and data movement are executed, enabling TT decomposition, TT-based neural-network compression, and DMRG-style tensor-network workloads to run faster and scale to larger problems. It targets the operations that typically dominate runtime in TT workflows, namely repeated tensor contractions and matrix factorizations such as the SVD, and implements them to better match modern GPU hardware. It also reduces time spent moving data between the CPU and GPU and supports splitting large computations across multiple GPUs when a single device would otherwise limit performance. Overall, it is a performance-focused implementation layer intended to make TT methods practical for larger-scale engineering and research workloads.
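
As an illustration of the kind of routine involved, below is a minimal sketch of the classical TT-SVD decomposition written in PyTorch: a left-to-right sweep of reshapes and truncated SVDs that peels off one TT core per mode. The function tt_svd and its device fallback are hypothetical, for illustration only; this is not the toolkit's API, which is not disclosed in this listing.

    import torch

    def tt_svd(x, max_rank, device=None):
        """Sketch of classical TT-SVD: sweep left to right, unfolding the
        tensor and applying truncated SVDs to peel off one TT core per
        mode. Hypothetical helper, not the toolkit's API."""
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        x = x.to(device)
        dims = list(x.shape)
        cores, r_prev = [], 1
        for n_k in dims[:-1]:
            # Unfold so rows pair the incoming rank with the current mode.
            x = x.reshape(r_prev * n_k, -1)
            u, s, vh = torch.linalg.svd(x, full_matrices=False)
            r = min(max_rank, s.shape[0])       # truncate to the target rank
            cores.append(u[:, :r].reshape(r_prev, n_k, r))
            x = s[:r, None] * vh[:r, :]         # remainder carried forward
            r_prev = r
        cores.append(x.reshape(r_prev, dims[-1], 1))
        return cores

    cores = tt_svd(torch.randn(8, 8, 8, 8), max_rank=4)
    print([tuple(c.shape) for c in cores])
    # [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]

Each loop iteration is a reshape plus a matrix factorization; on a GPU these map to large matrix-multiply and SVD kernels, which is where tensor-core execution and reduced host-device transfer pay off.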

This technology has been validated on an NVIDIA DGX-2 system with A100 GPUs, showing multi-fold speedups for tensor-train decomposition (single- and multi-GPU), faster tensor-train neural-network layers with strong compression, and large speedups for DMRG compared with common baseline libraries.

Applications:

  • Faster tensor-train decomposition
  • Multi-GPU decomposition of high-order tensors
  • Compress fully connected neural-network layers (see the parameter-count sketch after this list)
  • Speed up inference/training of tensor-train layers
  • Accelerate DMRG-based tensor-network simulations
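
For the layer-compression application, the arithmetic behind the compression can be sketched as follows (a Python sketch; the mode factorizations and uniform rank are illustrative assumptions, and the resulting ratio is arithmetic, not a measured result from this listing):

    import math

    def tt_layer_params(in_modes, out_modes, rank):
        """Parameter count for a TT-matrix layer whose dense weight has
        shape (prod(out_modes), prod(in_modes)). Core k holds
        r_{k-1} * out_modes[k] * in_modes[k] * r_k entries, with boundary
        ranks fixed to 1. Illustrative sketch with a uniform rank."""
        d = len(in_modes)
        ranks = [1] + [rank] * (d - 1) + [1]
        return sum(ranks[k] * out_modes[k] * in_modes[k] * ranks[k + 1]
                   for k in range(d))

    in_modes = out_modes = [4, 8, 8, 4]        # factor a 1024 x 1024 layer
    tt = tt_layer_params(in_modes, out_modes, rank=8)
    dense = math.prod(in_modes) * math.prod(out_modes)
    print(f"TT: {tt:,} vs dense: {dense:,} ({dense / tt:.0f}x fewer)")
    # TT: 8,448 vs dense: 1,048,576 (124x fewer)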

Advantages:

  • Faster tensor-train computations
  • Scales to larger, higher-order tensors
  • Better GPU hardware utilization
  • Less overhead from data movement
  • Strong compression with small accuracy loss
  • Faster tensor network simulation workloads

Lead Inventor:

Xiaodong Wang, Ph.D.

Patent Information:

Patent Pending

Tech Ventures Reference:

CU24167

Quick Facts:

Tags: Computational science, Graphics processing unit, Hardware acceleration, Machine learning, Tensor
Inventors: Xiaodong Wang, Xiaoyang Liu
Manager: Greg Maskel
Departments: Electrical Engineering
Divisions: Fu Foundation School of Engineering and Applied Science (SEAS)
Reference Number: CU24167
Release Date: 2026-03-06