High-performance tensor-train primitives on graphics processors

This technology is a GPU tensor-core-accelerated software toolkit for tensor-train operations, data transfer, and multi-GPU scheduling that improves the performance of tensor-train decomposition, tensor-train neural-network layers, and density matrix renormalization group (DMRG) calculations at scale.

Unmet Need: Scalable, fast tensor-train computation for large tensors

Many modern problems in machine learning, scientific computing, and data analysis rely on very large, high-dimensional tensors. Working with these tensors directly quickly becomes impractical because the computation and memory required for a full tensor grow exponentially with its order (the number of dimensions). As a result, useful tensor methods can be too slow, too memory-hungry, or too difficult to scale across modern hardware for real-world datasets and models. There is a need for implementations that make these tensor operations efficient, hardware-optimized, and scalable without sacrificing usable accuracy.
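
To make that scaling concrete, here is a short back-of-the-envelope comparison in Python (the mode size n, order d, and uniform TT rank r below are illustrative assumptions, not figures from this listing): a full tensor stores n^d entries, while a tensor-train representation needs only about d·n·r² entries.

    # Illustrative scaling only; n, d, and r are assumed values.
    n, d, r = 10, 20, 8                       # mode size, order, TT rank
    full_entries = n ** d                     # exponential in the order d
    tt_entries = d * n * r * r                # roughly linear in d
    print(f"full tensor: {full_entries:.2e} entries")  # 1.00e+20
    print(f"TT format:   {tt_entries:,} entries")      # 12,800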

The Technology: Tensor-core acceleration for tensor-train algorithms and layers

This technology is a GPU-optimized software approach that speeds up tensor-train (TT) computations by improving how the core math routines and data movement are executed, enabling TT decomposition, TT-based neural-network compression, and DMRG-style tensor-network workloads to run faster and scale to larger problems. It targets the operations that typically dominate runtime in TT workflows, namely repeated tensor contractions and matrix factorizations such as the SVD, and implements them to better match modern GPU hardware. It also reduces time spent moving data between the CPU and GPU and supports splitting large computations across multiple GPUs when a single device would otherwise limit performance. Overall, it is a performance-focused implementation layer intended to make TT methods practical for larger-scale engineering and research workloads.
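
As an illustration of the kind of routine involved, below is a minimal sketch of the classical TT-SVD decomposition written in PyTorch: a left-to-right sweep of reshapes and truncated SVDs that peels off one TT core per mode. The function tt_svd and its device fallback are hypothetical, for illustration only; this is not the toolkit's API, which is not disclosed in this listing.

    import torch

    def tt_svd(x, max_rank, device=None):
        """Sketch of classical TT-SVD: sweep left to right, unfolding the
        tensor and applying truncated SVDs to peel off one TT core per
        mode. Hypothetical helper, not the toolkit's API."""
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        x = x.to(device)
        dims = list(x.shape)
        cores, r_prev = [], 1
        for n_k in dims[:-1]:
            # Unfold so rows pair the incoming rank with the current mode.
            x = x.reshape(r_prev * n_k, -1)
            u, s, vh = torch.linalg.svd(x, full_matrices=False)
            r = min(max_rank, s.shape[0])       # truncate to the target rank
            cores.append(u[:, :r].reshape(r_prev, n_k, r))
            x = s[:r, None] * vh[:r, :]         # remainder carried forward
            r_prev = r
        cores.append(x.reshape(r_prev, dims[-1], 1))
        return cores

    cores = tt_svd(torch.randn(8, 8, 8, 8), max_rank=4)
    print([tuple(c.shape) for c in cores])
    # [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]

Each loop iteration is a reshape plus a matrix factorization; on a GPU these map to large matrix-multiply and SVD kernels, which is where tensor-core execution and reduced host-device transfer pay off.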

This technology has been validated on an NVIDIA DGX-2 system with A100 GPUs, showing multi-fold speedups for tensor-train decomposition (single- and multi-GPU), faster tensor-train neural-network layers with strong compression, and large speedups for DMRG compared with common baseline libraries.

Applications:

  • Faster tensor-train decomposition
  • Multi-GPU decomposition of high-order tensors
  • Compress fully connected neural-network layers (see the parameter-count sketch after this list)
  • Speed up inference/training of tensor-train layers
  • Accelerate DMRG-based tensor-network simulations
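
For the layer-compression application, the arithmetic behind the compression can be sketched as follows (a Python sketch; the mode factorizations and uniform rank are illustrative assumptions, and the resulting ratio is arithmetic, not a measured result from this listing):

    import math

    def tt_layer_params(in_modes, out_modes, rank):
        """Parameter count for a TT-matrix layer whose dense weight has
        shape (prod(out_modes), prod(in_modes)). Core k holds
        r_{k-1} * out_modes[k] * in_modes[k] * r_k entries, with boundary
        ranks fixed to 1. Illustrative sketch with a uniform rank."""
        d = len(in_modes)
        ranks = [1] + [rank] * (d - 1) + [1]
        return sum(ranks[k] * out_modes[k] * in_modes[k] * ranks[k + 1]
                   for k in range(d))

    in_modes = out_modes = [4, 8, 8, 4]        # factor a 1024 x 1024 layer
    tt = tt_layer_params(in_modes, out_modes, rank=8)
    dense = math.prod(in_modes) * math.prod(out_modes)
    print(f"TT: {tt:,} vs dense: {dense:,} ({dense / tt:.0f}x fewer)")
    # TT: 8,448 vs dense: 1,048,576 (124x fewer)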

Advantages:

  • Faster tensor-train computations
  • Scales to larger, higher-order tensors
  • Better GPU hardware utilization
  • Less overhead from data movement
  • Strong compression with small accuracy loss
  • Faster tensor network simulation workloads

Lead Inventor:

Xiaodong Wang, Ph.D.

Patent Information:

Patent Pending

Tech Ventures Reference:

CU24167

Quick Facts:

Tags: Computational science, Graphics processing unit, Hardware acceleration, Machine learning, Tensor
Inventors: Xiaodong Wang, Xiaoyang Liu
Manager: Greg Maskel
Departments: Electrical Engineering
Divisions: Fu Foundation School of Engineering and Applied Science (SEAS)
Reference Number: CU24167
Release Date: 2026-03-06