torch-sla (Torch Sparse Linear Algebra) is a Python library that provides differentiable sparse linear equation solvers for PyTorch. It solves systems of the form Ax = b where A is a sparse matrix, with full support for automatic differentiation (autograd) and GPU acceleration via CUDA.

How do I solve a sparse linear system in PyTorch?

Use torch-sla's SparseTensor class: from torch_sla import SparseTensor; A = SparseTensor(values, row, col, shape); x = A.solve(b). This works on both CPU and GPU, and supports gradient computation.

What sparse solvers does torch-sla support?

torch-sla supports multiple backends: CPU solvers include SciPy (LU, UMFPACK, CG, BiCGStab, GMRES, MINRES) and PyTorch-native (CG, BiCGStab, GMRES, MINRES). GPU solvers include PyTorch-native (CG, BiCGStab, GMRES, MINRES) and cuDSS (LU, Cholesky, LDLT). The library automatically selects the best solver based on hardware and matrix properties.

Can I compute gradients through sparse solve in PyTorch?

Yes, torch-sla fully supports PyTorch autograd. You can set requires_grad=True on your sparse matrix values, solve the system, compute a loss, and call backward() to get gradients with respect to the matrix values and right-hand side.

How do I solve batched sparse systems in PyTorch?

torch-sla supports batched solving for matrices with the same sparsity pattern. Create a SparseTensor with batched values of shape [batch_size, nnz] and solve all systems in parallel. For matrices with different patterns, use SparseTensorList.

How do I use torch-sla on GPU?

Simply call .cuda() on your SparseTensor: A_cuda = A.cuda(); x = A_cuda.solve(b.cuda()). The library automatically uses cuDSS for GPU-accelerated solving.

What is the difference between torch-sla and scipy.sparse?

torch-sla offers native PyTorch integration, GPU support via CUDA, full autograd gradient support, and batched solving. scipy.sparse is CPU-only, requires data copying to/from PyTorch, and doesn't support automatic differentiation.

How do I install torch-sla?

Install via pip: pip install torch-sla. For GPU support, ensure you have CUDA installed and a compatible PyTorch version.

DSparseTensor¶

DSparseTensor is the distributed counterpart of SparseTensor. It partitions the rows of a sparse matrix across the ranks of a DeviceMesh using domain decomposition: each rank owns a contiguous block of rows and a halo of ghost columns referenced by its rows but owned elsewhere. Vectors stay rank-local; the only communication is a halo exchange before each matrix–vector product and an all-reduce for the inner products inside Krylov iterations. The wrapper mirrors torch.distributed.tensor.DTensor and reuses the single-process operations rank-locally, so the operation vocabulary matches SparseTensor.

Partitioning¶

A distributed tensor is built from a global SparseTensor and a mesh. Three entry points cover the common layouts:

from torch.distributed.device_mesh import init_device_mesh
from torch_sla import SparseTensor, DSparseTensor

mesh = init_device_mesh("cpu", (world_size,))
A = SparseTensor(val, row, col, (n, n))

# Row-partition a single matrix across the mesh (RowPartitioned placement)
D = DSparseTensor.partition(A, mesh, partition_method="metis")

# Partition a batch of same-pattern matrices, one shard per rank (BatchShard)
D = DSparseTensor.partition_batch(A_batched, mesh)

# Re-wrap shards that were already produced per-rank (no re-partitioning)
D = DSparseTensor.from_global_distributed(local_shard, spec, mesh)

partition_method selects how rows are assigned: "simple" (contiguous blocks), "metis" (graph partition, minimizes halo), or "coordinates" (geometric, needs coords). The owned/halo bookkeeping is computed once and cached on the placement.

Scatter / gather move a global vector in and out of the sharded layout:

d = D.scatter(global_vec)     # global torch.Tensor -> DTensor[Shard(0)]
y = (D @ d).full_tensor()     # gather a sharded result back to global

Distributed solves and the sugar API¶

The *_shard methods are the primitives: they operate entirely in Shard(0) space (each vector sized to the rank’s owned rows) and drive the Krylov iteration with halo-exchange SpMV plus all-reduce dot products. On top of them sits a thin sugar layer whose names and signatures match SparseTensor, so single-process code ports with minimal change:

Sugar method	Delegates to	Mirror of
`solve()`	`solve_distributed_shard`	`SparseTensor.solve`
`solve_batch()`	`solve_batch_shard`	`SparseTensor.solve_batch`
`nonlinear_solve()`	`nonlinear_solve_distributed_shard`	`SparseTensor.nonlinear_solve`
`connected_components()`	`connected_components_shard`	`SparseTensor.connected_components`
`lsqr()` / `lsmr()`	`lsqr_shard` / `lsmr_shard`	`spsolve(method='lsqr'/'lsmr')`

Every distributed result is rank-invariant: the same global solution, eigenvalue or component labelling regardless of world size or partition method.

b = D.scatter(global_b)
x = D.solve(b)                       # distributed CG, DTensor in / DTensor out
x_global = x.full_tensor()           # gather to a single rank

Halo-exchange SpMV¶

The matrix–vector product D @ x (and the matvec inside every solve) is the one place ranks communicate during a kernel. Before multiplying, each rank fills its halo entries with the owned values from neighboring ranks via a point-to-point halo exchange, then runs a purely local SpMV over its owned rows. This keeps memory and compute per rank proportional to its share of the matrix; see distributed matvec (halo-exchange SpMV).

Operation catalog¶

The distributed operations cross-reference their single-process equivalents on SparseTensor. Operations not listed (det, logdet, svd, condition_number) gather to one rank before computing and exist for convenience rather than scale.

Operation	API	Single-process equivalent
partition	`partition()`	construction from a global `SparseTensor`
solve	`solve()`	SparseTensor.solve
solve_batch	`solve_batch()`	SparseTensor.solve_batch
nonlinear_solve	`nonlinear_solve()`	SparseTensor.nonlinear_solve
matvec / @	`__matmul__()`	SparseTensor.matvec
eigsh	`eigsh()`	SparseTensor.eigsh
connected_components	`connected_components()`	SparseTensor.connected_components
lsqr / lsmr	`lsqr()`	`spsolve(method='lsqr'/'lsmr')`

Save / load of a sharded tensor (one file per partition) is handled by save() / load() and the functional save_distributed() / load_partition().