DSparseTensor

DSparseTensor is the distributed counterpart of SparseTensor. It partitions the rows of a sparse matrix across the ranks of a DeviceMesh using domain decomposition: each rank owns a contiguous block of rows and a halo of ghost columns referenced by its rows but owned elsewhere. Vectors stay rank-local; the only communication is a halo exchange before each matrix–vector product and an all-reduce for the inner products inside Krylov iterations. The wrapper mirrors torch.distributed.tensor.DTensor and reuses the single-process operations rank-locally, so the operation vocabulary matches SparseTensor.

Partitioning

A distributed tensor is built from a global SparseTensor and a mesh. Three entry points cover the common layouts:

from torch.distributed.device_mesh import init_device_mesh
from torch_sla import SparseTensor, DSparseTensor

# world_size is the number of ranks, e.g. int(os.environ["WORLD_SIZE"]) under torchrun
mesh = init_device_mesh("cpu", (world_size,))
A = SparseTensor(val, row, col, (n, n))

# Row-partition a single matrix across the mesh (RowPartitioned placement)
D = DSparseTensor.partition(A, mesh, partition_method="metis")

# Partition a batch of same-pattern matrices, one shard per rank (BatchShard)
D = DSparseTensor.partition_batch(A_batched, mesh)

# Re-wrap shards that were already produced per-rank (no re-partitioning)
D = DSparseTensor.from_global_distributed(local_shard, spec, mesh)

partition_method selects how rows are assigned to ranks: "simple" (contiguous blocks), "metis" (graph partitioning that aims to minimize the halo), or "coordinates" (geometric partitioning, requires coords). The owned/halo bookkeeping is computed once and cached on the placement.
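
The owned/halo bookkeeping can be illustrated without the library. The following is a minimal pure-Python sketch, assuming a contiguous block split like the "simple" method and a matrix given as COO index lists; block_row_ranges and halo_columns are hypothetical helper names for illustration, not part of torch_sla.

```python
def block_row_ranges(n_rows, world_size):
    """Split range(n_rows) into contiguous [lo, hi) blocks whose sizes differ by at most one."""
    base, extra = divmod(n_rows, world_size)
    ranges, lo = [], 0
    for rank in range(world_size):
        hi = lo + base + (1 if rank < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def halo_columns(row, col, lo, hi):
    """Columns referenced by the owned rows [lo, hi) that lie outside the owned block."""
    return sorted({c for r, c in zip(row, col) if lo <= r < hi and not (lo <= c < hi)})


# Sparsity pattern of a tridiagonal (1D Laplacian) matrix on 6 unknowns, in COO form.
n = 6
row, col = [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            row.append(i)
            col.append(j)

ranges = block_row_ranges(n, 3)
halos = [halo_columns(row, col, lo, hi) for lo, hi in ranges]
for rank, ((lo, hi), halo) in enumerate(zip(ranges, halos)):
    print(f"rank {rank}: owns rows [{lo}, {hi}), halo columns {halo}")
```

For this pattern each interior rank needs one ghost column from each neighbor, which is why a partition with fewer cut edges leads to a smaller halo.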

scatter() moves a global vector into the sharded layout, and full_tensor() gathers a sharded result back to a global tensor:

d = D.scatter(global_vec)     # global torch.Tensor -> DTensor[Shard(0)]
y = (D @ d).full_tensor()     # gather a sharded result back to global
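
Conceptually, scatter and gather are inverse slicing operations over the row ownership. The sketch below simulates them on plain Python lists, assuming contiguous ownership ranges; the scatter and gather functions and the ranges shown are illustrative stand-ins, not the library's implementation. A non-contiguous assignment would use a per-rank index list in place of the slices.

```python
def scatter(global_vec, ranges):
    """Slice a global vector into one owned-rows shard per rank."""
    return [global_vec[lo:hi] for lo, hi in ranges]


def gather(shards):
    """Concatenate per-rank shards back into a single global vector."""
    return [v for shard in shards for v in shard]


ranges = [(0, 2), (2, 5), (5, 6)]  # hypothetical row ownership for 3 ranks
global_vec = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

shards = scatter(global_vec, ranges)  # each shard is sized to its rank's owned rows
roundtrip = gather(shards)            # recovers the original ordering
```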

Distributed solves and the sugar API

The *_shard methods are the primitives: they operate entirely in Shard(0) space (each vector sized to the rank’s owned rows) and drive the Krylov iteration with halo-exchange SpMV plus all-reduce dot products. On top of them sits a thin sugar layer whose names and signatures match SparseTensor, so single-process code ports with minimal change:

Sugar method              Delegates to                         Mirror of
------------------------  -----------------------------------  ---------------------------------
solve()                   solve_distributed_shard              SparseTensor.solve
solve_batch()             solve_batch_shard                    SparseTensor.solve_batch
nonlinear_solve()         nonlinear_solve_distributed_shard    SparseTensor.nonlinear_solve
connected_components()    connected_components_shard           SparseTensor.connected_components
lsqr() / lsmr()           lsqr_shard / lsmr_shard              spsolve(method='lsqr'/'lsmr')

Every distributed result is rank-invariant: the global solution, eigenvalue, or component labelling is the same regardless of world size or partition method.

b = D.scatter(global_b)
x = D.solve(b)                       # distributed CG, DTensor in / DTensor out
x_global = x.full_tensor()           # gather the full solution (DTensor.full_tensor() returns it on every rank)
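
To make the communication pattern of a sharded Krylov solve concrete, here is a minimal single-process simulation of conjugate gradients in which every vector is a list of per-rank shards. It is a sketch under stated assumptions, not the library's solver: matvec_shards reads the concatenated vector as a stand-in for the halo exchange, allreduce_dot sums per-rank partial dot products as a stand-in for an all-reduce, and all function names are hypothetical.

```python
def flatten(shards):
    return [v for s in shards for v in s]


def matvec_shards(entries, x_shards, ranges):
    """Per-rank SpMV over owned rows; reading the full vector stands in for the halo exchange."""
    x = flatten(x_shards)
    out = []
    for lo, hi in ranges:
        y = [0.0] * (hi - lo)
        for r, c, v in entries:
            if lo <= r < hi:
                y[r - lo] += v * x[c]
        out.append(y)
    return out


def allreduce_dot(a_shards, b_shards):
    """Sum of per-rank partial dot products (stand-in for an all-reduce)."""
    return sum(sum(p * q for p, q in zip(a, b)) for a, b in zip(a_shards, b_shards))


def cg_sharded(entries, b_shards, ranges, tol=1e-12, max_iter=100):
    """Conjugate gradients for an SPD matrix, with every vector kept as per-rank shards."""
    x = [[0.0] * len(b) for b in b_shards]
    r = [list(b) for b in b_shards]  # residual b - A x with x = 0
    p = [list(b) for b in b_shards]  # search direction
    rs = allreduce_dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec_shards(entries, p, ranges)
        alpha = rs / allreduce_dot(p, Ap)
        for xs, rsd, ps, aps in zip(x, r, p, Ap):
            for i in range(len(xs)):
                xs[i] += alpha * ps[i]
                rsd[i] -= alpha * aps[i]
        rs_new = allreduce_dot(r, r)
        beta = rs_new / rs
        for ps, rsd in zip(p, r):
            for i in range(len(ps)):
                ps[i] = rsd[i] + beta * ps[i]
        rs = rs_new
    return x


# SPD test problem: tridiagonal (-1, 2, -1) matrix, right-hand side chosen so that x = [1, ..., 6].
n = 6
entries = [(i, j, 2.0 if i == j else -1.0)
           for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n]
b = [0.0, 0.0, 0.0, 0.0, 0.0, 7.0]


def solve_with(ranges):
    b_shards = [b[lo:hi] for lo, hi in ranges]
    return flatten(cg_sharded(entries, b_shards, ranges))


x1 = solve_with([(0, 6)])                  # one rank
x3 = solve_with([(0, 2), (2, 4), (4, 6)])  # three ranks
```

In this sketch the one-rank and three-rank runs agree to floating-point tolerance rather than bit for bit, because the partial sums in the dot products are accumulated in a different order.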

See also

Distributed solve scaling – how to benchmark weak / strong / throughput scaling of the distributed solve with torchrun, what the metrics mean, and how to extend it. Script: benchmarks/distributed/scaling/distributed_solve_scaling.py.

Halo-exchange SpMV

The matrix–vector product D @ x (and the matvec inside every solve) is the one place ranks communicate during a kernel. Before multiplying, each rank fills its halo entries with the owned values from neighboring ranks via a point-to-point halo exchange, then runs a purely local SpMV over its owned rows. This keeps memory and compute per rank proportional to its share of the matrix; see distributed matvec (halo-exchange SpMV).
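
The two phases described above, halo fill followed by a purely local SpMV, can be simulated in a single process. The sketch below assumes contiguous row ownership and COO input; halo_exchange_spmv and owner_of are hypothetical names, and a dictionary lookup into the owning rank's shard stands in for the point-to-point messages.

```python
def spmv_global(row, col, val, x):
    """Reference single-process SpMV in COO form."""
    y = [0.0] * len(x)
    for r, c, v in zip(row, col, val):
        y[r] += v * x[c]
    return y


def owner_of(c, ranges):
    """Rank whose owned block [lo, hi) contains global index c."""
    for rank, (lo, hi) in enumerate(ranges):
        if lo <= c < hi:
            return rank
    raise ValueError(f"index {c} is not owned by any rank")


def halo_exchange_spmv(row, col, val, x_shards, ranges):
    y_shards = []
    for rank, (lo, hi) in enumerate(ranges):
        # Entries of the rows this rank owns.
        local = [(r, c, v) for r, c, v in zip(row, col, val) if lo <= r < hi]
        # Phase 1, halo exchange: fetch each ghost value from the rank that owns it.
        halo = {}
        for _, c, _ in local:
            if not (lo <= c < hi):
                src = owner_of(c, ranges)
                halo[c] = x_shards[src][c - ranges[src][0]]
        # Phase 2, local SpMV: only owned and halo values are read.
        y = [0.0] * (hi - lo)
        for r, c, v in local:
            xc = x_shards[rank][c - lo] if lo <= c < hi else halo[c]
            y[r - lo] += v * xc
        y_shards.append(y)
    return y_shards


# Tridiagonal (-1, 2, -1) matrix on 6 unknowns, split over 3 simulated ranks.
n = 6
row, col, val = [], [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            row.append(i)
            col.append(j)
            val.append(2.0 if i == j else -1.0)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ranges = [(0, 2), (2, 4), (4, 6)]
x_shards = [x[lo:hi] for lo, hi in ranges]

y_shards = halo_exchange_spmv(row, col, val, x_shards, ranges)
y_distributed = [v for s in y_shards for v in s]
y_reference = spmv_global(row, col, val, x)
```

Each simulated rank touches only its owned shard plus the ghost values it fetched, which is the property that keeps per-rank memory proportional to its share of the matrix.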

Operation catalog

The distributed operations cross-reference their single-process equivalents on SparseTensor. Operations not listed (det, logdet, svd, condition_number) gather to one rank before computing and exist for convenience rather than scale.

Save / load of a sharded tensor (one file per partition) is handled by save() / load() and the functional save_distributed() / load_partition().