DSparseTensor
DSparseTensor is the distributed counterpart of
SparseTensor. It partitions the rows of a sparse matrix
across the ranks of a DeviceMesh using domain decomposition: each rank owns
a contiguous block of rows and a halo of ghost columns referenced by its rows
but owned elsewhere. Vectors stay rank-local; the only communication is a
halo exchange before each matrix–vector product and an all-reduce for the
inner products inside Krylov iterations. The wrapper mirrors
torch.distributed.tensor.DTensor and reuses the single-process operations
rank-locally, so the operation vocabulary matches SparseTensor.
Partitioning
A distributed tensor is built from a global SparseTensor
and a mesh. Three entry points cover the common layouts:
from torch.distributed.device_mesh import init_device_mesh
from torch_sla import SparseTensor, DSparseTensor
mesh = init_device_mesh("cpu", (world_size,))
A = SparseTensor(val, row, col, (n, n))
# Row-partition a single matrix across the mesh (RowPartitioned placement)
D = DSparseTensor.partition(A, mesh, partition_method="metis")
# Partition a batch of same-pattern matrices, one shard per rank (BatchShard)
D = DSparseTensor.partition_batch(A_batched, mesh)
# Re-wrap shards that were already produced per-rank (no re-partitioning)
D = DSparseTensor.from_global_distributed(local_shard, spec, mesh)
partition_method selects how rows are assigned: "simple" (contiguous
blocks), "metis" (graph partition, minimizes halo), or "coordinates"
(geometric, needs coords). The owned/halo bookkeeping is computed once and
cached on the placement.
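The owned/halo split can be illustrated without any distributed runtime. The following is a minimal pure-Python sketch, assuming a contiguous-block assignment in the spirit of the "simple" method; the helper names `contiguous_row_blocks` and `halo_columns` are hypothetical and are not part of torch_sla.

```python
from typing import List


def contiguous_row_blocks(n_rows: int, world_size: int) -> List[range]:
    """Split rows 0..n_rows-1 into contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_rows, world_size)
    blocks, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def halo_columns(rows: List[int], cols: List[int], owned: range) -> List[int]:
    """Columns referenced by the owned rows but lying outside the owned block."""
    return sorted({c for r, c in zip(rows, cols) if r in owned and c not in owned})


# COO pattern of a tridiagonal (1-D Laplacian) matrix with n = 6
n = 6
rows, cols = [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            rows.append(i)
            cols.append(j)

blocks = contiguous_row_blocks(n, 3)
halos = [halo_columns(rows, cols, b) for b in blocks]

for rank, (b, h) in enumerate(zip(blocks, halos)):
    print(f"rank {rank}: owned rows {list(b)}, halo columns {h}")
```

For this tridiagonal pattern each interior rank needs one ghost column on either side. A graph partitioner aims to keep these halo lists short for less regular sparsity patterns.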
Scatter / gather move a global vector in and out of the sharded layout:
d = D.scatter(global_vec) # global torch.Tensor -> DTensor[Shard(0)]
y = (D @ d).full_tensor() # gather a sharded result back to global
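Conceptually, scatter and gather are index bookkeeping over the row ownership. The sketch below shows that round trip in plain Python for a non-contiguous ownership map, as a graph partitioner might produce. The functions `scatter` and `gather` here are hypothetical stand-ins; the library's `D.scatter` returns a DTensor rather than Python lists.

```python
from typing import List


def scatter(global_vec: List[float], owned_rows: List[List[int]]) -> List[List[float]]:
    """Give each simulated rank the entries of global_vec for the rows it owns."""
    return [[global_vec[i] for i in rows] for rows in owned_rows]


def gather(shards: List[List[float]], owned_rows: List[List[int]], n: int) -> List[float]:
    """Write every shard back to its global row positions."""
    out = [0.0] * n
    for rows, shard in zip(owned_rows, shards):
        for i, value in zip(rows, shard):
            out[i] = value
    return out


# Assumed ownership map: rank 0 owns rows 0 and 3, rank 1 owns 1 and 4, rank 2 owns 2 and 5
owned_rows = [[0, 3], [1, 4], [2, 5]]
global_vec = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

shards = scatter(global_vec, owned_rows)
roundtrip = gather(shards, owned_rows, len(global_vec))
```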
Distributed solves and the sugar API
The *_shard methods are the primitives: they operate entirely in
Shard(0) space (each vector sized to the rank’s owned rows) and drive the
Krylov iteration with halo-exchange SpMV plus all-reduce dot products. On top
of them sits a thin sugar layer whose names and signatures match
SparseTensor, so single-process code ports with minimal
change:
| Sugar method | Delegates to | Mirror of |
|---|---|---|
Every distributed result is rank-invariant: the same global solution, eigenvalue or component labelling regardless of world size or partition method.
b = D.scatter(global_b)
x = D.solve(b) # distributed CG, DTensor in / DTensor out
x_global = x.full_tensor() # gather the full global vector
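The structure of a sharded Krylov solve can be sketched in a single process by simulating the ranks as a list of shards. The example below is an illustration under stated assumptions, not the library's implementation: the matrix is the tridiagonal tridiag(-1, 2, -1), the all-reduce is a plain sum over the simulated ranks, the halo exchange is replaced by a lookup into the neighbouring shard, and every function name is hypothetical.

```python
from typing import List, Tuple

Shards = List[List[float]]  # one list of owned entries per simulated rank


def laplacian_row(i: int, n: int) -> List[Tuple[int, float]]:
    """Nonzeros (col, val) of row i of the SPD matrix tridiag(-1, 2, -1)."""
    return [(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]


def all_reduce_sum(per_rank: List[float]) -> float:
    """Stand-in for an all-reduce: every rank would receive this same sum."""
    return sum(per_rank)


def dot(u: Shards, v: Shards) -> float:
    """Global inner product: all-reduce of the rank-local partial dot products."""
    return all_reduce_sum([sum(a * b for a, b in zip(us, vs)) for us, vs in zip(u, v)])


def matvec(blocks: List[range], n: int, x: Shards) -> Shards:
    """Owned rows of A @ x; off-block reads mark where a halo exchange would occur."""
    flat = [v for shard in x for v in shard]  # valid: blocks are contiguous and ordered
    return [[sum(val * flat[j] for j, val in laplacian_row(i, n)) for i in block]
            for block in blocks]


def axpy(alpha: float, x: Shards, y: Shards) -> Shards:
    """Rank-local update y + alpha * x (no communication)."""
    return [[yi + alpha * xi for xi, yi in zip(xs, ys)] for xs, ys in zip(x, y)]


def cg(blocks: List[range], n: int, b: Shards, tol: float = 1e-12,
       max_iter: int = 200) -> Shards:
    """Conjugate gradients where every vector stays split by owned rows."""
    x = [[0.0] * len(s) for s in b]
    r = [list(s) for s in b]
    p = [list(s) for s in b]
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(blocks, n, p)
        alpha = rs / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)  # p = r + beta * p
        rs = rs_new
    return x


def split(vec: List[float], blocks: List[range]) -> Shards:
    return [[vec[i] for i in block] for block in blocks]


n = 6
b_global = [0.0, 0.0, 0.0, 0.0, 0.0, 7.0]  # equals A @ [1, 2, 3, 4, 5, 6]
blocks2 = [range(0, 3), range(3, 6)]
blocks3 = [range(0, 2), range(2, 4), range(4, 6)]
x2 = [v for s in cg(blocks2, n, split(b_global, blocks2)) for v in s]
x3 = [v for s in cg(blocks3, n, split(b_global, blocks3)) for v in s]
```

Solving with two and with three simulated ranks yields the same global solution up to floating-point rounding, because only the grouping of the partial sums changes.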
See also
Distributed solve scaling – how to benchmark weak, strong, and
throughput scaling of the distributed solve with torchrun, what
the metrics mean, and how to extend the benchmark. Script:
benchmarks/distributed/scaling/distributed_solve_scaling.py.
Halo-exchange SpMV
The matrix–vector product D @ x (and the matvec inside every solve) is the
one place ranks communicate during a kernel. Before multiplying, each rank
fills its halo entries with the owned values from neighboring ranks via a
point-to-point halo exchange, then runs a purely local SpMV over its owned
rows. This keeps memory and compute per rank proportional to its share of the
matrix; see distributed matvec (halo-exchange SpMV).
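The two phases described above (fill the halo, then multiply locally) can be sketched in one process with simulated ranks. This is an illustrative sketch, assuming contiguous row blocks and the tridiagonal matrix tridiag(-1, 2, -1); the dictionary lookups stand in for point-to-point messages, and all helper names are hypothetical.

```python
from typing import Dict, List, Tuple

RowMap = Dict[int, List[Tuple[int, float]]]


def build_rank(n: int, owned: range) -> Tuple[RowMap, List[int]]:
    """Local rows of tridiag(-1, 2, -1) plus the halo column list for one rank."""
    rows = {i: [(j, 2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
            for i in owned}
    halo = sorted({j for entries in rows.values() for j, _ in entries if j not in owned})
    return rows, halo


def owner_of(col: int, blocks: List[range]) -> int:
    """Rank whose owned block contains the given global column."""
    return next(rank for rank, block in enumerate(blocks) if col in block)


def halo_exchange(blocks: List[range], halos: List[List[int]],
                  x_shards: List[List[float]]) -> List[Dict[int, float]]:
    """Each rank receives the values of its halo columns from the owning ranks."""
    received = []
    for halo in halos:
        values = {}
        for col in halo:
            src = owner_of(col, blocks)
            values[col] = x_shards[src][col - blocks[src].start]  # "message" from src
        received.append(values)
    return received


def local_spmv(rows: RowMap, owned: range, x_owned: List[float],
               halo_vals: Dict[int, float]) -> List[float]:
    """Purely local product over the owned rows, reading owned or halo entries."""
    def value(j: int) -> float:
        return x_owned[j - owned.start] if j in owned else halo_vals[j]
    return [sum(v * value(j) for j, v in rows[i]) for i in owned]


n = 6
blocks = [range(0, 3), range(3, 6)]
local = [build_rank(n, b) for b in blocks]
halos = [halo for _, halo in local]
x_shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

halo_vals = halo_exchange(blocks, halos, x_shards)  # communication phase
y_shards = [local_spmv(rows, b, x, h)               # local compute phase
            for (rows, _), b, x, h in zip(local, blocks, x_shards, halo_vals)]
```

Each simulated rank stores only its own rows and receives only the entries listed in its halo, so the data moved per product scales with the halo size rather than with the full vector length.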
Operation catalog
The distributed operations cross-reference their single-process equivalents on
SparseTensor. Operations not listed (det, logdet,
svd, condition_number) gather to one rank before computing and exist
for convenience rather than scale.
| Operation | API | Single-process equivalent |
|---|---|---|
| construction from a global | | |
Save / load of a sharded tensor (one file per partition) is handled by the
methods save() / load() and by the functional save_distributed() /
load_partition().