Support different NSE in batches of CSR and CSC tensors #84843

pearu · 2022-09-11T22:04:28Z

This PR enables batched CSR/CSC tensors whose batches may have different NSE (number of specified elements) counts.

For instance, with the current master we have

>>> a = torch.tensor([[[1, 2], [3, 4]], [[0, 12], [21, 0]]])
>>> a.to_sparse_csr()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expect the same number of specified elements per batch.

because the NSE counts of the first and second batches differ: 4 and 2, respectively.
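To make the mismatch concrete, here is a minimal plain-Python sketch (no torch required) that counts the nonzero entries of each batch item in the example above. The helper name `nse_per_batch` is hypothetical and is not part of PyTorch.

```python
# Plain-Python illustration; `nse_per_batch` is a hypothetical helper, not a PyTorch API.
a = [[[1, 2], [3, 4]], [[0, 12], [21, 0]]]


def nse_per_batch(batches):
    """Return the number of nonzero entries in each 2-D batch item."""
    return [sum(1 for row in m for x in row if x != 0) for m in batches]


counts = nse_per_batch(a)
print(counts)  # [4, 2]
```

Since the two counts differ, master rejects the conversion.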

This PR implements a strided-to-sparse-CSR/CSC conversion algorithm that supports CSR/CSC batches with different NSE counts. For instance:

>>> a = torch.tensor([[[1, 2], [3, 4]], [[0, 12], [21, 0]]])
>>> b = a.to_sparse_csr()
>>> b
tensor(crow_indices=tensor([[0, 2, 4],
                            [0, 1, 2]]),
       col_indices=tensor([[0, 1, 0, 1],
                           [1, 0, 0, 0]]),
       values=tensor([[ 1,  2,  3,  4],
                      [12, 21,  0,  0]]), size=(2, 2, 2), nnz=4,
       layout=torch.sparse_csr)
>>> b[0]
tensor(crow_indices=tensor([0, 2, 4]),
       col_indices=tensor([0, 1, 0, 1]),
       values=tensor([1, 2, 3, 4]), size=(2, 2), nnz=4,
       layout=torch.sparse_csr)
>>> b[1]
tensor(crow_indices=tensor([0, 1, 2]),
       col_indices=tensor([1, 0]),
       values=tensor([12, 21]), size=(2, 2), nnz=2, layout=torch.sparse_csr)

that is, if the NSE of a batch is smaller than the maximum NSE over all batches, the corresponding rows of col_indices and values are padded with zeros as placeholders. Algorithms on batched CSR/CSC tensors must not access the padded parts of these tensors: they should take the last element of the corresponding crow_indices row as the batch's NSE, rather than the size of the NSE dimension of .values() (4 in the example above), which holds the maximum NSE over all batches.
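The padded layout described above can be sketched in plain Python with nested lists. This is only an illustration of the data layout and of reading the per-batch NSE from crow_indices; it is not the conversion algorithm implemented in this PR, and the helper names `to_padded_batched_csr` and `valid_part` are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical, not PyTorch APIs.
def to_padded_batched_csr(batches):
    """Convert equally shaped dense 2-D matrices to zero-padded batched CSR lists."""
    per_batch = []
    for m in batches:
        crow, col, val = [0], [], []
        for row in m:
            for j, x in enumerate(row):
                if x != 0:
                    col.append(j)
                    val.append(x)
            crow.append(len(col))  # running count of specified elements
        per_batch.append((crow, col, val))
    max_nse = max(len(col) for _, col, _ in per_batch)
    crow_indices = [crow for crow, _, _ in per_batch]
    # Pad shorter batches with zeros up to the maximum NSE over all batches.
    col_indices = [col + [0] * (max_nse - len(col)) for _, col, _ in per_batch]
    values = [val + [0] * (max_nse - len(val)) for _, _, val in per_batch]
    return crow_indices, col_indices, values


def valid_part(crow_indices, col_indices, values, b):
    """Return the unpadded col_indices/values of batch b."""
    nse = crow_indices[b][-1]  # last crow_indices element is the batch's NSE
    return col_indices[b][:nse], values[b][:nse]


a = [[[1, 2], [3, 4]], [[0, 12], [21, 0]]]
crow, col, val = to_padded_batched_csr(a)
print(crow)  # [[0, 2, 4], [0, 1, 2]]
print(col)   # [[0, 1, 0, 1], [1, 0, 0, 0]]
print(val)   # [[1, 2, 3, 4], [12, 21, 0, 0]]
print(valid_part(crow, col, val, 1))  # ([1, 0], [12, 21])
```

Slicing by the last crow_indices element reproduces the unpadded b[1] shown above, whereas slicing by the padded width would also pick up the placeholder zeros.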

Performance-wise, the strided-to-sparse-CSR/CSC conversion algorithms in master and in this PR are roughly equivalent:

# master branch:
In [2]: a = torch.rand(10, 10, 1000, 1000)

In [3]: a = torch.where(a==0, 0.1, a)  # required for master, optional for the PR

In [4]: %timeit a.to_sparse_csr()
2.25 s ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: a_cuda = a.cuda()

In [6]: %timeit a_cuda.to_sparse_csr()
55.2 ms ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# this PR
In [2]: a = torch.rand(10, 10, 1000, 1000)

In [3]: a = torch.where(a==0, 0.1, a)  # required for master, optional for the PR

In [4]: %timeit a.to_sparse_csr()
2.12 s ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: a_cuda = a.cuda()

In [6]: %timeit a_cuda.to_sparse_csr(); torch.cuda.synchronize()
47.2 ms ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With this PR, to_sparse_csr() on CUDA tensors takes about 15% less time (47.2 ms vs. 55.2 ms on master).
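As a quick check of that figure, the arithmetic on the two CUDA timings reported above can be written out as follows. Only the two mean timings come from the benchmark; the variable names are illustrative. Note that, as shown, only the PR measurement calls torch.cuda.synchronize(), so the comparison should be read as approximate.

```python
# Mean CUDA timings copied from the %timeit output above (milliseconds).
master_ms = 55.2
pr_ms = 47.2

# Relative reduction in run time, and the equivalent speedup factor.
time_reduction = (master_ms - pr_ms) / master_ms
speedup = master_ms / pr_ms

print(f"time reduction: {time_reduction:.1%}")  # time reduction: 14.5%
print(f"speedup: {speedup:.2f}x")               # speedup: 1.17x
```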

A strided-to-sparse-BSR/BSC conversion with variable NSE support will be implemented as a follow-up.

Stack from ghstack (oldest at bottom):

cc @nikitaved @cpuhrsch @amjames @bhosmer

[ghstack-poisoned]

pytorch-bot · 2022-09-11T22:04:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84843

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Failures, 1 Pending

As of commit 18f4bfc:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

ghstack-source-id: 19d3b9616b62846c14ef45a85ea4468db42e0836 Pull Request resolved: #84843

facebook-github-bot · 2022-10-04T00:24:14Z

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

linux-foundation-easycla · 2022-10-04T00:24:18Z

❌ - login: @pearu / name: Pearu Peterson . The commit (cf6565a, 27a47f1, 16bad3d, 90fe9e9, 18f4bfc) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

github-actions · 2022-12-03T03:34:47Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

pearu · 2023-12-12T17:41:33Z

Converting to draft, as the approach used in this PR requires further discussion and alternatives exist; see #104193.

Support different NSE in batches of CSR and CSC tensors

cf6565a

[ghstack-poisoned]

pytorch-bot bot added the release notes: sparse release notes category label Sep 11, 2022

pearu mentioned this pull request Sep 11, 2022

Opinfo based testing of torch.bmm with Strided, COO, CSR, and CSC samples. #84572

Closed

facebook-github-bot added the cla signed label Sep 11, 2022

pearu self-assigned this Sep 11, 2022

pearu added the module: sparse Related to torch.sparse label Sep 11, 2022

pearu added this to In progress in Sparse tensors via automation Sep 11, 2022

pytorchbot added the open source label Sep 11, 2022

Update on "Support different NSE in batches of CSR and CSC tensors"

27a47f1

[ghstack-poisoned]

pearu requested review from amjames, nikitaved and cpuhrsch September 12, 2022 08:08

pearu added 3 commits September 12, 2022 14:58

pearu added a commit that referenced this pull request Sep 19, 2022

Support different NSE in batches of CSR and CSC tensors

99f96a4

ghstack-source-id: 19d3b9616b62846c14ef45a85ea4468db42e0836 Pull Request resolved: #84843

github-actions bot added the Stale label Dec 3, 2022

pearu added no-stale and removed Stale labels Dec 8, 2022

This was referenced Jun 22, 2023

Add torch.cat support for torch native sparse tensors. (Need for PyG) #98947

Open

Conversion from strided to batched sparse compressed tensor with a non-constant number of zeros in batches fails #104193

Open

pearu mentioned this pull request Jun 26, 2023

Strided to batch BSR/BSC conversion fails when the number of zeros per block varies while the number of blocks per patch is constant #98495

Open

pearu marked this pull request as draft December 12, 2023 17:35
