
[SYCL] offload op #6217

Merged
merged 5 commits into master from sycl-offload-op on Mar 24, 2024

Conversation

airMeng (Collaborator) commented Mar 22, 2024

According to #5277 (reply in thread), this PR does the following:

  1. Leaves scheduling entirely to ggml_backend_sched.
  2. Removes all non-USM code along the way, since SYCL does not support registering host memory and using USM is recommended instead (see the sketch below).
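
The USM path relied on here is a standard SYCL 2020 feature: rather than registering (pinning) pageable host memory, device buffers are allocated directly with sycl::malloc_device and data is moved with explicit queue copies. The following is a minimal standalone sketch of that pattern, not the actual ggml-sycl code:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    const size_t n = 1024;
    std::vector<float> host(n, 1.0f);

    // device allocation via USM; no host-memory registration needed
    float * dev = sycl::malloc_device<float>(n, q);

    // explicit copies between pageable host memory and the USM device pointer
    q.memcpy(dev, host.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dev[i] *= 2.0f; }).wait();
    q.memcpy(host.data(), dev, n * sizeof(float)).wait();

    sycl::free(dev, q);
    return 0;
}
```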

Results:

$:~/llama.cpp/build$ ./bin/llama-bench -n 0 -ngl 0 -m ~/llama-2-7b.Q4_0.gguf -mmp 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       1.3|        384|    1024|     32|    12160962560|
| 1|    [opencl:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       3.0|        384|    1024|     32|    12160962560|
| 2|    [opencl:cpu:0]|     Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz|       3.0|          6|    8192|     64|    16498610176|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|          6|67108864|     64|    16498610176|
| model                          |       size |     params | backend    | ngl |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |   0 |          0 | pp 512     |    341.90 ± 0.31 |

build: 4b9f3b43 (2493)
$:~/llama.cpp/build$ ./bin/llama-bench -n 0 -ngl 33 -m ~/llama-2-7b.Q4_0.gguf -mmp 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       1.3|        384|    1024|     32|    12160962560|
| 1|    [opencl:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       3.0|        384|    1024|     32|    12160962560|
| 2|    [opencl:cpu:0]|     Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz|       3.0|          6|    8192|     64|    16498610176|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|          6|67108864|     64|    16498610176|
| model                          |       size |     params | backend    | ngl |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  33 |          0 | pp 512     |    851.59 ± 0.78 |

build: 4b9f3b43 (2493)

airMeng (Collaborator, Author) commented Mar 23, 2024

@slaren In fact I still don't quite understand. I think you want ngl 0 and ngl 33 to switch smoothly without a specific device selection; please correct me if I'm wrong.

We work on this in our spare time, so responses may be slow; please bear with us.

slaren (Collaborator) commented Mar 23, 2024

I would expect that with -ngl 0, the fastest accelerator available would be added to the list of backends in llama.cpp, so that it can be used to offload the computation of large batches. It should probably be the same device that would be chosen in single-GPU mode. The performance should increase gradually as more layers are offloaded, but with no layers offloaded it should still be significantly faster than the CPU alone (see the graphs in #6083).

The changes look good. The call to ggml_init_sycl should also be removed from ggml.c, and ggml-sycl.h should not be included in ggml.c. Instead, the backend should do its initialization the first time any of its functions are called. The goal is to remove all code from the backends in ggml.c.
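
A hedged sketch of the lazy-initialization pattern being asked for; the names below (backend_init_once, ggml_backend_hypothetical_op) are illustrative placeholders, not the real ggml-sycl symbols:

```cpp
// Each public entry point of the backend ensures initialization on first use,
// so ggml.c no longer needs to call ggml_init_sycl directly.
static void backend_init_once() {
    // C++11 guarantees the initializer of a function-local static runs exactly
    // once, even with concurrent callers, so the heavy setup happens lazily.
    static const bool initialized = [] {
        // ... device discovery, context/queue creation, etc. ...
        return true;
    }();
    (void) initialized;
}

void ggml_backend_hypothetical_op(/* ... */) {
    backend_init_once();
    // ... actual work ...
}
```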

airMeng (Collaborator, Author) commented Mar 24, 2024

> The changes look good. The call to ggml_init_sycl should also be removed from ggml.c, and ggml-sycl.h should not be included in ggml.c. Instead, the backend should do its initialization the first time any of its functions are called. The goal is to remove all code from the backends in ggml.c.

Done in 5f8a87d

abhilash1910 (Collaborator) left a comment


LGTM! Nice work using USM.

@airMeng airMeng merged commit ddf6568 into master Mar 24, 2024
58 checks passed
@airMeng airMeng deleted the sycl-offload-op branch March 25, 2024 07:27
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely