llama : add pipeline parallelism support #6017

Merged: 23 commits, Mar 13, 2024

Conversation

@slaren (Collaborator) commented Mar 12, 2024

Pipeline parallelism improves batch processing performance when using multiple GPUs.

Changes:

  • Graph inputs are allocated in the graph
    • Allocating inputs in the graph allows ggml_backend_sched to make multiple copies automatically, which is useful to reduce synchronization requirements with pipeline parallelism
    • Allocating inputs on the graph is more efficient since it allows their memory to be reused for other computations
    • Only the inputs actually required by the model are allocated
  • Automatic batch splitting in llama_decode
    • llama_decode automatically splits the batch into multiple smaller batches if it is too big for the configured compute batch size
    • The largest batch size that can be submitted to llama_decode is still limited by n_batch to reduce the size of the logits and embeddings buffers
  • Adds n_ubatch (-ub on the command line) to llama_context_params
    • n_batch sets the size of the logits and embeddings buffer, which limits the maximum batch size passed to llama_decode
    • n_ubatch sets the maximum batch size for computation
    • By default n_batch is 4096, n_ubatch is 512
    • This allows current applications to take advantage of pipeline parallelism by setting a larger n_batch without having to update their logic
  • Makes llama_decode asynchronous
    • Synchronization is done automatically on llama_get_logits and llama_get_embeddings
    • Adds llama_synchronize to force a synchronization manually, which can be useful when measuring the time of llama_decode
    • Applications can also take advantage of pipeline parallelism by making multiple calls to llama_decode without synchronizing in between (see the sketch after this list)
    • Note: llama_timings may not be accurate if the application does not synchronize immediately after calling llama_decode
  • Uses a host buffer for the logits and embeddings outputs (except for pooled embeddings), which improves performance when copying the data from the GPU
  • Multi-threaded ggml_get_rows (still single-threaded when full offload)
  • Adds an event interface to ggml-backend for synchronization between different backends
  • CUDA only; other backends will need to implement the async and event interfaces to take advantage of pipeline parallelism
  • Adds the build parameter LLAMA_SCHED_MAX_COPIES
    • This parameter configures the number of copies of the split inputs in ggml_backend_sched when using pipeline parallelism
    • Increasing this value allows more operations to be queued without requiring synchronization, and may improve performance on some systems at the expense of higher memory usage in the compute buffer
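
For orientation, here is a minimal sketch of how an application might use these changes. It is illustrative only (model path, prompt handling and error handling are placeholders), not code from this PR:

#include "llama.h"
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // full offload; pipeline parallelism requires the whole model on the GPUs
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 4096; // logical batch: sizes the logits/embeddings buffers
    cparams.n_ubatch = 512;  // physical batch: size of each split that is actually computed
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // Placeholder tokens; a real application would fill this with llama_tokenize.
    std::vector<llama_token> tokens(1024, 0);

    // llama_decode splits the batch into n_ubatch-sized pieces and pipelines them
    // across the available GPUs. The call returns without waiting for the GPU work.
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0);
    if (llama_decode(ctx, batch) != 0) {
        return 1;
    }

    llama_synchronize(ctx);                       // explicit sync, e.g. to time llama_decode
    const float * logits = llama_get_logits(ctx); // reading results also synchronizes
    (void) logits;

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}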

@phymbert (Collaborator)

Do you mind also adding --ubatch-size in server.cpp, please?

@ggerganov added the high priority label (Very important issue) on Mar 12, 2024
@ggerganov (Owner) commented Mar 12, 2024

Posting some results on 8x A100:

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./llama-bench \
    -m models/codellama-7b/ggml-model-f16.gguf \
    -m models/codellama-7b/ggml-model-q8_0.gguf \
    -m models/codellama-7b/ggml-model-q4_k.gguf \
    -m models/codellama-7b/ggml-model-q4_0.gguf \
    -ngl 99 -p 512,1024,2048,4096,8192 -b 8192

master (1x GPU):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8321.01 ± 78.06
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7832.04 ± 51.18
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7284.35 ± 21.52
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6276.87 ± 9.24
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4835.90 ± 4.67
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 74.12 ± 0.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6770.95 ± 86.88
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7065.25 ± 41.31
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6623.07 ± 20.11
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5780.57 ± 21.80
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4540.37 ± 5.68
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 115.28 ± 0.84
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6162.48 ± 75.34
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6709.16 ± 48.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6304.09 ± 17.74
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5529.16 ± 14.36
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4382.12 ± 6.16
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 131.24 ± 0.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6139.05 ± 52.82
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6694.85 ± 29.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6279.48 ± 30.08
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5519.16 ± 13.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4373.91 ± 3.77
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 145.86 ± 1.44

build: d8fd0cc (2412)

master (2x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8437.93 ± 56.78
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7935.64 ± 41.99
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7383.09 ± 12.74
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6334.79 ± 16.70
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4875.20 ± 8.59
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 73.96 ± 0.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6907.67 ± 21.79
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7187.15 ± 34.47
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6742.78 ± 13.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5855.42 ± 12.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4580.27 ± 5.76
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 115.18 ± 0.86
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6314.50 ± 35.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6851.88 ± 27.60
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6430.31 ± 8.31
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5627.43 ± 16.12
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4435.21 ± 9.20
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 130.87 ± 1.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6254.15 ± 48.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6801.81 ± 46.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6399.32 ± 8.18
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5603.93 ± 9.26
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4419.86 ± 5.96
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 145.50 ± 1.17

build: d8fd0cc (2412)

master (4x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8196.19 ± 61.77
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7799.21 ± 17.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7279.24 ± 6.64
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6267.52 ± 10.16
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4826.63 ± 4.70
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 73.24 ± 0.07
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6792.78 ± 9.64
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7122.67 ± 10.67
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6687.84 ± 6.44
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5812.81 ± 7.81
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4549.54 ± 5.62
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 112.97 ± 0.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6235.15 ± 16.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6790.49 ± 19.86
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6398.79 ± 8.87
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5596.10 ± 6.54
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4417.32 ± 3.00
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 127.61 ± 0.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6165.95 ± 12.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6742.28 ± 12.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6338.36 ± 17.06
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5561.52 ± 8.61
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4394.38 ± 1.90
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 142.00 ± 0.98

build: d8fd0cc (2412)

master (8x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 7705.18 ± 83.77
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7372.18 ± 18.39
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 6902.99 ± 13.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 5966.81 ± 5.21
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4627.64 ± 6.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 71.74 ± 0.15
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6472.16 ± 7.31
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 6782.80 ± 5.23
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6374.71 ± 7.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5565.41 ± 1.94
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4375.93 ± 4.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 109.13 ± 0.77
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 5958.00 ± 4.99
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6488.75 ± 5.99
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6114.23 ± 7.09
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5370.20 ± 3.53
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4255.44 ± 2.08
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 123.10 ± 1.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 5885.62 ± 9.11
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6419.71 ± 8.39
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6052.19 ± 9.94
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5322.86 ± 3.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4227.29 ± 2.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 136.12 ± 1.33

build: d8fd0cc (2412)

old (sl/micro-batching, 8x GPUs) with ub=256:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 512 12287.35 ± 404.42
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 1024 18428.65 ± 238.85
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 2048 24279.31 ± 225.94
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 4096 27046.41 ± 54.81
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 8192 24665.13 ± 61.06
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 tg 128 71.59 ± 1.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 512 9222.69 ± 4.70
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 1024 13907.00 ± 3.19
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 2048 17866.35 ± 13.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 4096 19697.05 ± 14.58
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 8192 18544.07 ± 6.93
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 tg 128 89.72 ± 2.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 512 8570.40 ± 4.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 1024 12955.56 ± 9.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 2048 16745.54 ± 4.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 4096 18631.20 ± 2.73
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 8192 17652.17 ± 2.92
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 tg 128 128.17 ± 4.96

build: af789e7 (1861)

new (x1 GPU, LLAMA_SCHED_MAX_COPIES=2):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 512 8237.93 ± 21.75
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 1024 7823.11 ± 17.30
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6974.75 ± 9.76
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5594.67 ± 4.54
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4527.03 ± 8.31
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 tg 128 73.87 ± 0.16
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 512 6728.34 ± 16.10
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 1024 7055.80 ± 8.37
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6674.59 ± 9.64
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5496.80 ± 7.27
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4449.37 ± 4.71
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 tg 128 114.80 ± 0.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 512 6125.22 ± 13.83
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 1024 6690.14 ± 19.46
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6501.66 ± 6.32
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5427.55 ± 8.96
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4404.22 ± 3.19
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 tg 128 129.97 ± 0.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 512 6109.62 ± 5.88
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 1024 6678.93 ± 21.22
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6496.22 ± 14.22
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5418.54 ± 4.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4408.39 ± 3.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 tg 128 145.15 ± 0.50

build: 54cdd47 (2424)

new (x2 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 10090.00 ± 25.47
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 11267.24 ± 20.31
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 11404.60 ± 14.40
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 10493.05 ± 10.40
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 8587.92 ± 4.60
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 72.08 ± 0.21
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 7276.31 ± 13.11
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 8212.10 ± 7.13
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 8474.57 ± 12.77
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 8044.92 ± 16.42
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 6922.49 ± 3.52
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 111.61 ± 0.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 6241.75 ± 10.47
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 7085.78 ± 15.74
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 7372.92 ± 5.08
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 7087.04 ± 3.72
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 6186.62 ± 2.76
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 126.62 ± 0.41
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 6198.75 ± 6.54
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 7045.06 ± 7.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 7345.98 ± 3.19
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 7083.79 ± 3.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 6192.83 ± 2.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 140.12 ± 0.65

build: 54cdd47 (2424)

new (x4 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 11813.02 ± 22.83
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 15387.27 ± 33.61
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 17609.80 ± 41.84
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 17618.77 ± 18.29
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 15100.46 ± 21.89
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 71.87 ± 0.22
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 8604.28 ± 4.99
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 11357.60 ± 4.19
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 13201.46 ± 18.92
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 13411.17 ± 21.39
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 11882.45 ± 10.51
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 110.12 ± 0.59
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 7401.93 ± 6.81
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 9828.07 ± 11.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 11516.67 ± 20.88
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 11881.12 ± 20.03
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 10722.42 ± 24.25
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 124.27 ± 0.74
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 7293.93 ± 11.64
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 9681.32 ± 20.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 11390.60 ± 12.97
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 11770.08 ± 14.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 10639.98 ± 10.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 137.21 ± 0.81

build: 54cdd47 (2424)

new (x8 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 12100.57 ± 193.33
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 17591.46 ± 132.00
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 22931.45 ± 144.20
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 25504.36 ± 434.87
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 23820.97 ± 188.92
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 71.87 ± 0.02
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 9266.34 ± 59.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 13796.83 ± 29.05
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 17820.71 ± 327.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 20105.47 ± 38.39
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 18915.61 ± 58.18
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 108.56 ± 0.90
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 7983.52 ± 84.85
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 12007.61 ± 19.79
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 15755.47 ± 74.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 17909.72 ± 28.69
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 17152.45 ± 22.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 122.04 ± 0.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 7908.16 ± 5.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 11764.20 ± 142.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 15567.45 ± 33.99
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 17668.75 ± 59.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 16992.44 ± 27.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 135.03 ± 1.31

build: 54cdd47 (2424)

ppl, -c 512 -b 2048, -ub 256 - runtime: 30.7s

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-7b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw 

llm_load_tensors: ggml ctx size = 1.00 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 250.12 MiB
llm_load_tensors: CUDA0 buffer size = 1930.16 MiB
llm_load_tensors: CUDA1 buffer size = 1544.12 MiB
llm_load_tensors: CUDA2 buffer size = 1544.12 MiB
llm_load_tensors: CUDA3 buffer size = 1544.12 MiB
llm_load_tensors: CUDA4 buffer size = 1544.12 MiB
llm_load_tensors: CUDA5 buffer size = 1544.12 MiB
llm_load_tensors: CUDA6 buffer size = 1544.12 MiB
llm_load_tensors: CUDA7 buffer size = 1408.23 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 96.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.02 MiB
llama_new_context_with_model: graph splits: 9

system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 922.755 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.15 seconds per pass - ETA 0.38 minutes
[1]5.5511,[2]6.0573,[3]6.7593,[4]7.7007,[5]7.8618,[6]7.6842,[7]7.8791,[8]7.8360,[9]8.2430,[10]8.5690,[11]8.8467,[12]8.9639,[13]8.8943,[14]9.0010,[15]9.3134,[16]8.8007,[17]8.6196,[18]8.5497,[19]8.0678,[20]8.0193,[21]7.9075,[22]7.7105,[23]7.6356,[24]7.5056,[25]7.4911,[26]7.2615,[27]6.9952,[28]6.8504,[29]6.7198,[30]6.5173,[31]6.4668,[32]6.5103,[33]6.4616,[34]6.4974,[35]6.5202,[36]6.5535,[37]6.5503,[38]6.5514,[39]6.5706,[40]6.6289,[41]6.6420,[42]6.6959,[43]6.6409,[44]6.6959,[45]6.7062,[46]6.6758,[47]6.7094,[48]6.6758,[49]6.6698,[50]6.6089,[51]6.6233,[52]6.6110,[53]6.6717,[54]6.6556,[55]6.6299,[56]6.6832,[57]6.7115,[58]6.7425,[59]6.7600,[60]6.8152,[61]6.8079,[62]6.8741,[63]6.9046,[64]6.8991,[65]6.9468,[66]6.9558,[67]6.9631,[68]6.9864,[69]7.0091,[70]7.0463,[71]7.0751,[72]7.1178,[73]7.1782,[74]7.1854,[75]7.1957,[76]7.2122,[77]7.2331,[78]7.2206,[79]7.2511,[80]7.2403,[81]7.2675,[82]7.2900,[83]7.2264,[84]7.2237,[85]7.2272,[86]7.2067,[87]7.1662,[88]7.1429,[89]7.1244,[90]7.1093,[91]7.1429,[92]7.1385,[93]7.1324,[94]7.1418,[95]7.1824,[96]7.1791,[97]7.1767,[98]7.1758,[99]7.1554,[100]7.1581,[101]7.1865,[102]7.1820,[103]7.2073,[104]7.2144,[105]7.2079,[106]7.2312,[107]7.2356,[108]7.2507,[109]7.2459,[110]7.2401,[111]7.2658,[112]7.2937,[113]7.2966,[114]7.2962,[115]7.3049,[116]7.2989,[117]7.3076,[118]7.3354,[119]7.3648,[120]7.4016,[121]7.4241,[122]7.4536,[123]7.4954,[124]7.5155,[125]7.4993,[126]7.5349,[127]7.5734,[128]7.6024,[129]7.5849,[130]7.5883,[131]7.5831,[132]7.5721,[133]7.5550,[134]7.5628,[135]7.5569,[136]7.5508,[137]7.5394,[138]7.5140,[139]7.5074,[140]7.5004,[141]7.4701,[142]7.4698,[143]7.4386,[144]7.4145,[145]7.4026,[146]7.3901,[147]7.3926,[148]7.3947,[149]7.3940,[150]7.3950,[151]7.3984,[152]7.3873,[153]7.3681,[154]7.3583,[155]7.3625,[156]7.3602,[157]7.3794,[158]7.3808,[159]7.3843,[160]7.3906,[161]7.4065,[162]7.3679,[163]7.3529,[164]7.3240,[165]7.2897,[166]7.2563,[167]7.2055,[168]7.1706,[169]7.1610,[170]7.1486,[171]7.1179,[172]7.0995,[173]7.0834,[174]7.0533,[175]7.0260,[176]7.0116,[177]6.9880,[178]6.9643,[179]6.9449,[180]6.9396,[181]6.9200,[182]6.8981,[183]6.8821,[184]6.8821,[185]6.8762,[186]6.8799,[187]6.8900,[188]6.8914,[189]6.9165,[190]6.9204,[191]6.9462,[192]6.9646,[193]6.9863,[194]7.0012,[195]7.0284,[196]7.0463,[197]7.0716,[198]7.0863,[199]7.0885,[200]7.0948,[201]7.0876,[202]7.1108,[203]7.1237,[204]7.1431,[205]7.1612,[206]7.1705,[207]7.1669,[208]7.1796,[209]7.1845,[210]7.1894,[211]7.2037,[212]7.2112,[213]7.2223,[214]7.2253,[215]7.2263,[216]7.2405,[217]7.2590,[218]7.2737,[219]7.2695,[220]7.2644,[221]7.2528,[222]7.2511,[223]7.2399,[224]7.2308,[225]7.2227,[226]7.2459,[227]7.2548,[228]7.2605,[229]7.2638,[230]7.2600,[231]7.2770,[232]7.2661,[233]7.2424,[234]7.2223,[235]7.1999,[236]7.1899,[237]7.1746,[238]7.1772,[239]7.1613,[240]7.1496,[241]7.1493,[242]7.1494,[243]7.1434,[244]7.1286,[245]7.1240,[246]7.1083,[247]7.0942,[248]7.0844,[249]7.0797,[250]7.0844,[251]7.0735,[252]7.0683,[253]7.0559,[254]7.0456,[255]7.0301,[256]7.0075,[257]6.9929,[258]6.9799,[259]6.9761,[260]6.9634,[261]6.9565,[262]6.9477,[263]6.9374,[264]6.9211,[265]6.9218,[266]6.9153,[267]6.9052,[268]6.9166,[269]6.9204,[270]6.9187,[271]6.9293,[272]6.9360,[273]6.9367,[274]6.9365,[275]6.9453,[276]6.9501,[277]6.9676,[278]6.9776,[279]6.9893,[280]6.9940,[281]7.0085,[282]7.0149,[283]7.0307,[284]7.0399,[285]7.0482,[286]7.0631,[287]7.0621,[288]7.0664,[289]7.0547,[290]7.0401,[291]7.0271,[292]7.0130,[293]6.9996,[294]7.0019,[295]7.0022,[296]7.0082,[297]7.0083,[298]7.0143,[299]7.0128,[300]7.0034,[301]7.0033,[302]6.9977,[303]6.9883,[304]6.9792,[305]6.9776,[30
6]6.9637,[307]6.9657,[308]6.9651,[309]6.9480,[310]6.9422,[311]6.9368,[312]6.9407,[313]6.9375,[314]6.9383,[315]6.9206,[316]6.9212,[317]6.8998,[318]6.8745,[319]6.8893,[320]6.9036,[321]6.9058,[322]6.8981,[323]6.8981,[324]6.9013,[325]6.9164,[326]6.9179,[327]6.9223,[328]6.9251,[329]6.9330,[330]6.9426,[331]6.9580,[332]6.9545,[333]6.9661,[334]6.9597,[335]6.9514,[336]6.9532,[337]6.9498,[338]6.9510,[339]6.9467,[340]6.9415,[341]6.9477,[342]6.9479,[343]6.9532,[344]6.9538,[345]6.9535,[346]6.9495,[347]6.9521,[348]6.9556,[349]6.9591,[350]6.9571,[351]6.9569,[352]6.9573,[353]6.9502,[354]6.9500,[355]6.9559,[356]6.9605,[357]6.9549,[358]6.9657,[359]6.9688,[360]6.9646,[361]6.9628,[362]6.9721,[363]6.9853,[364]6.9924,[365]6.9977,[366]6.9991,[367]7.0063,[368]7.0008,[369]7.0019,[370]7.0035,[371]6.9969,[372]7.0010,[373]7.0051,[374]7.0013,[375]6.9996,[376]7.0100,[377]7.0028,[378]7.0065,[379]7.0141,[380]7.0042,[381]7.0010,[382]6.9944,[383]6.9944,[384]6.9932,[385]6.9938,[386]6.9941,[387]6.9961,[388]6.9923,[389]6.9873,[390]6.9804,[391]6.9732,[392]6.9752,[393]6.9797,[394]6.9863,[395]6.9843,[396]6.9754,[397]6.9850,[398]6.9921,[399]7.0012,[400]7.0003,[401]7.0017,[402]7.0040,[403]7.0057,[404]7.0127,[405]7.0102,[406]7.0073,[407]7.0125,[408]7.0133,[409]7.0256,[410]7.0396,[411]7.0541,[412]7.0736,[413]7.0866,[414]7.0969,[415]7.1036,[416]7.1153,[417]7.1269,[418]7.1320,[419]7.1385,[420]7.1509,[421]7.1644,[422]7.1703,[423]7.1794,[424]7.1918,[425]7.2047,[426]7.2135,[427]7.2187,[428]7.2303,[429]7.2368,[430]7.2494,[431]7.2648,[432]7.2681,[433]7.2650,[434]7.2590,[435]7.2599,[436]7.2629,[437]7.2752,[438]7.2845,[439]7.2789,[440]7.2778,[441]7.2726,[442]7.2706,[443]7.2708,[444]7.2715,[445]7.2686,[446]7.2708,[447]7.2739,[448]7.2799,[449]7.2768,[450]7.2767,[451]7.2715,[452]7.2706,[453]7.2622,[454]7.2568,[455]7.2581,[456]7.2614,[457]7.2639,[458]7.2631,[459]7.2628,[460]7.2732,[461]7.2714,[462]7.2717,[463]7.2781,[464]7.2772,[465]7.2734,[466]7.2652,[467]7.2690,[468]7.2719,[469]7.2758,[470]7.2767,[471]7.2726,[472]7.2797,[473]7.2732,[474]7.2786,[475]7.2801,[476]7.2828,[477]7.2762,[478]7.2753,[479]7.2892,[480]7.2962,[481]7.2997,[482]7.2952,[483]7.2913,[484]7.2962,[485]7.2961,[486]7.2903,[487]7.2945,[488]7.2946,[489]7.2898,[490]7.2903,[491]7.2909,[492]7.2878,[493]7.2842,[494]7.2825,[495]7.2852,[496]7.2831,[497]7.2814,[498]7.2821,[499]7.2758,[500]7.2668,[501]7.2613,[502]7.2659,[503]7.2653,[504]7.2569,[505]7.2604,[506]7.2614,[507]7.2561,[508]7.2501,[509]7.2494,[510]7.2528,[511]7.2609,[512]7.2636,[513]7.2651,[514]7.2720,[515]7.2664,[516]7.2648,[517]7.2661,[518]7.2668,[519]7.2710,[520]7.2733,[521]7.2759,[522]7.2794,[523]7.2805,[524]7.2872,[525]7.2914,[526]7.2918,[527]7.2945,[528]7.2889,[529]7.2914,[530]7.2835,[531]7.2816,[532]7.2885,[533]7.2916,[534]7.2892,[535]7.2931,[536]7.2878,[537]7.2852,[538]7.2916,[539]7.2940,[540]7.2978,[541]7.3027,[542]7.3019,[543]7.3033,[544]7.3033,[545]7.3012,[546]7.3024,[547]7.2970,[548]7.2887,[549]7.2891,[550]7.2859,[551]7.2822,[552]7.2799,[553]7.2758,[554]7.2727,[555]7.2676,[556]7.2671,[557]7.2735,[558]7.2704,[559]7.2692,[560]7.2676,[561]7.2680,[562]7.2629,[563]7.2649,[564]7.2701,[565]7.2723,[566]7.2716,[567]7.2699,[568]7.2691,[569]7.2664,[570]7.2692,[571]7.2697,[572]7.2696,[573]7.2699,[574]7.2657,[575]7.2640,[576]7.2626,[577]7.2598,[578]7.2574,[579]7.2565,[580]7.2485,[581]7.2447,[582]7.2445,[583]7.2444,[584]7.2439,[585]7.2345,[586]7.2259,[587]7.2256,[588]7.2309,[589]7.2378,[590]7.2396,[591]7.2412,[592]7.2396,[593]7.2352,[594]7.2354,[595]7.2323,[596]7.2361,[597]7.2316,[598]7.2292,[599]7.2312,[600]7.2305,[601]7.2294,[602]7
.2348,[603]7.2367,[604]7.2393,[605]7.2428,[606]7.2447,[607]7.2455,[608]7.2413,[609]7.2415,[610]7.2460,[611]7.2428,[612]7.2441,[613]7.2384,[614]7.2325,[615]7.2219,[616]7.2245,[617]7.2152,[618]7.2070,[619]7.1987,[620]7.1805,[621]7.1720,[622]7.1704,[623]7.1718,[624]7.1710,[625]7.1699,[626]7.1686,[627]7.1717,[628]7.1713,[629]7.1710,[630]7.1741,[631]7.1794,[632]7.1853,[633]7.1833,[634]7.1873,[635]7.1882,[636]7.1848,[637]7.1813,[638]7.1843,[639]7.1801,[640]7.1824,[641]7.1821,[642]7.1901,[643]7.1924,[644]7.1930,[645]7.1911,[646]7.1956,[647]7.1954,[648]7.1974,[649]7.1978,[650]7.2026,[651]7.2079,[652]7.2096,[653]7.2143,[654]7.2065,[655]7.2061,
Final estimate: PPL = 7.2061 +/- 0.04271

llama_print_timings: load time = 1602.72 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 15310.80 ms / 335360 tokens ( 0.05 ms per token, 21903.50 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 25860.90 ms / 335361 tokens

real 0m30.712s
user 1m12.300s
sys 0m25.268s

master (1x GPU, 13B):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 512 5111.70 ± 47.07
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 1024 4922.35 ± 35.87
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 2048 4413.17 ± 10.56
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 tg 128 45.28 ± 0.05
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 512 4036.98 ± 25.68
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 1024 4350.98 ± 14.85
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 2048 4162.22 ± 9.86
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 tg 128 68.79 ± 0.04
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 512 3615.92 ± 20.41
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 1024 4062.81 ± 27.69
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 2048 4006.28 ± 11.71
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 tg 128 83.52 ± 0.05
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 512 3608.53 ± 20.18
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 1024 4059.13 ± 20.43
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 2048 4004.22 ± 12.23
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 tg 128 95.35 ± 0.08

build: 99b71c0 (2410)

new (x8 GPUs, 13B):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 512 8142.20 ± 3.99
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 1024 11977.25 ± 18.75
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 2048 15377.04 ± 19.31
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 4096 16842.23 ± 240.43
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 8192 15316.75 ± 11.69
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 tg 128 43.96 ± 0.12
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 512 5605.35 ± 3.46
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 1024 8428.62 ± 0.80
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 2048 10686.25 ± 14.22
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 4096 11824.00 ± 20.43
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 8192 11228.18 ± 12.87
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 tg 128 66.32 ± 0.38
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 512 4775.66 ± 3.41
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 1024 7225.87 ± 4.09
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 2048 9251.36 ± 4.16
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 4096 10346.24 ± 19.27
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 8192 10018.40 ± 6.89
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 tg 128 80.95 ± 0.48
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 512 4731.63 ± 2.33
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 1024 7174.17 ± 4.75
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 2048 9167.04 ± 2.94
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 4096 10255.81 ± 20.46
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 8192 9943.68 ± 11.66
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 tg 128 90.05 ± 0.70

build: 54cdd47 (2424)

ppl (13B), -c 512 -b 2048, -ub 256 - runtime: 39.3s

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-13b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw 

llm_load_tensors: ggml ctx size = 1.25 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 312.66 MiB
llm_load_tensors: CUDA0 buffer size = 3630.23 MiB
llm_load_tensors: CUDA1 buffer size = 3025.20 MiB
llm_load_tensors: CUDA2 buffer size = 3025.20 MiB
llm_load_tensors: CUDA3 buffer size = 3025.20 MiB
llm_load_tensors: CUDA4 buffer size = 3025.20 MiB
llm_load_tensors: CUDA5 buffer size = 3025.20 MiB
llm_load_tensors: CUDA6 buffer size = 3025.20 MiB
llm_load_tensors: CUDA7 buffer size = 2732.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 240.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 21.02 MiB
llama_new_context_with_model: graph splits: 9

system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 969.675 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.21 seconds per pass - ETA 0.57 minutes
[1]4.7515,[2]5.5743,[3]6.2923,[4]7.1020,[5]7.3440,[6]7.1685,[7]7.3431,[8]7.3339,[9]7.6557,[10]7.9098,[11]8.1824,[12]8.2266,[13]8.1402,[14]8.3017,[15]8.6143,[16]8.1901,[17]8.0203,[18]7.9660,[19]7.5368,[20]7.5153,[21]7.3838,[22]7.1791,[23]7.1297,[24]7.0125,[25]7.0057,[26]6.7937,[27]6.5418,[28]6.3808,[29]6.2788,[30]6.0670,[31]6.0166,[32]6.0573,[33]6.0083,[34]6.0517,[35]6.0759,[36]6.1186,[37]6.0951,[38]6.0937,[39]6.1068,[40]6.1666,[41]6.1816,[42]6.2251,[43]6.1683,[44]6.2104,[45]6.2241,[46]6.1909,[47]6.2133,[48]6.1786,[49]6.1735,[50]6.1243,[51]6.1308,[52]6.1158,[53]6.1663,[54]6.1518,[55]6.1288,[56]6.1687,[57]6.1866,[58]6.2100,[59]6.2220,[60]6.2663,[61]6.2574,[62]6.3202,[63]6.3439,[64]6.3468,[65]6.3853,[66]6.3889,[67]6.3957,[68]6.4132,[69]6.4349,[70]6.4674,[71]6.4919,[72]6.5292,[73]6.5872,[74]6.5989,[75]6.6017,[76]6.6165,[77]6.6342,[78]6.6272,[79]6.6555,[80]6.6523,[81]6.6707,[82]6.6828,[83]6.6300,[84]6.6309,[85]6.6321,[86]6.6132,[87]6.5768,[88]6.5584,[89]6.5367,[90]6.5242,[91]6.5494,[92]6.5431,[93]6.5359,[94]6.5396,[95]6.5709,[96]6.5721,[97]6.5615,[98]6.5512,[99]6.5314,[100]6.5349,[101]6.5650,[102]6.5594,[103]6.5811,[104]6.5868,[105]6.5803,[106]6.6011,[107]6.6042,[108]6.6121,[109]6.6110,[110]6.6045,[111]6.6265,[112]6.6494,[113]6.6511,[114]6.6513,[115]6.6577,[116]6.6475,[117]6.6507,[118]6.6772,[119]6.7020,[120]6.7391,[121]6.7564,[122]6.7807,[123]6.8201,[124]6.8343,[125]6.8196,[126]6.8549,[127]6.8890,[128]6.9170,[129]6.8986,[130]6.9033,[131]6.8992,[132]6.8917,[133]6.8754,[134]6.8875,[135]6.8839,[136]6.8761,[137]6.8654,[138]6.8416,[139]6.8363,[140]6.8337,[141]6.8074,[142]6.8039,[143]6.7750,[144]6.7536,[145]6.7413,[146]6.7318,[147]6.7340,[148]6.7389,[149]6.7376,[150]6.7390,[151]6.7435,[152]6.7361,[153]6.7216,[154]6.7140,[155]6.7182,[156]6.7190,[157]6.7372,[158]6.7420,[159]6.7459,[160]6.7538,[161]6.7710,[162]6.7381,[163]6.7260,[164]6.7003,[165]6.6716,[166]6.6415,[167]6.5974,[168]6.5694,[169]6.5591,[170]6.5475,[171]6.5199,[172]6.5045,[173]6.4907,[174]6.4609,[175]6.4382,[176]6.4240,[177]6.4023,[178]6.3802,[179]6.3635,[180]6.3575,[181]6.3416,[182]6.3212,[183]6.3065,[184]6.3064,[185]6.3032,[186]6.3078,[187]6.3188,[188]6.3199,[189]6.3446,[190]6.3471,[191]6.3713,[192]6.3893,[193]6.4072,[194]6.4224,[195]6.4466,[196]6.4630,[197]6.4856,[198]6.5021,[199]6.5066,[200]6.5136,[201]6.5077,[202]6.5295,[203]6.5412,[204]6.5503,[205]6.5645,[206]6.5725,[207]6.5681,[208]6.5817,[209]6.5853,[210]6.5894,[211]6.6043,[212]6.6106,[213]6.6198,[214]6.6245,[215]6.6243,[216]6.6345,[217]6.6509,[218]6.6648,[219]6.6618,[220]6.6608,[221]6.6477,[222]6.6465,[223]6.6350,[224]6.6276,[225]6.6201,[226]6.6419,[227]6.6499,[228]6.6569,[229]6.6612,[230]6.6582,[231]6.6721,[232]6.6608,[233]6.6399,[234]6.6229,[235]6.5969,[236]6.5903,[237]6.5777,[238]6.5783,[239]6.5635,[240]6.5497,[241]6.5486,[242]6.5490,[243]6.5424,[244]6.5292,[245]6.5252,[246]6.5119,[247]6.4984,[248]6.4879,[249]6.4835,[250]6.4865,[251]6.4765,[252]6.4718,[253]6.4596,[254]6.4509,[255]6.4359,[256]6.4159,[257]6.4019,[258]6.3910,[259]6.3876,[260]6.3756,[261]6.3697,[262]6.3622,[263]6.3531,[264]6.3359,[265]6.3355,[266]6.3300,[267]6.3235,[268]6.3349,[269]6.3385,[270]6.3401,[271]6.3501,[272]6.3558,[273]6.3565,[274]6.3549,[275]6.3623,[276]6.3679,[277]6.3819,[278]6.3923,[279]6.4024,[280]6.4065,[281]6.4176,[282]6.4230,[283]6.4373,[284]6.4463,[285]6.4552,[286]6.4717,[287]6.4694,[288]6.4729,[289]6.4627,[290]6.4507,[291]6.4391,[292]6.4267,[293]6.4133,[294]6.4148,[295]6.4170,[296]6.4237,[297]6.4246,[298]6.4292,[299]6.4272,[300]6.4185,[301]6.4203,[302]6.4145,[303]6.4065,[304]6.3975,[305]6.3954,[30
6]6.3836,[307]6.3860,[308]6.3862,[309]6.3712,[310]6.3669,[311]6.3623,[312]6.3648,[313]6.3619,[314]6.3628,[315]6.3458,[316]6.3434,[317]6.3241,[318]6.3019,[319]6.3154,[320]6.3273,[321]6.3286,[322]6.3214,[323]6.3207,[324]6.3248,[325]6.3398,[326]6.3410,[327]6.3440,[328]6.3467,[329]6.3532,[330]6.3590,[331]6.3731,[332]6.3688,[333]6.3790,[334]6.3732,[335]6.3674,[336]6.3687,[337]6.3651,[338]6.3659,[339]6.3609,[340]6.3549,[341]6.3606,[342]6.3624,[343]6.3670,[344]6.3670,[345]6.3674,[346]6.3647,[347]6.3676,[348]6.3702,[349]6.3736,[350]6.3720,[351]6.3717,[352]6.3708,[353]6.3638,[354]6.3618,[355]6.3680,[356]6.3736,[357]6.3680,[358]6.3786,[359]6.3817,[360]6.3768,[361]6.3760,[362]6.3854,[363]6.3971,[364]6.4042,[365]6.4087,[366]6.4100,[367]6.4187,[368]6.4150,[369]6.4159,[370]6.4180,[371]6.4115,[372]6.4165,[373]6.4203,[374]6.4182,[375]6.4181,[376]6.4260,[377]6.4201,[378]6.4225,[379]6.4266,[380]6.4187,[381]6.4174,[382]6.4118,[383]6.4120,[384]6.4119,[385]6.4122,[386]6.4108,[387]6.4122,[388]6.4087,[389]6.4035,[390]6.3968,[391]6.3893,[392]6.3888,[393]6.3913,[394]6.3975,[395]6.3965,[396]6.3880,[397]6.3968,[398]6.4013,[399]6.4093,[400]6.4078,[401]6.4078,[402]6.4109,[403]6.4134,[404]6.4204,[405]6.4161,[406]6.4146,[407]6.4190,[408]6.4209,[409]6.4331,[410]6.4453,[411]6.4581,[412]6.4769,[413]6.4899,[414]6.4994,[415]6.5063,[416]6.5153,[417]6.5260,[418]6.5301,[419]6.5360,[420]6.5462,[421]6.5586,[422]6.5638,[423]6.5708,[424]6.5822,[425]6.5930,[426]6.6008,[427]6.6052,[428]6.6157,[429]6.6220,[430]6.6324,[431]6.6473,[432]6.6511,[433]6.6487,[434]6.6432,[435]6.6449,[436]6.6479,[437]6.6584,[438]6.6675,[439]6.6628,[440]6.6633,[441]6.6583,[442]6.6560,[443]6.6565,[444]6.6572,[445]6.6552,[446]6.6574,[447]6.6596,[448]6.6651,[449]6.6622,[450]6.6619,[451]6.6574,[452]6.6568,[453]6.6497,[454]6.6455,[455]6.6463,[456]6.6501,[457]6.6530,[458]6.6518,[459]6.6514,[460]6.6611,[461]6.6594,[462]6.6604,[463]6.6651,[464]6.6640,[465]6.6612,[466]6.6541,[467]6.6583,[468]6.6617,[469]6.6656,[470]6.6659,[471]6.6624,[472]6.6693,[473]6.6634,[474]6.6679,[475]6.6683,[476]6.6705,[477]6.6653,[478]6.6659,[479]6.6783,[480]6.6844,[481]6.6876,[482]6.6837,[483]6.6808,[484]6.6849,[485]6.6842,[486]6.6791,[487]6.6824,[488]6.6813,[489]6.6775,[490]6.6781,[491]6.6778,[492]6.6753,[493]6.6720,[494]6.6706,[495]6.6722,[496]6.6699,[497]6.6684,[498]6.6700,[499]6.6648,[500]6.6557,[501]6.6508,[502]6.6547,[503]6.6548,[504]6.6469,[505]6.6490,[506]6.6502,[507]6.6438,[508]6.6375,[509]6.6372,[510]6.6389,[511]6.6451,[512]6.6482,[513]6.6498,[514]6.6552,[515]6.6503,[516]6.6496,[517]6.6511,[518]6.6508,[519]6.6542,[520]6.6565,[521]6.6581,[522]6.6613,[523]6.6632,[524]6.6701,[525]6.6747,[526]6.6761,[527]6.6787,[528]6.6736,[529]6.6750,[530]6.6685,[531]6.6665,[532]6.6725,[533]6.6751,[534]6.6736,[535]6.6767,[536]6.6723,[537]6.6705,[538]6.6757,[539]6.6772,[540]6.6797,[541]6.6831,[542]6.6831,[543]6.6849,[544]6.6856,[545]6.6836,[546]6.6846,[547]6.6793,[548]6.6715,[549]6.6718,[550]6.6695,[551]6.6658,[552]6.6636,[553]6.6601,[554]6.6579,[555]6.6536,[556]6.6533,[557]6.6595,[558]6.6565,[559]6.6563,[560]6.6539,[561]6.6543,[562]6.6497,[563]6.6502,[564]6.6555,[565]6.6582,[566]6.6584,[567]6.6563,[568]6.6563,[569]6.6540,[570]6.6560,[571]6.6559,[572]6.6558,[573]6.6550,[574]6.6506,[575]6.6488,[576]6.6474,[577]6.6449,[578]6.6429,[579]6.6416,[580]6.6339,[581]6.6304,[582]6.6306,[583]6.6305,[584]6.6308,[585]6.6216,[586]6.6133,[587]6.6131,[588]6.6172,[589]6.6236,[590]6.6247,[591]6.6269,[592]6.6253,[593]6.6214,[594]6.6215,[595]6.6188,[596]6.6224,[597]6.6184,[598]6.6174,[599]6.6196,[600]6.6185,[601]6.6173,[602]6
.6227,[603]6.6240,[604]6.6267,[605]6.6298,[606]6.6316,[607]6.6323,[608]6.6270,[609]6.6270,[610]6.6306,[611]6.6279,[612]6.6297,[613]6.6241,[614]6.6166,[615]6.6077,[616]6.6093,[617]6.6009,[618]6.5941,[619]6.5868,[620]6.5708,[621]6.5625,[622]6.5601,[623]6.5617,[624]6.5614,[625]6.5608,[626]6.5594,[627]6.5634,[628]6.5628,[629]6.5624,[630]6.5658,[631]6.5707,[632]6.5768,[633]6.5751,[634]6.5783,[635]6.5790,[636]6.5765,[637]6.5732,[638]6.5749,[639]6.5715,[640]6.5731,[641]6.5729,[642]6.5797,[643]6.5822,[644]6.5830,[645]6.5809,[646]6.5849,[647]6.5844,[648]6.5861,[649]6.5860,[650]6.5894,[651]6.5940,[652]6.5949,[653]6.5985,[654]6.5921,[655]6.5914,
Final estimate: PPL = 6.5914 +/- 0.03810

llama_print_timings: load time = 2846.46 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 22401.71 ms / 335360 tokens ( 0.07 ms per token, 14970.29 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 33071.41 ms / 335361 tokens

real 0m39.310s
user 1m16.005s
sys 0m31.222s

@slaren (Collaborator, Author) commented Mar 12, 2024

You can usually get better performance with F16 models by using -ub 256. I set the default to 512 because that's the current default and it works better with quantized models. In the previous PR it was 256.

It is also possible to test the performance in a real scenario with perplexity.

@compilade (Collaborator) left a comment

This breaks Mamba, but it's fixable.

To help with fixing, I managed to at least make main work with compilade@3e06fca, but parallel still triggers an assert with Mamba. I'll investigate further.

@slaren (Collaborator, Author) commented Mar 12, 2024

@compilade thanks for testing, feel free to push your fixes here directly.

compilade and others added 3 commits March 12, 2024 14:59
Tested to work correctly with both `main` and `parallel` examples.
…r pipeline parallelism

default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage
@Engininja2 (Contributor)

hipLaunchHostFunc() was added in ROCm 5.4.1 as a beta API. Should that minimum version be added to the readme?

@slaren (Collaborator, Author) commented Mar 13, 2024

hipLaunchHostFunc is not used at the moment. It may be useful in the future for adding support for pipeline parallelism between the CPU and CUDA backends, but at the moment it is just a stub.

@compilade dismissed their stale review on March 13, 2024 at 01:20

It works properly with Mamba.

@compilade (Collaborator) left a comment

I like that the input tensors are allocated in the graph; it's much cleaner than before.

But I think the new default size of the logits buffer might be too big.

llama.cpp (outdated)
@@ -12537,7 +12582,8 @@ struct llama_context_params llama_context_default_params() {
     struct llama_context_params result = {
         /*.seed =*/ LLAMA_DEFAULT_SEED,
         /*.n_ctx =*/ 512,
-        /*.n_batch =*/ 512,
+        /*.n_batch =*/ 4096,
@compilade (Collaborator) commented Mar 13, 2024

4096 seems a bit big for a default logical batch size.
For a model with a vocab size of 50280, the logits buffer takes 50280*4096*4/1024/1024 = 785.63 MiB, while with the previous default batch size of 512, the logits buffer took 50280*512*4/1024/1024 = 98.20 MiB.
This only depends on the vocab and logical batch sizes, so the logits buffer for a small model like Mamba-130m (a 256.96 MiB model in f16) would take 3 times as much memory as the model weights with a default n_batch of 4096.

And it doesn't really fit with the default n_ctx of 512; a bigger n_batch than n_ctx won't ever be used completely (unless there's a way I didn't think of), and is thus wasted memory.

I suggest either clamping n_batch to n_ctx, or (preferably) making the default n_batch equal to the default n_ctx again.
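
As a quick sanity check of those figures (a standalone illustrative snippet, not project code):

#include <cstdio>

int main() {
    // logits buffer size = n_vocab * n_batch * sizeof(float)
    const double n_vocab = 50280;
    printf("n_batch = 512:  %.2f MiB\n", n_vocab * 512  * 4 / 1024 / 1024); // ~98.2 MiB
    printf("n_batch = 4096: %.2f MiB\n", n_vocab * 4096 * 4 / 1024 / 1024); // ~785.6 MiB
    return 0;
}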

@slaren (Collaborator, Author) commented Mar 13, 2024

A bigger batch size is what allows pipeline parallelism to work. For example, if the application submits a batch of 4096 tokens to llama_decode, it will be split into mini-batches of 512 tokens each (n_ubatch) and evaluated as a pipeline in parallel across the available GPUs. The default could be reduced to 2048 and still get most of the benefit.

@compilade (Collaborator) commented Mar 13, 2024

Pipeline parallelism by default seems desirable. But n_batch shouldn't exceed n_ctx. Even when passing --ctx-size 128 to main, n_batch is still 4096 (from the default), and the 785.63 MiB of logits are still allocated, even if they can't be used since a batch bigger than n_ctx will simply not be able to find a big enough KV slot (for Transformer-based models, at least).

Clamping n_batch to n_ctx (with something like n_batch = std::min(n_batch, n_ctx)) should fix this.

@slaren (Collaborator, Author) commented Mar 13, 2024

We can change llama_new_context_with_model to limit n_batch and n_ubatch to n_ctx, since there is no advantage to increasing it beyond that.

Ideally, we would set the defaults intelligently according to the hardware of the system, and only increase n_batch if the system can actually support pipeline parallelism, which requires several CUDA GPUs and full offload. However that would require deeper changes.
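
A minimal sketch of that clamp (a hypothetical helper; the names and placement are illustrative, not the exact code that was merged):

#include <algorithm>
#include <cstdint>

// Limit the logical batch (n_batch) to the context size, and the physical batch
// (n_ubatch) to the logical batch, since larger values cannot be used anyway.
static void clamp_batch_sizes(uint32_t n_ctx, uint32_t & n_batch, uint32_t & n_ubatch) {
    n_batch  = std::min(n_batch,  n_ctx);
    n_ubatch = std::min(n_ubatch, n_batch);
}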

@ggerganov (Owner)

"But n_batch shouldn't exceed n_ctx"

This is not always the case. When using a model that does not utilize the KV cache (for example, a non-causal embedding model like BERT), we might want to run with n_ctx = 0, n_batch = 8192.

With 4400153 applied, in such cases we now have to allocate a KV cache due to n_ctx = 8192, and it won't be used.

Given that, should we revert the clamp change?

@slaren (Collaborator, Author)

We could pre-allocate buffers for n_max_seq tokens only during initialization, and increase the size dynamically in llama_decode automatically if there is a request for more logits than that.
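
A rough illustration of that idea (hypothetical, not code from this PR; the real output buffer is a pinned host buffer rather than a std::vector):

#include <vector>

// Grow the logits storage on demand: start with space for n_max_seq rows and
// only reallocate when a decode call requests more output rows than currently fit.
struct output_logits {
    std::vector<float> data;
    size_t n_vocab = 0;

    float * reserve(size_t n_outputs) {
        if (data.size() < n_outputs * n_vocab) {
            data.resize(n_outputs * n_vocab);
        }
        return data.data();
    }
};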

@compilade (Collaborator)

"We could pre-allocate buffers for n_max_seq tokens only during initialization, and increase the size dynamically in llama_decode automatically if there is a request for more logits than that."

From what I understand, the logits aren't necessarily contiguous with each other in the output buffer, so pre-allocation and dynamic resizing could be done, but not until the layout is changed so that the logits are always contiguous, with no offset before the first used logit.

@slaren (Collaborator, Author) commented Mar 13, 2024

We could definitely improve the handling of logits. Even perplexity and imatrix only need logits for n_ctx/2 tokens. We could also skip the computation of the output layer for the tokens where logits are not needed. IIRC there was a PR about this a while ago, but it was never merged.

@ggerganov (Owner)

Maybe the std::vector<float> logits; can become a std::vector<std::pair<int, std::vector<float>>>, where the int is the token index in the batch (i.e. i_batch).
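A sketch of that container, for illustration only (the alias name is made up):

```cpp
#include <utility>
#include <vector>

// each entry pairs the token's index within the batch (i_batch) with the
// logits computed for that token (n_vocab floats)
using batch_logits = std::vector<std::pair<int, std::vector<float>>>;
```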

Collaborator Author

@slaren slaren Mar 13, 2024


Yes, I think we should do something like that, and remove llama_get_logits in favor of llama_get_logits_ith. However, logits is no longer a std::vector since it is allocated in a host buffer, which is necessary for pipeline parallelism; otherwise the copy from the GPU can cause a synchronization. It is also important that the logits are contiguous in memory when possible, to reduce the number of copies for applications such as perplexity: there is a significant performance improvement when doing just one cudaMemcpy instead of one for each token (which ends up being n_ctx/2 calls to cudaMemcpy).
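A small sketch of the resulting usage pattern from the application side, assuming the batch was built with logits requested for i_batch; the function name is illustrative:

```cpp
#include "llama.h"

// llama_decode returns without waiting for the backend; reading the logits
// through llama_get_logits_ith forces the synchronization before the data is
// accessed, so no explicit llama_synchronize call is needed here
float logit_of(llama_context * ctx, llama_batch batch, int32_t i_batch, llama_token tok) {
    llama_decode(ctx, batch);
    const float * logits = llama_get_logits_ith(ctx, i_batch); // n_vocab floats for that token
    return logits[tok];
}
```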

@slaren
Collaborator Author

slaren commented Mar 13, 2024

master with 8x GPU would be a bit slower than master with 1x GPU, since there is no parallelism at all but there is the additional overhead of copying data between GPUs. That's also the reason why tg is slower with multiple GPUs: single-token generation cannot be parallelized with -sm layer (tensor-level parallelism is supported with -sm row).

@phymbert
Collaborator

Sorry, but I don't understand where the performance is improved compared to master then. Could you please point out which line of the bench is faster?

@slaren
Collaborator Author

slaren commented Mar 13, 2024

Are you looking at the results of master or old? old refers to the previous pipeline parallelism PR, which suffered from synchronization issues and didn't always produce correct results. Performance is better than master with multiple GPUs in every pp test. For example, 7B F16 pp 2048 improved from 7113 t/s to 22931 t/s with 8 GPUs. Larger batch sizes improve performance further.

@phymbert
Collaborator

Thanks for your reply 👍. Yes, I am looking for a comparison against master. We do not see the figures for 8x GPU. Is tg also increased? By how much?

As I am working on the performance CI pipeline, I want to understand what this PR brings. Thanks.

@slaren
Collaborator Author

slaren commented Mar 13, 2024

Here is a recap of the data from @ggerganov:

| model | test | 1x GPU (t/s) | 2x GPU (t/s) | 4x GPU (t/s) | 8x GPU (t/s) |
| --- | --- | ---: | ---: | ---: | ---: |
| llama 7B F16 | pp 512 | 8237.93 | 10090.00 | 11813.02 | 12100.57 |
| llama 7B F16 | pp 1024 | 7823.11 | 11267.24 | 15387.27 | 17591.46 |
| llama 7B F16 | pp 2048 | 6974.75 | 11404.60 | 17609.80 | 22931.45 |
| llama 7B F16 | pp 4096 | 5594.67 | 10493.05 | 17618.77 | 25504.36 |
| llama 7B F16 | pp 8192 | 4527.03 | 8587.92 | 15100.46 | 23820.97 |
| llama 7B F16 | tg 128 | 73.87 | 72.08 | 71.87 | 71.87 |
| llama 7B Q8_0 | pp 512 | 6728.34 | 7276.31 | 8604.28 | 9266.34 |
| llama 7B Q8_0 | pp 1024 | 7055.80 | 8212.10 | 11357.60 | 13796.83 |
| llama 7B Q8_0 | pp 2048 | 6674.59 | 8474.57 | 13201.46 | 17820.71 |
| llama 7B Q8_0 | pp 4096 | 5496.80 | 8044.92 | 13411.17 | 20105.47 |
| llama 7B Q8_0 | pp 8192 | 4449.37 | 6922.49 | 11882.45 | 18915.61 |
| llama 7B Q8_0 | tg 128 | 114.80 | 111.61 | 110.12 | 108.56 |
| llama 7B Q4_K - Medium | pp 512 | 6125.22 | 6241.75 | 7983.52 | 7908.16 |
| llama 7B Q4_K - Medium | pp 1024 | 6690.14 | 7085.78 | 12007.61 | 11764.20 |
| llama 7B Q4_K - Medium | pp 2048 | 6501.66 | 7372.92 | 15755.47 | 15567.45 |
| llama 7B Q4_K - Medium | pp 4096 | 5427.55 | 7087.04 | 17909.72 | 17668.75 |
| llama 7B Q4_K - Medium | pp 8192 | 4404.22 | 6186.62 | 17152.45 | 16992.44 |
| llama 7B Q4_K - Medium | tg 128 | 129.97 | 126.62 | 124.27 | 122.04 |
| llama 7B Q4_0 | pp 512 | 6109.62 | 6198.75 | 7908.16 | 7908.16 |
| llama 7B Q4_0 | pp 1024 | 6678.93 | 7045.06 | 11764.20 | 11764.20 |
| llama 7B Q4_0 | pp 2048 | 6496.22 | 7345.98 | 15567.45 | 15567.45 |
| llama 7B Q4_0 | pp 4096 | 5418.54 | 7083.79 | 17909.72 | 17668.75 |
| llama 7B Q4_0 | pp 8192 | 4408.39 | 6192.83 | 17152.45 | 16992.44 |
| llama 7B Q4_0 | tg 128 | 145.15 | 140.12 | 137.21 | 135.03 |

With master the performance would be roughly the same as with 1x GPU in all the cases. tg is not improved; this optimization works by splitting large batches into multiple smaller batches and processing them in parallel.

@slaren
Collaborator Author

slaren commented Mar 13, 2024

I noticed the thread sanitizer tests failing, but the errors don't make much sense to me. The errors are intermittent; when the jobs are re-run, the tests eventually pass. I suspect that it is an issue with a particular runner.

It produces errors such as this:
FATAL: ThreadSanitizer: unexpected memory mapping 0x637684c4f000-0x637684c5b000

@ggerganov
Owner

ggerganov commented Mar 13, 2024

@phymbert I've added multi-GPU results for master into the comment - hope this makes it more clear. With this PR we now have efficient pipeline parallelization, which is important for increasing throughput on multi-GPU systems.

@slaren Yes, the thread sanitizer build failures are not related to our code (#5943 (comment))

@slaren slaren merged commit f30ea47 into master Mar 13, 2024
52 of 62 checks passed
@slaren slaren deleted the sl/pipeline-parallelism branch March 13, 2024 17:57
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 15, 2024
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@siavashmohammady66

Does it also support multiple GPUs when using the Vulkan SDK to compile the llama.cpp code? (For AMD GPUs)

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels
high priority Very important issue

6 participants