llama : add pipeline parallelism support #6017

Merged: 23 commits, Mar 13, 2024

Conversation

@slaren (Collaborator) commented Mar 12, 2024

Pipeline parallelism improves batch processing performance when using multiple GPUs.

Changes:

  • Graph inputs are allocated in the graph
    • Allocating inputs in the graph allows ggml_backend_sched to make multiple copies automatically, which is useful to reduce synchronization requirements with pipeline parallelism
    • Allocating inputs on the graph is more efficient since it allows their memory to be reused for other computations
    • Only the inputs actually required by the model are allocated
  • Automatic batch splitting in llama_decode
    • llama_decode automatically splits the batch into multiple smaller batches if it is too big for the configured compute batch size
    • The largest batch size that can be submitted to llama_decode is still limited by n_batch to reduce the size of the logits and embeddings buffers
  • Adds n_ubatch (-ub on the command line) to llama_context_params
    • n_batch sets the size of the logits and embeddings buffer, which limits the maximum batch size passed to llama_decode
    • n_ubatch sets the maximum batch size for computation
    • By default n_batch is 4096, n_ubatch is 512
    • This allows current applications to take advantage of pipeline parallelism by setting a larger n_batch without having to update their logic
  • Makes llama_decode asynchronous
    • Synchronization is done automatically on llama_get_logits and llama_get_embeddings
    • Adds llama_synchronize to force a synchronization manually, which can be useful when measuring the time of llama_decode
    • Applications can also take advantage of pipeline parallelism by making multiple calls to llama_decode without synchronizing in between (see the sketch after this list)
    • Note: llama_timings may not be accurate if the application does not synchronize immediately after calling llama_decode
  • Uses a host buffer for the logits and embeddings outputs (except for pooled embeddings), which improves performance when copying the data from the GPU
  • Multi-threaded ggml_get_rows (still single-threaded when full offload)
  • Adds an event interface to ggml-backend for synchronization between different backends
  • CUDA only; other backends will need to implement the async and event interfaces to take advantage of pipeline parallelism
  • Adds the build parameter LLAMA_SCHED_MAX_COPIES
    • This parameter configures the number of copies of the split inputs in ggml_backend_sched when using pipeline parallelism
    • Increasing this value allows more operations to be queued without requiring synchronization, and may improve performance on some systems at the expense of higher memory usage in the compute buffer
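
For orientation, here is a minimal sketch of how an application might use these changes. It is illustrative only (model path, prompt handling and error handling are placeholders), not code from this PR:

#include "llama.h"
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // full offload; pipeline parallelism requires the whole model on the GPUs
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 4096; // logical batch: sizes the logits/embeddings buffers
    cparams.n_ubatch = 512;  // physical batch: size of each split that is actually computed
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // Placeholder tokens; a real application would fill this with llama_tokenize.
    std::vector<llama_token> tokens(1024, 0);

    // llama_decode splits the batch into n_ubatch-sized pieces and pipelines them
    // across the available GPUs. The call returns without waiting for the GPU work.
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0);
    if (llama_decode(ctx, batch) != 0) {
        return 1;
    }

    llama_synchronize(ctx);                       // explicit sync, e.g. to time llama_decode
    const float * logits = llama_get_logits(ctx); // reading results also synchronizes
    (void) logits;

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}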

@phymbert (Collaborator)

Do you mind also adding --ubatch-size in server.cpp, please?

@ggerganov added the high priority label (Very important issue) on Mar 12, 2024
@ggerganov (Owner) commented Mar 12, 2024

Posting some results on 8x A100:

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./llama-bench \
    -m models/codellama-7b/ggml-model-f16.gguf \
    -m models/codellama-7b/ggml-model-q8_0.gguf \
    -m models/codellama-7b/ggml-model-q4_k.gguf \
    -m models/codellama-7b/ggml-model-q4_0.gguf \
    -ngl 99 -p 512,1024,2048,4096,8192 -b 8192

master (1x GPU):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8321.01 ± 78.06
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7832.04 ± 51.18
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7284.35 ± 21.52
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6276.87 ± 9.24
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4835.90 ± 4.67
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 74.12 ± 0.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6770.95 ± 86.88
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7065.25 ± 41.31
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6623.07 ± 20.11
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5780.57 ± 21.80
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4540.37 ± 5.68
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 115.28 ± 0.84
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6162.48 ± 75.34
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6709.16 ± 48.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6304.09 ± 17.74
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5529.16 ± 14.36
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4382.12 ± 6.16
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 131.24 ± 0.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6139.05 ± 52.82
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6694.85 ± 29.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6279.48 ± 30.08
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5519.16 ± 13.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4373.91 ± 3.77
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 145.86 ± 1.44

build: d8fd0cc (2412)

master (2x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8437.93 ± 56.78
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7935.64 ± 41.99
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7383.09 ± 12.74
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6334.79 ± 16.70
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4875.20 ± 8.59
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 73.96 ± 0.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6907.67 ± 21.79
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7187.15 ± 34.47
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6742.78 ± 13.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5855.42 ± 12.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4580.27 ± 5.76
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 115.18 ± 0.86
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6314.50 ± 35.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6851.88 ± 27.60
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6430.31 ± 8.31
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5627.43 ± 16.12
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4435.21 ± 9.20
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 130.87 ± 1.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6254.15 ± 48.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6801.81 ± 46.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6399.32 ± 8.18
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5603.93 ± 9.26
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4419.86 ± 5.96
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 145.50 ± 1.17

build: d8fd0cc (2412)

master (4x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 8196.19 ± 61.77
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7799.21 ± 17.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 7279.24 ± 6.64
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 6267.52 ± 10.16
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4826.63 ± 4.70
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 73.24 ± 0.07
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6792.78 ± 9.64
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 7122.67 ± 10.67
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6687.84 ± 6.44
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5812.81 ± 7.81
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4549.54 ± 5.62
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 112.97 ± 0.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 6235.15 ± 16.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6790.49 ± 19.86
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6398.79 ± 8.87
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5596.10 ± 6.54
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4417.32 ± 3.00
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 127.61 ± 0.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 6165.95 ± 12.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6742.28 ± 12.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6338.36 ± 17.06
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5561.52 ± 8.61
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4394.38 ± 1.90
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 142.00 ± 0.98

build: d8fd0cc (2412)

master (8x GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 512 7705.18 ± 83.77
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 1024 7372.18 ± 18.39
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 2048 6902.99 ± 13.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 4096 5966.81 ± 5.21
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 pp 8192 4627.64 ± 6.14
llama 7B F16 12.55 GiB 6.74 B CUDA 99 1024 tg 128 71.74 ± 0.15
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 512 6472.16 ± 7.31
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 1024 6782.80 ± 5.23
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 2048 6374.71 ± 7.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 4096 5565.41 ± 1.94
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 pp 8192 4375.93 ± 4.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 1024 tg 128 109.13 ± 0.77
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 512 5958.00 ± 4.99
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 1024 6488.75 ± 5.99
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 2048 6114.23 ± 7.09
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 4096 5370.20 ± 3.53
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 pp 8192 4255.44 ± 2.08
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 1024 tg 128 123.10 ± 1.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 512 5885.62 ± 9.11
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 1024 6419.71 ± 8.39
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 2048 6052.19 ± 9.94
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 4096 5322.86 ± 3.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 pp 8192 4227.29 ± 2.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 tg 128 136.12 ± 1.33

build: d8fd0cc (2412)

old (sl/micro-batching, 8x GPUs) with ub=256:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 512 12287.35 ± 404.42
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 1024 18428.65 ± 238.85
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 2048 24279.31 ± 225.94
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 4096 27046.41 ± 54.81
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 pp 8192 24665.13 ± 61.06
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 tg 128 71.59 ± 1.20
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 512 9222.69 ± 4.70
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 1024 13907.00 ± 3.19
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 2048 17866.35 ± 13.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 4096 19697.05 ± 14.58
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 pp 8192 18544.07 ± 6.93
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 tg 128 89.72 ± 2.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 512 8570.40 ± 4.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 1024 12955.56 ± 9.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 2048 16745.54 ± 4.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 4096 18631.20 ± 2.73
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 pp 8192 17652.17 ± 2.92
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 tg 128 128.17 ± 4.96

build: af789e7 (1861)

new (x1 GPU, LLAMA_SCHED_MAX_COPIES=2):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 512 8237.93 ± 21.75
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 1024 7823.11 ± 17.30
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6974.75 ± 9.76
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5594.67 ± 4.54
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4527.03 ± 8.31
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 4096 tg 128 73.87 ± 0.16
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 512 6728.34 ± 16.10
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 1024 7055.80 ± 8.37
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6674.59 ± 9.64
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5496.80 ± 7.27
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4449.37 ± 4.71
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 4096 tg 128 114.80 ± 0.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 512 6125.22 ± 13.83
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 1024 6690.14 ± 19.46
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6501.66 ± 6.32
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5427.55 ± 8.96
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4404.22 ± 3.19
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 4096 tg 128 129.97 ± 0.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 512 6109.62 ± 5.88
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 1024 6678.93 ± 21.22
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 2048 6496.22 ± 14.22
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 4096 5418.54 ± 4.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 pp 8192 4408.39 ± 3.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 4096 tg 128 145.15 ± 0.50

build: 54cdd47 (2424)

new (x2 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 10090.00 ± 25.47
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 11267.24 ± 20.31
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 11404.60 ± 14.40
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 10493.05 ± 10.40
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 8587.92 ± 4.60
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 72.08 ± 0.21
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 7276.31 ± 13.11
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 8212.10 ± 7.13
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 8474.57 ± 12.77
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 8044.92 ± 16.42
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 6922.49 ± 3.52
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 111.61 ± 0.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 6241.75 ± 10.47
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 7085.78 ± 15.74
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 7372.92 ± 5.08
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 7087.04 ± 3.72
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 6186.62 ± 2.76
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 126.62 ± 0.41
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 6198.75 ± 6.54
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 7045.06 ± 7.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 7345.98 ± 3.19
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 7083.79 ± 3.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 6192.83 ± 2.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 140.12 ± 0.65

build: 54cdd47 (2424)

new (x4 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 11813.02 ± 22.83
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 15387.27 ± 33.61
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 17609.80 ± 41.84
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 17618.77 ± 18.29
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 15100.46 ± 21.89
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 71.87 ± 0.22
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 8604.28 ± 4.99
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 11357.60 ± 4.19
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 13201.46 ± 18.92
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 13411.17 ± 21.39
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 11882.45 ± 10.51
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 110.12 ± 0.59
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 7401.93 ± 6.81
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 9828.07 ± 11.61
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 11516.67 ± 20.88
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 11881.12 ± 20.03
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 10722.42 ± 24.25
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 124.27 ± 0.74
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 7293.93 ± 11.64
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 9681.32 ± 20.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 11390.60 ± 12.97
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 11770.08 ± 14.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 10639.98 ± 10.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 137.21 ± 0.81

build: 54cdd47 (2424)

new (x8 GPUs):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 512 12100.57 ± 193.33
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 1024 17591.46 ± 132.00
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 2048 22931.45 ± 144.20
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 4096 25504.36 ± 434.87
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 pp 8192 23820.97 ± 188.92
llama 7B F16 12.55 GiB 6.74 B CUDA 99 8192 256 tg 128 71.87 ± 0.02
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 512 9266.34 ± 59.90
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 1024 13796.83 ± 29.05
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 2048 17820.71 ± 327.33
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 4096 20105.47 ± 38.39
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 pp 8192 18915.61 ± 58.18
llama 7B Q8_0 6.67 GiB 6.74 B CUDA 99 8192 256 tg 128 108.56 ± 0.90
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 512 7983.52 ± 84.85
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 1024 12007.61 ± 19.79
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 2048 15755.47 ± 74.45
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 4096 17909.72 ± 28.69
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 pp 8192 17152.45 ± 22.33
llama 7B Q4_K - Medium 3.80 GiB 6.74 B CUDA 99 8192 256 tg 128 122.04 ± 0.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 512 7908.16 ± 5.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 1024 11764.20 ± 142.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 2048 15567.45 ± 33.99
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 4096 17668.75 ± 59.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 pp 8192 16992.44 ± 27.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 8192 256 tg 128 135.03 ± 1.31

build: 54cdd47 (2424)

ppl, -c 512 -b 2048, -ub 256 - runtime: 30.7s

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-7b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw 

llm_load_tensors: ggml ctx size = 1.00 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 250.12 MiB
llm_load_tensors: CUDA0 buffer size = 1930.16 MiB
llm_load_tensors: CUDA1 buffer size = 1544.12 MiB
llm_load_tensors: CUDA2 buffer size = 1544.12 MiB
llm_load_tensors: CUDA3 buffer size = 1544.12 MiB
llm_load_tensors: CUDA4 buffer size = 1544.12 MiB
llm_load_tensors: CUDA5 buffer size = 1544.12 MiB
llm_load_tensors: CUDA6 buffer size = 1544.12 MiB
llm_load_tensors: CUDA7 buffer size = 1408.23 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 96.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.02 MiB
llama_new_context_with_model: graph splits: 9

system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 922.755 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.15 seconds per pass - ETA 0.38 minutes
[1]5.5511,[2]6.0573,[3]6.7593,[4]7.7007,[5]7.8618,[6]7.6842,[7]7.8791,[8]7.8360,[9]8.2430,[10]8.5690,[11]8.8467,[12]8.9639,[13]8.8943,[14]9.0010,[15]9.3134,[16]8.8007,[17]8.6196,[18]8.5497,[19]8.0678,[20]8.0193,[21]7.9075,[22]7.7105,[23]7.6356,[24]7.5056,[25]7.4911,[26]7.2615,[27]6.9952,[28]6.8504,[29]6.7198,[30]6.5173,[31]6.4668,[32]6.5103,[33]6.4616,[34]6.4974,[35]6.5202,[36]6.5535,[37]6.5503,[38]6.5514,[39]6.5706,[40]6.6289,[41]6.6420,[42]6.6959,[43]6.6409,[44]6.6959,[45]6.7062,[46]6.6758,[47]6.7094,[48]6.6758,[49]6.6698,[50]6.6089,[51]6.6233,[52]6.6110,[53]6.6717,[54]6.6556,[55]6.6299,[56]6.6832,[57]6.7115,[58]6.7425,[59]6.7600,[60]6.8152,[61]6.8079,[62]6.8741,[63]6.9046,[64]6.8991,[65]6.9468,[66]6.9558,[67]6.9631,[68]6.9864,[69]7.0091,[70]7.0463,[71]7.0751,[72]7.1178,[73]7.1782,[74]7.1854,[75]7.1957,[76]7.2122,[77]7.2331,[78]7.2206,[79]7.2511,[80]7.2403,[81]7.2675,[82]7.2900,[83]7.2264,[84]7.2237,[85]7.2272,[86]7.2067,[87]7.1662,[88]7.1429,[89]7.1244,[90]7.1093,[91]7.1429,[92]7.1385,[93]7.1324,[94]7.1418,[95]7.1824,[96]7.1791,[97]7.1767,[98]7.1758,[99]7.1554,[100]7.1581,[101]7.1865,[102]7.1820,[103]7.2073,[104]7.2144,[105]7.2079,[106]7.2312,[107]7.2356,[108]7.2507,[109]7.2459,[110]7.2401,[111]7.2658,[112]7.2937,[113]7.2966,[114]7.2962,[115]7.3049,[116]7.2989,[117]7.3076,[118]7.3354,[119]7.3648,[120]7.4016,[121]7.4241,[122]7.4536,[123]7.4954,[124]7.5155,[125]7.4993,[126]7.5349,[127]7.5734,[128]7.6024,[129]7.5849,[130]7.5883,[131]7.5831,[132]7.5721,[133]7.5550,[134]7.5628,[135]7.5569,[136]7.5508,[137]7.5394,[138]7.5140,[139]7.5074,[140]7.5004,[141]7.4701,[142]7.4698,[143]7.4386,[144]7.4145,[145]7.4026,[146]7.3901,[147]7.3926,[148]7.3947,[149]7.3940,[150]7.3950,[151]7.3984,[152]7.3873,[153]7.3681,[154]7.3583,[155]7.3625,[156]7.3602,[157]7.3794,[158]7.3808,[159]7.3843,[160]7.3906,[161]7.4065,[162]7.3679,[163]7.3529,[164]7.3240,[165]7.2897,[166]7.2563,[167]7.2055,[168]7.1706,[169]7.1610,[170]7.1486,[171]7.1179,[172]7.0995,[173]7.0834,[174]7.0533,[175]7.0260,[176]7.0116,[177]6.9880,[178]6.9643,[179]6.9449,[180]6.9396,[181]6.9200,[182]6.8981,[183]6.8821,[184]6.8821,[185]6.8762,[186]6.8799,[187]6.8900,[188]6.8914,[189]6.9165,[190]6.9204,[191]6.9462,[192]6.9646,[193]6.9863,[194]7.0012,[195]7.0284,[196]7.0463,[197]7.0716,[198]7.0863,[199]7.0885,[200]7.0948,[201]7.0876,[202]7.1108,[203]7.1237,[204]7.1431,[205]7.1612,[206]7.1705,[207]7.1669,[208]7.1796,[209]7.1845,[210]7.1894,[211]7.2037,[212]7.2112,[213]7.2223,[214]7.2253,[215]7.2263,[216]7.2405,[217]7.2590,[218]7.2737,[219]7.2695,[220]7.2644,[221]7.2528,[222]7.2511,[223]7.2399,[224]7.2308,[225]7.2227,[226]7.2459,[227]7.2548,[228]7.2605,[229]7.2638,[230]7.2600,[231]7.2770,[232]7.2661,[233]7.2424,[234]7.2223,[235]7.1999,[236]7.1899,[237]7.1746,[238]7.1772,[239]7.1613,[240]7.1496,[241]7.1493,[242]7.1494,[243]7.1434,[244]7.1286,[245]7.1240,[246]7.1083,[247]7.0942,[248]7.0844,[249]7.0797,[250]7.0844,[251]7.0735,[252]7.0683,[253]7.0559,[254]7.0456,[255]7.0301,[256]7.0075,[257]6.9929,[258]6.9799,[259]6.9761,[260]6.9634,[261]6.9565,[262]6.9477,[263]6.9374,[264]6.9211,[265]6.9218,[266]6.9153,[267]6.9052,[268]6.9166,[269]6.9204,[270]6.9187,[271]6.9293,[272]6.9360,[273]6.9367,[274]6.9365,[275]6.9453,[276]6.9501,[277]6.9676,[278]6.9776,[279]6.9893,[280]6.9940,[281]7.0085,[282]7.0149,[283]7.0307,[284]7.0399,[285]7.0482,[286]7.0631,[287]7.0621,[288]7.0664,[289]7.0547,[290]7.0401,[291]7.0271,[292]7.0130,[293]6.9996,[294]7.0019,[295]7.0022,[296]7.0082,[297]7.0083,[298]7.0143,[299]7.0128,[300]7.0034,[301]7.0033,[302]6.9977,[303]6.9883,[304]6.9792,[305]6.9776,[30
6]6.9637,[307]6.9657,[308]6.9651,[309]6.9480,[310]6.9422,[311]6.9368,[312]6.9407,[313]6.9375,[314]6.9383,[315]6.9206,[316]6.9212,[317]6.8998,[318]6.8745,[319]6.8893,[320]6.9036,[321]6.9058,[322]6.8981,[323]6.8981,[324]6.9013,[325]6.9164,[326]6.9179,[327]6.9223,[328]6.9251,[329]6.9330,[330]6.9426,[331]6.9580,[332]6.9545,[333]6.9661,[334]6.9597,[335]6.9514,[336]6.9532,[337]6.9498,[338]6.9510,[339]6.9467,[340]6.9415,[341]6.9477,[342]6.9479,[343]6.9532,[344]6.9538,[345]6.9535,[346]6.9495,[347]6.9521,[348]6.9556,[349]6.9591,[350]6.9571,[351]6.9569,[352]6.9573,[353]6.9502,[354]6.9500,[355]6.9559,[356]6.9605,[357]6.9549,[358]6.9657,[359]6.9688,[360]6.9646,[361]6.9628,[362]6.9721,[363]6.9853,[364]6.9924,[365]6.9977,[366]6.9991,[367]7.0063,[368]7.0008,[369]7.0019,[370]7.0035,[371]6.9969,[372]7.0010,[373]7.0051,[374]7.0013,[375]6.9996,[376]7.0100,[377]7.0028,[378]7.0065,[379]7.0141,[380]7.0042,[381]7.0010,[382]6.9944,[383]6.9944,[384]6.9932,[385]6.9938,[386]6.9941,[387]6.9961,[388]6.9923,[389]6.9873,[390]6.9804,[391]6.9732,[392]6.9752,[393]6.9797,[394]6.9863,[395]6.9843,[396]6.9754,[397]6.9850,[398]6.9921,[399]7.0012,[400]7.0003,[401]7.0017,[402]7.0040,[403]7.0057,[404]7.0127,[405]7.0102,[406]7.0073,[407]7.0125,[408]7.0133,[409]7.0256,[410]7.0396,[411]7.0541,[412]7.0736,[413]7.0866,[414]7.0969,[415]7.1036,[416]7.1153,[417]7.1269,[418]7.1320,[419]7.1385,[420]7.1509,[421]7.1644,[422]7.1703,[423]7.1794,[424]7.1918,[425]7.2047,[426]7.2135,[427]7.2187,[428]7.2303,[429]7.2368,[430]7.2494,[431]7.2648,[432]7.2681,[433]7.2650,[434]7.2590,[435]7.2599,[436]7.2629,[437]7.2752,[438]7.2845,[439]7.2789,[440]7.2778,[441]7.2726,[442]7.2706,[443]7.2708,[444]7.2715,[445]7.2686,[446]7.2708,[447]7.2739,[448]7.2799,[449]7.2768,[450]7.2767,[451]7.2715,[452]7.2706,[453]7.2622,[454]7.2568,[455]7.2581,[456]7.2614,[457]7.2639,[458]7.2631,[459]7.2628,[460]7.2732,[461]7.2714,[462]7.2717,[463]7.2781,[464]7.2772,[465]7.2734,[466]7.2652,[467]7.2690,[468]7.2719,[469]7.2758,[470]7.2767,[471]7.2726,[472]7.2797,[473]7.2732,[474]7.2786,[475]7.2801,[476]7.2828,[477]7.2762,[478]7.2753,[479]7.2892,[480]7.2962,[481]7.2997,[482]7.2952,[483]7.2913,[484]7.2962,[485]7.2961,[486]7.2903,[487]7.2945,[488]7.2946,[489]7.2898,[490]7.2903,[491]7.2909,[492]7.2878,[493]7.2842,[494]7.2825,[495]7.2852,[496]7.2831,[497]7.2814,[498]7.2821,[499]7.2758,[500]7.2668,[501]7.2613,[502]7.2659,[503]7.2653,[504]7.2569,[505]7.2604,[506]7.2614,[507]7.2561,[508]7.2501,[509]7.2494,[510]7.2528,[511]7.2609,[512]7.2636,[513]7.2651,[514]7.2720,[515]7.2664,[516]7.2648,[517]7.2661,[518]7.2668,[519]7.2710,[520]7.2733,[521]7.2759,[522]7.2794,[523]7.2805,[524]7.2872,[525]7.2914,[526]7.2918,[527]7.2945,[528]7.2889,[529]7.2914,[530]7.2835,[531]7.2816,[532]7.2885,[533]7.2916,[534]7.2892,[535]7.2931,[536]7.2878,[537]7.2852,[538]7.2916,[539]7.2940,[540]7.2978,[541]7.3027,[542]7.3019,[543]7.3033,[544]7.3033,[545]7.3012,[546]7.3024,[547]7.2970,[548]7.2887,[549]7.2891,[550]7.2859,[551]7.2822,[552]7.2799,[553]7.2758,[554]7.2727,[555]7.2676,[556]7.2671,[557]7.2735,[558]7.2704,[559]7.2692,[560]7.2676,[561]7.2680,[562]7.2629,[563]7.2649,[564]7.2701,[565]7.2723,[566]7.2716,[567]7.2699,[568]7.2691,[569]7.2664,[570]7.2692,[571]7.2697,[572]7.2696,[573]7.2699,[574]7.2657,[575]7.2640,[576]7.2626,[577]7.2598,[578]7.2574,[579]7.2565,[580]7.2485,[581]7.2447,[582]7.2445,[583]7.2444,[584]7.2439,[585]7.2345,[586]7.2259,[587]7.2256,[588]7.2309,[589]7.2378,[590]7.2396,[591]7.2412,[592]7.2396,[593]7.2352,[594]7.2354,[595]7.2323,[596]7.2361,[597]7.2316,[598]7.2292,[599]7.2312,[600]7.2305,[601]7.2294,[602]7
.2348,[603]7.2367,[604]7.2393,[605]7.2428,[606]7.2447,[607]7.2455,[608]7.2413,[609]7.2415,[610]7.2460,[611]7.2428,[612]7.2441,[613]7.2384,[614]7.2325,[615]7.2219,[616]7.2245,[617]7.2152,[618]7.2070,[619]7.1987,[620]7.1805,[621]7.1720,[622]7.1704,[623]7.1718,[624]7.1710,[625]7.1699,[626]7.1686,[627]7.1717,[628]7.1713,[629]7.1710,[630]7.1741,[631]7.1794,[632]7.1853,[633]7.1833,[634]7.1873,[635]7.1882,[636]7.1848,[637]7.1813,[638]7.1843,[639]7.1801,[640]7.1824,[641]7.1821,[642]7.1901,[643]7.1924,[644]7.1930,[645]7.1911,[646]7.1956,[647]7.1954,[648]7.1974,[649]7.1978,[650]7.2026,[651]7.2079,[652]7.2096,[653]7.2143,[654]7.2065,[655]7.2061,
Final estimate: PPL = 7.2061 +/- 0.04271

llama_print_timings: load time = 1602.72 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 15310.80 ms / 335360 tokens ( 0.05 ms per token, 21903.50 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 25860.90 ms / 335361 tokens

real 0m30.712s
user 1m12.300s
sys 0m25.268s

master (1x GPU, 13B):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch test t/s
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 512 5111.70 ± 47.07
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 1024 4922.35 ± 35.87
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 pp 2048 4413.17 ± 10.56
llama 13B F16 24.25 GiB 13.02 B CUDA 99 4096 tg 128 45.28 ± 0.05
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 512 4036.98 ± 25.68
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 1024 4350.98 ± 14.85
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 pp 2048 4162.22 ± 9.86
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 4096 tg 128 68.79 ± 0.04
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 512 3615.92 ± 20.41
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 1024 4062.81 ± 27.69
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 pp 2048 4006.28 ± 11.71
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 4096 tg 128 83.52 ± 0.05
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 512 3608.53 ± 20.18
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 1024 4059.13 ± 20.43
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 pp 2048 4004.22 ± 12.23
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 4096 tg 128 95.35 ± 0.08

build: 99b71c0 (2410)

new (x8 GPUs, 13B):

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model size params backend ngl n_batch n_ubatch test t/s
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 512 8142.20 ± 3.99
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 1024 11977.25 ± 18.75
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 2048 15377.04 ± 19.31
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 4096 16842.23 ± 240.43
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 pp 8192 15316.75 ± 11.69
llama 13B F16 24.25 GiB 13.02 B CUDA 99 8192 256 tg 128 43.96 ± 0.12
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 512 5605.35 ± 3.46
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 1024 8428.62 ± 0.80
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 2048 10686.25 ± 14.22
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 4096 11824.00 ± 20.43
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 pp 8192 11228.18 ± 12.87
llama 13B Q8_0 12.88 GiB 13.02 B CUDA 99 8192 256 tg 128 66.32 ± 0.38
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 512 4775.66 ± 3.41
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 1024 7225.87 ± 4.09
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 2048 9251.36 ± 4.16
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 4096 10346.24 ± 19.27
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 pp 8192 10018.40 ± 6.89
llama 13B Q4_K - Medium 7.33 GiB 13.02 B CUDA 99 8192 256 tg 128 80.95 ± 0.48
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 512 4731.63 ± 2.33
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 1024 7174.17 ± 4.75
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 2048 9167.04 ± 2.94
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 4096 10255.81 ± 20.46
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 pp 8192 9943.68 ± 11.66
llama 13B Q4_0 6.86 GiB 13.02 B CUDA 99 8192 256 tg 128 90.05 ± 0.70

build: 54cdd47 (2424)

ppl (13B), -c 512 -b 2048, -ub 256 - runtime: 39.3s

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-13b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw 

llm_load_tensors: ggml ctx size = 1.25 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 312.66 MiB
llm_load_tensors: CUDA0 buffer size = 3630.23 MiB
llm_load_tensors: CUDA1 buffer size = 3025.20 MiB
llm_load_tensors: CUDA2 buffer size = 3025.20 MiB
llm_load_tensors: CUDA3 buffer size = 3025.20 MiB
llm_load_tensors: CUDA4 buffer size = 3025.20 MiB
llm_load_tensors: CUDA5 buffer size = 3025.20 MiB
llm_load_tensors: CUDA6 buffer size = 3025.20 MiB
llm_load_tensors: CUDA7 buffer size = 2732.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 240.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 21.02 MiB
llama_new_context_with_model: graph splits: 9

system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 969.675 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.21 seconds per pass - ETA 0.57 minutes
[1]4.7515,[2]5.5743,[3]6.2923,[4]7.1020,[5]7.3440,[6]7.1685,[7]7.3431,[8]7.3339,[9]7.6557,[10]7.9098,[11]8.1824,[12]8.2266,[13]8.1402,[14]8.3017,[15]8.6143,[16]8.1901,[17]8.0203,[18]7.9660,[19]7.5368,[20]7.5153,[21]7.3838,[22]7.1791,[23]7.1297,[24]7.0125,[25]7.0057,[26]6.7937,[27]6.5418,[28]6.3808,[29]6.2788,[30]6.0670,[31]6.0166,[32]6.0573,[33]6.0083,[34]6.0517,[35]6.0759,[36]6.1186,[37]6.0951,[38]6.0937,[39]6.1068,[40]6.1666,[41]6.1816,[42]6.2251,[43]6.1683,[44]6.2104,[45]6.2241,[46]6.1909,[47]6.2133,[48]6.1786,[49]6.1735,[50]6.1243,[51]6.1308,[52]6.1158,[53]6.1663,[54]6.1518,[55]6.1288,[56]6.1687,[57]6.1866,[58]6.2100,[59]6.2220,[60]6.2663,[61]6.2574,[62]6.3202,[63]6.3439,[64]6.3468,[65]6.3853,[66]6.3889,[67]6.3957,[68]6.4132,[69]6.4349,[70]6.4674,[71]6.4919,[72]6.5292,[73]6.5872,[74]6.5989,[75]6.6017,[76]6.6165,[77]6.6342,[78]6.6272,[79]6.6555,[80]6.6523,[81]6.6707,[82]6.6828,[83]6.6300,[84]6.6309,[85]6.6321,[86]6.6132,[87]6.5768,[88]6.5584,[89]6.5367,[90]6.5242,[91]6.5494,[92]6.5431,[93]6.5359,[94]6.5396,[95]6.5709,[96]6.5721,[97]6.5615,[98]6.5512,[99]6.5314,[100]6.5349,[101]6.5650,[102]6.5594,[103]6.5811,[104]6.5868,[105]6.5803,[106]6.6011,[107]6.6042,[108]6.6121,[109]6.6110,[110]6.6045,[111]6.6265,[112]6.6494,[113]6.6511,[114]6.6513,[115]6.6577,[116]6.6475,[117]6.6507,[118]6.6772,[119]6.7020,[120]6.7391,[121]6.7564,[122]6.7807,[123]6.8201,[124]6.8343,[125]6.8196,[126]6.8549,[127]6.8890,[128]6.9170,[129]6.8986,[130]6.9033,[131]6.8992,[132]6.8917,[133]6.8754,[134]6.8875,[135]6.8839,[136]6.8761,[137]6.8654,[138]6.8416,[139]6.8363,[140]6.8337,[141]6.8074,[142]6.8039,[143]6.7750,[144]6.7536,[145]6.7413,[146]6.7318,[147]6.7340,[148]6.7389,[149]6.7376,[150]6.7390,[151]6.7435,[152]6.7361,[153]6.7216,[154]6.7140,[155]6.7182,[156]6.7190,[157]6.7372,[158]6.7420,[159]6.7459,[160]6.7538,[161]6.7710,[162]6.7381,[163]6.7260,[164]6.7003,[165]6.6716,[166]6.6415,[167]6.5974,[168]6.5694,[169]6.5591,[170]6.5475,[171]6.5199,[172]6.5045,[173]6.4907,[174]6.4609,[175]6.4382,[176]6.4240,[177]6.4023,[178]6.3802,[179]6.3635,[180]6.3575,[181]6.3416,[182]6.3212,[183]6.3065,[184]6.3064,[185]6.3032,[186]6.3078,[187]6.3188,[188]6.3199,[189]6.3446,[190]6.3471,[191]6.3713,[192]6.3893,[193]6.4072,[194]6.4224,[195]6.4466,[196]6.4630,[197]6.4856,[198]6.5021,[199]6.5066,[200]6.5136,[201]6.5077,[202]6.5295,[203]6.5412,[204]6.5503,[205]6.5645,[206]6.5725,[207]6.5681,[208]6.5817,[209]6.5853,[210]6.5894,[211]6.6043,[212]6.6106,[213]6.6198,[214]6.6245,[215]6.6243,[216]6.6345,[217]6.6509,[218]6.6648,[219]6.6618,[220]6.6608,[221]6.6477,[222]6.6465,[223]6.6350,[224]6.6276,[225]6.6201,[226]6.6419,[227]6.6499,[228]6.6569,[229]6.6612,[230]6.6582,[231]6.6721,[232]6.6608,[233]6.6399,[234]6.6229,[235]6.5969,[236]6.5903,[237]6.5777,[238]6.5783,[239]6.5635,[240]6.5497,[241]6.5486,[242]6.5490,[243]6.5424,[244]6.5292,[245]6.5252,[246]6.5119,[247]6.4984,[248]6.4879,[249]6.4835,[250]6.4865,[251]6.4765,[252]6.4718,[253]6.4596,[254]6.4509,[255]6.4359,[256]6.4159,[257]6.4019,[258]6.3910,[259]6.3876,[260]6.3756,[261]6.3697,[262]6.3622,[263]6.3531,[264]6.3359,[265]6.3355,[266]6.3300,[267]6.3235,[268]6.3349,[269]6.3385,[270]6.3401,[271]6.3501,[272]6.3558,[273]6.3565,[274]6.3549,[275]6.3623,[276]6.3679,[277]6.3819,[278]6.3923,[279]6.4024,[280]6.4065,[281]6.4176,[282]6.4230,[283]6.4373,[284]6.4463,[285]6.4552,[286]6.4717,[287]6.4694,[288]6.4729,[289]6.4627,[290]6.4507,[291]6.4391,[292]6.4267,[293]6.4133,[294]6.4148,[295]6.4170,[296]6.4237,[297]6.4246,[298]6.4292,[299]6.4272,[300]6.4185,[301]6.4203,[302]6.4145,[303]6.4065,[304]6.3975,[305]6.3954,[30
6]6.3836,[307]6.3860,[308]6.3862,[309]6.3712,[310]6.3669,[311]6.3623,[312]6.3648,[313]6.3619,[314]6.3628,[315]6.3458,[316]6.3434,[317]6.3241,[318]6.3019,[319]6.3154,[320]6.3273,[321]6.3286,[322]6.3214,[323]6.3207,[324]6.3248,[325]6.3398,[326]6.3410,[327]6.3440,[328]6.3467,[329]6.3532,[330]6.3590,[331]6.3731,[332]6.3688,[333]6.3790,[334]6.3732,[335]6.3674,[336]6.3687,[337]6.3651,[338]6.3659,[339]6.3609,[340]6.3549,[341]6.3606,[342]6.3624,[343]6.3670,[344]6.3670,[345]6.3674,[346]6.3647,[347]6.3676,[348]6.3702,[349]6.3736,[350]6.3720,[351]6.3717,[352]6.3708,[353]6.3638,[354]6.3618,[355]6.3680,[356]6.3736,[357]6.3680,[358]6.3786,[359]6.3817,[360]6.3768,[361]6.3760,[362]6.3854,[363]6.3971,[364]6.4042,[365]6.4087,[366]6.4100,[367]6.4187,[368]6.4150,[369]6.4159,[370]6.4180,[371]6.4115,[372]6.4165,[373]6.4203,[374]6.4182,[375]6.4181,[376]6.4260,[377]6.4201,[378]6.4225,[379]6.4266,[380]6.4187,[381]6.4174,[382]6.4118,[383]6.4120,[384]6.4119,[385]6.4122,[386]6.4108,[387]6.4122,[388]6.4087,[389]6.4035,[390]6.3968,[391]6.3893,[392]6.3888,[393]6.3913,[394]6.3975,[395]6.3965,[396]6.3880,[397]6.3968,[398]6.4013,[399]6.4093,[400]6.4078,[401]6.4078,[402]6.4109,[403]6.4134,[404]6.4204,[405]6.4161,[406]6.4146,[407]6.4190,[408]6.4209,[409]6.4331,[410]6.4453,[411]6.4581,[412]6.4769,[413]6.4899,[414]6.4994,[415]6.5063,[416]6.5153,[417]6.5260,[418]6.5301,[419]6.5360,[420]6.5462,[421]6.5586,[422]6.5638,[423]6.5708,[424]6.5822,[425]6.5930,[426]6.6008,[427]6.6052,[428]6.6157,[429]6.6220,[430]6.6324,[431]6.6473,[432]6.6511,[433]6.6487,[434]6.6432,[435]6.6449,[436]6.6479,[437]6.6584,[438]6.6675,[439]6.6628,[440]6.6633,[441]6.6583,[442]6.6560,[443]6.6565,[444]6.6572,[445]6.6552,[446]6.6574,[447]6.6596,[448]6.6651,[449]6.6622,[450]6.6619,[451]6.6574,[452]6.6568,[453]6.6497,[454]6.6455,[455]6.6463,[456]6.6501,[457]6.6530,[458]6.6518,[459]6.6514,[460]6.6611,[461]6.6594,[462]6.6604,[463]6.6651,[464]6.6640,[465]6.6612,[466]6.6541,[467]6.6583,[468]6.6617,[469]6.6656,[470]6.6659,[471]6.6624,[472]6.6693,[473]6.6634,[474]6.6679,[475]6.6683,[476]6.6705,[477]6.6653,[478]6.6659,[479]6.6783,[480]6.6844,[481]6.6876,[482]6.6837,[483]6.6808,[484]6.6849,[485]6.6842,[486]6.6791,[487]6.6824,[488]6.6813,[489]6.6775,[490]6.6781,[491]6.6778,[492]6.6753,[493]6.6720,[494]6.6706,[495]6.6722,[496]6.6699,[497]6.6684,[498]6.6700,[499]6.6648,[500]6.6557,[501]6.6508,[502]6.6547,[503]6.6548,[504]6.6469,[505]6.6490,[506]6.6502,[507]6.6438,[508]6.6375,[509]6.6372,[510]6.6389,[511]6.6451,[512]6.6482,[513]6.6498,[514]6.6552,[515]6.6503,[516]6.6496,[517]6.6511,[518]6.6508,[519]6.6542,[520]6.6565,[521]6.6581,[522]6.6613,[523]6.6632,[524]6.6701,[525]6.6747,[526]6.6761,[527]6.6787,[528]6.6736,[529]6.6750,[530]6.6685,[531]6.6665,[532]6.6725,[533]6.6751,[534]6.6736,[535]6.6767,[536]6.6723,[537]6.6705,[538]6.6757,[539]6.6772,[540]6.6797,[541]6.6831,[542]6.6831,[543]6.6849,[544]6.6856,[545]6.6836,[546]6.6846,[547]6.6793,[548]6.6715,[549]6.6718,[550]6.6695,[551]6.6658,[552]6.6636,[553]6.6601,[554]6.6579,[555]6.6536,[556]6.6533,[557]6.6595,[558]6.6565,[559]6.6563,[560]6.6539,[561]6.6543,[562]6.6497,[563]6.6502,[564]6.6555,[565]6.6582,[566]6.6584,[567]6.6563,[568]6.6563,[569]6.6540,[570]6.6560,[571]6.6559,[572]6.6558,[573]6.6550,[574]6.6506,[575]6.6488,[576]6.6474,[577]6.6449,[578]6.6429,[579]6.6416,[580]6.6339,[581]6.6304,[582]6.6306,[583]6.6305,[584]6.6308,[585]6.6216,[586]6.6133,[587]6.6131,[588]6.6172,[589]6.6236,[590]6.6247,[591]6.6269,[592]6.6253,[593]6.6214,[594]6.6215,[595]6.6188,[596]6.6224,[597]6.6184,[598]6.6174,[599]6.6196,[600]6.6185,[601]6.6173,[602]6
.6227,[603]6.6240,[604]6.6267,[605]6.6298,[606]6.6316,[607]6.6323,[608]6.6270,[609]6.6270,[610]6.6306,[611]6.6279,[612]6.6297,[613]6.6241,[614]6.6166,[615]6.6077,[616]6.6093,[617]6.6009,[618]6.5941,[619]6.5868,[620]6.5708,[621]6.5625,[622]6.5601,[623]6.5617,[624]6.5614,[625]6.5608,[626]6.5594,[627]6.5634,[628]6.5628,[629]6.5624,[630]6.5658,[631]6.5707,[632]6.5768,[633]6.5751,[634]6.5783,[635]6.5790,[636]6.5765,[637]6.5732,[638]6.5749,[639]6.5715,[640]6.5731,[641]6.5729,[642]6.5797,[643]6.5822,[644]6.5830,[645]6.5809,[646]6.5849,[647]6.5844,[648]6.5861,[649]6.5860,[650]6.5894,[651]6.5940,[652]6.5949,[653]6.5985,[654]6.5921,[655]6.5914,
Final estimate: PPL = 6.5914 +/- 0.03810

llama_print_timings: load time = 2846.46 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 22401.71 ms / 335360 tokens ( 0.07 ms per token, 14970.29 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 33071.41 ms / 335361 tokens

real 0m39.310s
user 1m16.005s
sys 0m31.222s

@slaren (Collaborator, Author) commented Mar 12, 2024

You can usually get better performance with F16 models by using -ub 256. I set the default to 512 because that's the current default and it works better with quantized models. In the previous PR it was 256.

It is also possible to test the performance in a real scenario with perplexity.

@compilade (Collaborator) left a comment

This breaks Mamba, but it's fixable.

To help with fixing, I managed to at least make main work with compilade@3e06fca, but parallel still triggers an assert with Mamba. I'll investigate further.

@slaren (Collaborator, Author) commented Mar 12, 2024

@compilade thanks for testing, feel free to push your fixes here directly.

compilade and others added 3 commits March 12, 2024 14:59
Tested to work correctly with both `main` and `parallel` examples.
…r pipeline parallelism

default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage
@Engininja2 (Contributor)

hipLaunchHostFunc() was added in ROCm 5.4.1 as a beta API. Should that minimum version be added to the readme?

@slaren (Collaborator, Author) commented Mar 13, 2024

hipLaunchHostFunc is not used at the moment. It may be useful in the future for adding support for pipeline parallelism between the CPU and CUDA backends, but at the moment it is just a stub.

@compilade dismissed their stale review on March 13, 2024 at 01:20

It works properly with Mamba.

@compilade (Collaborator) left a comment

I like that the input tensors are allocated in the graph; it's much cleaner than before.

But I think the new default size of the logits buffer might be too big.

llama.cpp (outdated)
@@ -12537,7 +12582,8 @@ struct llama_context_params llama_context_default_params() {
     struct llama_context_params result = {
         /*.seed =*/ LLAMA_DEFAULT_SEED,
         /*.n_ctx =*/ 512,
-        /*.n_batch =*/ 512,
+        /*.n_batch =*/ 4096,
@compilade (Collaborator) commented Mar 13, 2024

4096 seems a bit big for a default logical batch size.
For a model with a vocab size of 50280, the logits buffer takes 50280*4096*4/1024/1024 = 785.63 MiB, while with the previous default batch size of 512, the logits buffer took 50280*512*4/1024/1024 = 98.20 MiB.
This only depends on the vocab and logical batch sizes, so the logits buffer for a small model like Mamba-130m (a 256.96 MiB model in f16) would take 3 times as much memory as the model weights with a default n_batch of 4096.

And it doesn't really fit with the default n_ctx of 512; a bigger n_batch than n_ctx won't ever be used completely (unless there's a way I didn't think of), and is thus wasted memory.

I suggest either clamping n_batch to n_ctx, or (preferably) making the default n_batch equal to the default n_ctx again.
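
As a quick sanity check of those figures (a standalone illustrative snippet, not project code):

#include <cstdio>

int main() {
    // logits buffer size = n_vocab * n_batch * sizeof(float)
    const double n_vocab = 50280;
    printf("n_batch = 512:  %.2f MiB\n", n_vocab * 512  * 4 / 1024 / 1024); // ~98.2 MiB
    printf("n_batch = 4096: %.2f MiB\n", n_vocab * 4096 * 4 / 1024 / 1024); // ~785.6 MiB
    return 0;
}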

@slaren (Collaborator, Author) commented Mar 13, 2024

A bigger batch size is what allows pipeline parallelism to work. For example, if the application submits a batch of 4096 tokens to llama_decode, it will be split into mini-batches of 512 tokens each (n_ubatch) and evaluated as a pipeline in parallel across the available GPUs. The default could be reduced to 2048 and still get most of the benefit.

@compilade (Collaborator) commented Mar 13, 2024

Pipeline parallelism by default seems desirable. But n_batch shouldn't exceed n_ctx. Even when passing --ctx-size 128 to main, n_batch is still 4096 (from the default), and the 785.63 MiB of logits are still allocated, even if they can't be used since a batch bigger than n_ctx will simply not be able to find a big enough KV slot (for Transformer-based models, at least).

Clamping n_batch to n_ctx (with something like n_batch = std::min(n_batch, n_ctx)) should fix this.

@slaren (Collaborator, Author) commented Mar 13, 2024

We can change llama_new_context_with_model to limit n_batch and n_ubatch to n_ctx, since there is no advantage to increasing it beyond that.

Ideally, we would set the defaults intelligently according to the hardware of the system, and only increase n_batch if the system can actually support pipeline parallelism, which requires several CUDA GPUs and full offload. However that would require deeper changes.
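
A minimal sketch of that clamp (a hypothetical helper; the names and placement are illustrative, not the exact code that was merged):

#include <algorithm>
#include <cstdint>

// Limit the logical batch (n_batch) to the context size, and the physical batch
// (n_ubatch) to the logical batch, since larger values cannot be used anyway.
static void clamp_batch_sizes(uint32_t n_ctx, uint32_t & n_batch, uint32_t & n_ubatch) {
    n_batch  = std::min(n_batch,  n_ctx);
    n_ubatch = std::min(n_ubatch, n_batch);
}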

@ggerganov (Owner)

"But n_batch shouldn't exceed n_ctx"

This is not always the case. When using a model that does not utilize the KV cache (for example, a non-causal embedding model like BERT), we might want to run with n_ctx = 0, n_batch = 8192.

With 4400153 applied, in such cases we now have to allocate a KV cache due to n_ctx = 8192, and it won't be used.

Given that, should we revert the clamp change?

@slaren (Collaborator, Author)

We could pre-allocate buffers for n_max_seq tokens only during initialization, and increase the size dynamically in llama_decode automatically if there is a request for more logits than that.
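
A rough illustration of that idea (hypothetical, not code from this PR; the real output buffer is a pinned host buffer rather than a std::vector):

#include <vector>

// Grow the logits storage on demand: start with space for n_max_seq rows and
// only reallocate when a decode call requests more output rows than currently fit.
struct output_logits {
    std::vector<float> data;
    size_t n_vocab = 0;

    float * reserve(size_t n_outputs) {
        if (data.size() < n_outputs * n_vocab) {
            data.resize(n_outputs * n_vocab);
        }
        return data.data();
    }
};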

@compilade (Collaborator)

"We could pre-allocate buffers for n_max_seq tokens only during initialization, and increase the size dynamically in llama_decode automatically if there is a request for more logits than that."

From what I understand, the logits aren't necessarily contiguous with each other in the output buffer, so pre-allocation and dynamic resizing could be done, but not until the layout is changed so that the logits are always contiguous, with no offset before the first used logit.

@slaren (Collaborator, Author) commented Mar 13, 2024

We could definitely improve the handling of logits. Even perplexity and imatrix only need logits for n_ctx/2 tokens. We could also skip the computation of the output layer for the tokens where logits are not needed. IIRC there was a PR about this a while ago, but it was never merged.

@ggerganov (Owner)

Maybe the std::vector<float> logits; can become a std::vector<std::pair<int, std::vector<float>>>, where the int is the token index in the batch (i.e. i_batch).
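A sketch of that container, for illustration only (the alias name is made up):

```cpp
#include <utility>
#include <vector>

// each entry pairs the token's index within the batch (i_batch) with the
// logits computed for that token (n_vocab floats)
using batch_logits = std::vector<std::pair<int, std::vector<float>>>;
```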

Collaborator Author

@slaren slaren Mar 13, 2024


Yes, I think we should do something like that, and remove llama_get_logits in favor of llama_get_logits_ith. However, logits is no longer a std::vector since it is allocated in a host buffer, which is necessary for pipeline parallelism; otherwise the copy from the GPU can cause a synchronization. It is also important that the logits are contiguous in memory when possible, to reduce the number of copies for applications such as perplexity: there is a significant performance improvement when doing just one cudaMemcpy instead of one for each token (which ends up being n_ctx/2 calls to cudaMemcpy).
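A small sketch of the resulting usage pattern from the application side, assuming the batch was built with logits requested for i_batch; the function name is illustrative:

```cpp
#include "llama.h"

// llama_decode returns without waiting for the backend; reading the logits
// through llama_get_logits_ith forces the synchronization before the data is
// accessed, so no explicit llama_synchronize call is needed here
float logit_of(llama_context * ctx, llama_batch batch, int32_t i_batch, llama_token tok) {
    llama_decode(ctx, batch);
    const float * logits = llama_get_logits_ith(ctx, i_batch); // n_vocab floats for that token
    return logits[tok];
}
```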

@slaren
Collaborator Author

slaren commented Mar 13, 2024

master with 8x GPU would be a bit slower than master with 1x GPU, since there is no parallelism at all but there is the additional overhead of copying data between GPUs. That's also the reason why tg is slower with multiple GPUs: single-token generation cannot be parallelized with -sm layer (tensor-level parallelism is supported with -sm row).

@phymbert
Collaborator

Sorry, but I don't understand where the performance is improved compared to master then. Could you please point out which line of the bench is faster?

@slaren
Collaborator Author

slaren commented Mar 13, 2024

Are you looking at the results of master or old? old refers to the previous pipeline parallelism PR, which suffered from synchronization issues and didn't always produce correct results. Performance is better than master with multiple GPUs in every pp test. For example, 7B F16 pp 2048 improved from 7113 t/s to 22931 t/s with 8 GPUs. Larger batch sizes improve performance further.

@phymbert
Collaborator

Thanks for your reply 👍. Yes, I am looking for a comparison against master. We do not see the figures for 8x GPU. Is tg also increased? By how much?

As I am working on the performance CI pipeline, I want to understand what this PR brings. Thanks.

@slaren
Collaborator Author

slaren commented Mar 13, 2024

Here is a recap of the data from @ggerganov:

| model | test | 1x GPU (t/s) | 2x GPU (t/s) | 4x GPU (t/s) | 8x GPU (t/s) |
| --- | --- | ---: | ---: | ---: | ---: |
| llama 7B F16 | pp 512 | 8237.93 | 10090.00 | 11813.02 | 12100.57 |
| llama 7B F16 | pp 1024 | 7823.11 | 11267.24 | 15387.27 | 17591.46 |
| llama 7B F16 | pp 2048 | 6974.75 | 11404.60 | 17609.80 | 22931.45 |
| llama 7B F16 | pp 4096 | 5594.67 | 10493.05 | 17618.77 | 25504.36 |
| llama 7B F16 | pp 8192 | 4527.03 | 8587.92 | 15100.46 | 23820.97 |
| llama 7B F16 | tg 128 | 73.87 | 72.08 | 71.87 | 71.87 |
| llama 7B Q8_0 | pp 512 | 6728.34 | 7276.31 | 8604.28 | 9266.34 |
| llama 7B Q8_0 | pp 1024 | 7055.80 | 8212.10 | 11357.60 | 13796.83 |
| llama 7B Q8_0 | pp 2048 | 6674.59 | 8474.57 | 13201.46 | 17820.71 |
| llama 7B Q8_0 | pp 4096 | 5496.80 | 8044.92 | 13411.17 | 20105.47 |
| llama 7B Q8_0 | pp 8192 | 4449.37 | 6922.49 | 11882.45 | 18915.61 |
| llama 7B Q8_0 | tg 128 | 114.80 | 111.61 | 110.12 | 108.56 |
| llama 7B Q4_K - Medium | pp 512 | 6125.22 | 6241.75 | 7983.52 | 7908.16 |
| llama 7B Q4_K - Medium | pp 1024 | 6690.14 | 7085.78 | 12007.61 | 11764.20 |
| llama 7B Q4_K - Medium | pp 2048 | 6501.66 | 7372.92 | 15755.47 | 15567.45 |
| llama 7B Q4_K - Medium | pp 4096 | 5427.55 | 7087.04 | 17909.72 | 17668.75 |
| llama 7B Q4_K - Medium | pp 8192 | 4404.22 | 6186.62 | 17152.45 | 16992.44 |
| llama 7B Q4_K - Medium | tg 128 | 129.97 | 126.62 | 124.27 | 122.04 |
| llama 7B Q4_0 | pp 512 | 6109.62 | 6198.75 | 7908.16 | 7908.16 |
| llama 7B Q4_0 | pp 1024 | 6678.93 | 7045.06 | 11764.20 | 11764.20 |
| llama 7B Q4_0 | pp 2048 | 6496.22 | 7345.98 | 15567.45 | 15567.45 |
| llama 7B Q4_0 | pp 4096 | 5418.54 | 7083.79 | 17909.72 | 17668.75 |
| llama 7B Q4_0 | pp 8192 | 4408.39 | 6192.83 | 17152.45 | 16992.44 |
| llama 7B Q4_0 | tg 128 | 145.15 | 140.12 | 137.21 | 135.03 |

With master the performance would be roughly the same as with 1x GPU in all the cases. tg is not improved; this optimization works by splitting large batches into multiple smaller batches and processing them in parallel.

@slaren
Collaborator Author

slaren commented Mar 13, 2024

I noticed the thread sanitizer tests failing, but the errors don't make much sense to me. The errors are intermittent; when the jobs are re-run, the tests eventually pass. I suspect that it is an issue with a particular runner.

It produces errors such as this:
FATAL: ThreadSanitizer: unexpected memory mapping 0x637684c4f000-0x637684c5b000

@ggerganov
Owner

ggerganov commented Mar 13, 2024

@phymbert I've added multi-GPU results for master into the comment - hope this makes it more clear. With this PR we now have efficient pipeline parallelization, which is important for increasing throughput on multi-GPU systems.

@slaren Yes, the thread sanitizer build failures are not related to our code (#5943 (comment))

@slaren slaren merged commit f30ea47 into master Mar 13, 2024
52 of 62 checks passed
@slaren slaren deleted the sl/pipeline-parallelism branch March 13, 2024 17:57
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 15, 2024
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@siavashmohammady66

Does it also support multiple GPUs when using the Vulkan SDK to compile the llama.cpp code? (For AMD GPUs)

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels
high priority Very important issue

6 participants