
backend : offload large batches to GPU #6083

Merged
merged 9 commits into master from sl/sched-auto-offload on Mar 18, 2024

Conversation

@slaren (Collaborator) commented Mar 15, 2024

Moves the logic for auto-offloading to the GPU when processing large batches into ggml_backend_sched. Currently only CUDA and Vulkan support this; with this change, any backend will be able to support the feature.

Instead of offloading only the matrix multiplications, the entire computation of the batch is offloaded. This reduces the amount of data that needs to be transferred between the GPU and CPU and improves performance significantly.

The weights are now copied to VRAM in the compute buffer, instead of the private CUDA pool buffer. As a result, the size of the compute buffers will increase significantly when offloading a model partially. However, the total VRAM usage should stay the same, or be slightly lower.

Backends that wish to support this feature need to implement the offload_op function. Only the CUDA backend implements it at this point.
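
For illustration only, a minimal sketch of the shape such an offload_op callback takes for a hypothetical backend (the backend and function names here are placeholders, not code from this PR; the actual CUDA implementation is quoted later in the thread):

#include "ggml.h"
#include "ggml-backend.h"

// Hedged sketch: ggml_backend_sched asks the backend, per operation, whether it wants
// to handle that op even though the weights live in another backend's buffers.
// Returning true for large batches is what triggers the automatic offload.
static bool ggml_backend_mybackend_offload_op(ggml_backend_t backend, const ggml_tensor * op) {
    (void) backend;                    // the decision here depends only on the op
    const int64_t min_batch_size = 32; // illustrative threshold
    // offload only ops whose batch dimension is large enough to amortize the weight transfer
    return op->ne[1] > min_batch_size;
}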

Additionally, the CUDA backend will now attempt to register the memory of the models as a host pinned buffer, even when using mmap. Previously, host buffers were only supported with mmap disabled. This further increases the performance of automatic offloading. The use of host pinned memory can be disabled by defining the GGML_CUDA_NO_PINNED environment variable.
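
As a rough sketch of the idea (not the exact code from the PR; the buffer pointer and size would come from the model loader, and error handling is simplified):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hedged sketch: pin an already-allocated (e.g. mmap-ed) model buffer so that weight
// uploads can use the faster pinned-memory copy path. cudaHostRegisterPortable and
// cudaHostRegisterReadOnly are the same flags the HIP patch below maps for ROCm.
static bool try_pin_host_buffer(void * buf, size_t size) {
    if (std::getenv("GGML_CUDA_NO_PINNED") != nullptr) {
        return false; // user opted out of host pinned memory
    }
    cudaError_t err = cudaHostRegister(buf, size, cudaHostRegisterPortable | cudaHostRegisterReadOnly);
    if (err != cudaSuccess) {
        // not fatal: fall back to pageable (unpinned) memory
        std::fprintf(stderr, "warning: cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}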

RTX 3090 Ti, CUDA under WSL:
[chart: bench-7b-pp1024]

[chart: bench-mixtral-pp1024]

Raw data
model size params backend ngl n_batch n_ubatch mmap test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 0 pp 1024 388.15 ± 1.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 1 pp 1024 348.58 ± 2.46
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 0 pp 1024 397.95 ± 1.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 1 pp 1024 359.64 ± 1.94
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 0 pp 1024 409.85 ± 2.36
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 1 pp 1024 370.63 ± 3.55
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 0 pp 1024 422.51 ± 2.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 1 pp 1024 380.48 ± 1.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 0 pp 1024 433.78 ± 1.37
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 1 pp 1024 392.48 ± 2.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 0 pp 1024 447.87 ± 1.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 1 pp 1024 404.52 ± 2.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 0 pp 1024 463.28 ± 1.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 1 pp 1024 418.75 ± 2.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 0 pp 1024 478.75 ± 1.40
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 1 pp 1024 430.76 ± 2.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 0 pp 1024 495.91 ± 1.76
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 1 pp 1024 447.96 ± 3.44
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 0 pp 1024 515.97 ± 1.26
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 1 pp 1024 469.55 ± 2.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 0 pp 1024 535.50 ± 2.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 1 pp 1024 485.98 ± 3.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 0 pp 1024 555.22 ± 3.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 1 pp 1024 504.74 ± 3.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 0 pp 1024 581.50 ± 3.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 1 pp 1024 529.37 ± 1.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 0 pp 1024 605.49 ± 3.89
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 1 pp 1024 550.70 ± 1.20
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 0 pp 1024 636.00 ± 3.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 1 pp 1024 574.79 ± 1.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 0 pp 1024 669.54 ± 2.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 1 pp 1024 611.74 ± 3.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 0 pp 1024 696.63 ± 5.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 1 pp 1024 638.12 ± 2.97
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 0 pp 1024 739.56 ± 3.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 1 pp 1024 678.63 ± 3.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 0 pp 1024 784.66 ± 2.44
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 1 pp 1024 713.97 ± 2.73
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 0 pp 1024 828.81 ± 3.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 1 pp 1024 759.73 ± 3.57
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 0 pp 1024 884.96 ± 4.69
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 1 pp 1024 806.80 ± 6.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 0 pp 1024 948.70 ± 5.85
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 1 pp 1024 860.19 ± 5.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 0 pp 1024 1019.88 ± 3.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 1 pp 1024 933.79 ± 4.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 0 pp 1024 1101.54 ± 4.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 1 pp 1024 1007.86 ± 4.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 0 pp 1024 1194.18 ± 3.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 1 pp 1024 1095.93 ± 15.66
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 0 pp 1024 1311.94 ± 8.63
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 1 pp 1024 1207.60 ± 10.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 0 pp 1024 1442.92 ± 14.07
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 1 pp 1024 1346.63 ± 15.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 0 pp 1024 1615.53 ± 15.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 1 pp 1024 1490.20 ± 9.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 0 pp 1024 1818.64 ± 30.18
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 1 pp 1024 1710.29 ± 17.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 0 pp 1024 2144.10 ± 21.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 1 pp 1024 1993.06 ± 27.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 0 pp 1024 2546.11 ± 19.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 1 pp 1024 2371.35 ± 35.65
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 0 pp 1024 2885.51 ± 115.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 1 pp 1024 2863.88 ± 132.35
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 0 pp 1024 3732.88 ± 206.19
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 1 pp 1024 3694.50 ± 119.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 0 pp 1024 4685.48 ± 5.83
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 1 pp 1024 4653.46 ± 45.50

build: 4755afd (2431)

model size params backend ngl n_batch n_ubatch test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 pp 1024 1178.01 ± 52.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 pp 1024 1221.25 ± 20.50
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 pp 1024 1251.01 ± 30.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 pp 1024 1294.10 ± 15.29
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 pp 1024 1299.26 ± 36.69
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 pp 1024 1313.64 ± 53.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 pp 1024 1371.72 ± 48.12
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 pp 1024 1404.57 ± 38.03
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 pp 1024 1467.46 ± 42.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 pp 1024 1512.92 ± 44.17
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 pp 1024 1561.79 ± 32.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 pp 1024 1546.95 ± 33.21
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 pp 1024 1638.92 ± 38.17
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 pp 1024 1689.80 ± 66.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 pp 1024 1770.98 ± 30.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 pp 1024 1721.52 ± 79.84
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 pp 1024 1806.18 ± 95.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 pp 1024 1924.98 ± 55.63
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 pp 1024 1969.87 ± 81.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 pp 1024 2023.63 ± 63.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 pp 1024 2105.42 ± 160.57
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 pp 1024 2224.15 ± 130.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 pp 1024 2274.62 ± 54.49
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 pp 1024 2402.49 ± 98.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 pp 1024 2598.08 ± 99.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 pp 1024 2758.21 ± 67.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 pp 1024 2788.94 ± 168.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 pp 1024 3061.96 ± 81.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 pp 1024 3219.39 ± 97.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 pp 1024 3455.13 ± 77.40
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 pp 1024 3603.32 ± 77.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 pp 1024 3886.03 ± 106.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 pp 1024 4449.24 ± 4.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 pp 1024 4622.98 ± 8.33

build: 7664a45b (2441)

model size params backend ngl n_batch n_ubatch mmap test t/s
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 0 pp 1024 134.99 ± 0.17
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 1 pp 1024 94.98 ± 0.96
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 0 pp 1024 137.99 ± 0.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 1 pp 1024 94.75 ± 6.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 0 pp 1024 140.96 ± 0.19
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 1 pp 1024 99.24 ± 0.99
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 0 pp 1024 144.41 ± 0.32
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 1 pp 1024 101.79 ± 0.87
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 0 pp 1024 147.93 ± 0.42
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 1 pp 1024 104.54 ± 1.73
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 0 pp 1024 151.50 ± 0.25
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 1 pp 1024 108.43 ± 0.68
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 0 pp 1024 155.25 ± 0.41
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 1 pp 1024 111.04 ± 0.83
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 0 pp 1024 159.65 ± 0.39
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 1 pp 1024 114.14 ± 1.36
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 0 pp 1024 164.27 ± 0.42
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 1 pp 1024 118.19 ± 0.47
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 0 pp 1024 168.53 ± 0.30
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 1 pp 1024 121.97 ± 0.78
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 0 pp 1024 173.33 ± 0.66
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 1 pp 1024 126.47 ± 0.79
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 0 pp 1024 178.72 ± 0.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 1 pp 1024 132.09 ± 1.04
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 0 pp 1024 184.57 ± 0.45
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 1 pp 1024 135.79 ± 1.24
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 0 pp 1024 190.21 ± 0.58
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 1 pp 1024 141.10 ± 1.34
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 0 pp 1024 196.32 ± 0.35
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 1 pp 1024 147.36 ± 1.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 0 pp 1024 203.56 ± 0.48
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 1 pp 1024 152.44 ± 0.84
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 0 pp 1024 209.75 ± 0.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 1 pp 1024 157.82 ± 1.16
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 0 pp 1024 217.25 ± 0.71
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 1 pp 1024 165.75 ± 0.76
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 0 pp 1024 225.32 ± 0.77
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 1 pp 1024 171.47 ± 1.00
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 0 pp 1024 233.52 ± 0.36
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 1 pp 1024 179.67 ± 1.13
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 0 pp 1024 243.01 ± 0.55
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 1 pp 1024 189.02 ± 1.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 0 pp 1024 253.07 ± 0.47
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 1 pp 1024 198.75 ± 1.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 0 pp 1024 263.99 ± 0.49
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 1 pp 1024 210.41 ± 0.96
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 0 pp 1024 276.09 ± 0.38
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 1 pp 1024 221.90 ± 0.81
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 0 pp 1024 288.64 ± 0.34
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 1 pp 1024 234.89 ± 0.71
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 0 pp 1024 303.30 ± 0.45
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 1 pp 1024 251.23 ± 0.95
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 0 pp 1024 318.69 ± 0.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 1 pp 1024 267.34 ± 1.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 0 pp 1024 336.82 ± 1.00
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 1 pp 1024 290.10 ± 0.65
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 0 pp 1024 357.83 ± 0.52
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 1 pp 1024 313.36 ± 1.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 0 pp 1024 379.97 ± 0.58
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 1 pp 1024 342.14 ± 1.99
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 0 pp 1024 405.32 ± 0.72
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 1 pp 1024 375.44 ± 2.22
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 0 pp 1024 435.00 ± 1.21
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 1 pp 1024 416.74 ± 1.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 0 pp 1024 468.47 ± 1.59
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 1 pp 1024 466.26 ± 1.79
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 0 pp 1024 475.30 ± 0.69
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 1 pp 1024 476.06 ± 0.96

build: 46acb36 (2437)

model size params backend ngl n_batch n_ubatch test t/s
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 pp 1024 241.84 ± 2.76
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 pp 1024 235.62 ± 4.85
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 pp 1024 247.34 ± 3.94
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 pp 1024 249.00 ± 2.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 pp 1024 256.21 ± 2.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 pp 1024 256.75 ± 5.63
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 pp 1024 256.88 ± 5.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 pp 1024 258.29 ± 7.07
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 pp 1024 267.07 ± 1.51
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 pp 1024 265.86 ± 4.53
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 pp 1024 275.18 ± 0.92
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 pp 1024 278.09 ± 1.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 pp 1024 282.25 ± 8.06
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 pp 1024 293.48 ± 8.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 pp 1024 295.53 ± 3.59
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 pp 1024 312.34 ± 5.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 pp 1024 316.29 ± 6.70
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 pp 1024 319.90 ± 10.61
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 pp 1024 326.59 ± 4.95
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 pp 1024 332.98 ± 4.75
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 pp 1024 344.74 ± 8.89
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 pp 1024 352.98 ± 3.65
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 pp 1024 357.80 ± 5.44
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 pp 1024 368.76 ± 5.73
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 pp 1024 374.94 ± 3.11
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 pp 1024 388.92 ± 6.49
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 pp 1024 401.82 ± 4.66
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 pp 1024 408.44 ± 6.74
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 pp 1024 422.58 ± 4.10
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 pp 1024 435.36 ± 2.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 pp 1024 435.46 ± 8.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 pp 1024 461.35 ± 1.81
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 pp 1024 476.20 ± 1.11
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 pp 1024 478.29 ± 0.65

build: 7664a45b (2441)

70B Q4_0
GPU layers Model Test t/s master t/s sl/sched-auto-offload Speedup
0 llama 70B Q4_0 pp512 47.42 75.89 1.60
0 llama 70B Q4_0 pp1024 58.25 133.77 2.30
10 llama 70B Q4_0 pp512 53.49 86.37 1.61
10 llama 70B Q4_0 pp1024 65.27 154.15 2.36
20 llama 70B Q4_0 pp512 58.38 95.88 1.64
20 llama 70B Q4_0 pp1024 73.33 167.48 2.28
30 llama 70B Q4_0 pp512 70.39 148.83 2.11
30 llama 70B Q4_0 pp1024 84.76 240.11 2.83
40 llama 70B Q4_0 pp512 85.17 178.25 2.09
40 llama 70B Q4_0 pp1024 102.42 280.74 2.74

@slaren (Collaborator, Author) commented Mar 15, 2024

This is the first step to allow the CUDA backend to free its resources when its ggml-backend objects are deleted. Currently, the CUDA backend allocates many resources as globals to support this feature.

@Artefact2 (Collaborator) commented Mar 15, 2024

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9e92acc0..13640f98 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -82,6 +82,10 @@
 #define cudaGetDeviceProperties hipGetDeviceProperties
 #define cudaGetErrorString hipGetErrorString
 #define cudaGetLastError hipGetLastError
+#define cudaHostRegister hipHostRegister
+#define cudaHostRegisterPortable hipHostRegisterPortable
+#define cudaHostRegisterReadOnly hipHostRegisterReadOnly
+#define cudaHostUnregister hipHostUnregister
 #define cudaLaunchHostFunc hipLaunchHostFunc
 #ifdef GGML_HIP_UMA
 #define cudaMalloc hipMallocManaged
model size params backend ngl test t/s
llama 13B Q4_0 6.88 GiB 13.02 B ROCm pr 0 pp 512 300.14 ± 0.29
llama 13B Q4_0 6.88 GiB 13.02 B ROCm master 0 pp 512 187.59 ± 0.21
llama 7B Q4_K - Small 24.91 GiB 46.70 B ROCm pr 0 pp 512 114.20 ± 0.32
llama 7B Q4_K - Small 24.91 GiB 46.70 B ROCm master 0 pp 512 59.93 ± 0.27

More benches here

@Dampfinchen commented Mar 15, 2024

Wow, I'm speechless. This is beyond incredible and a HUGE leap forward!

llama_print_timings:        load time =   25189,16 ms
llama_print_timings:      sample time =      72,49 ms /   180 runs   (    0,40 ms per token,  2483,10 tokens per second)
llama_print_timings: prompt eval time =   31513,38 ms /  3602 tokens (    8,75 ms per token,   114,30 tokens per second)
llama_print_timings:        eval time =   41897,48 ms /   179 runs   (  234,06 ms per token,     4,27 tokens per second)
llama_print_timings:       total time =   73539,82 ms /  3781 tokens 

Speed before this PR:

llama_print_timings:        load time =    2482,92 ms
llama_print_timings:      sample time =      69,55 ms /   180 runs   (    0,39 ms per token,  2587,99 tokens per second)
llama_print_timings: prompt eval time =   51669,64 ms /  3602 tokens (   14,34 ms per token,    69,71 tokens per second)
llama_print_timings:        eval time =   42287,08 ms /   179 runs   (  236,24 ms per token,     4,23 tokens per second)
llama_print_timings:       total time =   94085,31 ms /  3781 tokens

That's indeed double the prompt processing speed! (5 layers offloaded with an RTX 2060 laptop and Mixtral.)

Thank you so much Slaren!!

@USBhost commented Mar 15, 2024

On my A6000 (using stock settings) there's a 0.31 tokens per second eval time regression for a 70B model. This 0.31 t/s difference is consistent on just about every run.

./main -ngl 99 -m /mnt/40TB/AI/MiquMaid-v2-70B-DPO/ggml-model-Q4_K_M.gguf -p "Write a long story on why the sky is red."

Current HEAD 4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc
llama_print_timings:        load time =   14921.14 ms
llama_print_timings:      sample time =     405.60 ms /   723 runs   (    0.56 ms per token,  1782.53 tokens per second)
llama_print_timings: prompt eval time =     286.73 ms /    12 tokens (   23.89 ms per token,    41.85 tokens per second)
llama_print_timings:        eval time =   50749.65 ms /   722 runs   (   70.29 ms per token,    14.23 tokens per second)
llama_print_timings:       total time =   51652.08 ms /   734 tokens

llama_print_timings:        load time =   14844.16 ms
llama_print_timings:      sample time =     678.35 ms /  1187 runs   (    0.57 ms per token,  1749.83 tokens per second)
llama_print_timings: prompt eval time =     287.25 ms /    12 tokens (   23.94 ms per token,    41.78 tokens per second)
llama_print_timings:        eval time =   83862.37 ms /  1186 runs   (   70.71 ms per token,    14.14 tokens per second)
llama_print_timings:       total time =   85181.27 ms /  1198 tokens

llama_print_timings:        load time =   14820.03 ms
llama_print_timings:      sample time =     671.43 ms /  1194 runs   (    0.56 ms per token,  1778.30 tokens per second)
llama_print_timings: prompt eval time =     287.75 ms /    12 tokens (   23.98 ms per token,    41.70 tokens per second)
llama_print_timings:        eval time =   84489.73 ms /  1193 runs   (   70.82 ms per token,    14.12 tokens per second)
llama_print_timings:       total time =   85801.20 ms /  1205 tokens

PR
llama_print_timings:        load time =   16542.88 ms
llama_print_timings:      sample time =     578.93 ms /  1032 runs   (    0.56 ms per token,  1782.61 tokens per second)
llama_print_timings: prompt eval time =     287.61 ms /    12 tokens (   23.97 ms per token,    41.72 tokens per second)
llama_print_timings:        eval time =   74705.90 ms /  1031 runs   (   72.46 ms per token,    13.80 tokens per second)
llama_print_timings:       total time =   75894.40 ms /  1043 tokens

llama_print_timings:        load time =   16675.73 ms
llama_print_timings:      sample time =     476.32 ms /   831 runs   (    0.57 ms per token,  1744.63 tokens per second)
llama_print_timings: prompt eval time =     289.29 ms /    12 tokens (   24.11 ms per token,    41.48 tokens per second)
llama_print_timings:        eval time =   59736.20 ms /   830 runs   (   71.97 ms per token,    13.89 tokens per second)
llama_print_timings:       total time =   60767.06 ms /   842 tokens

llama_print_timings:        load time =   16597.13 ms
llama_print_timings:      sample time =     392.05 ms /   692 runs   (    0.57 ms per token,  1765.06 tokens per second)
llama_print_timings: prompt eval time =     292.76 ms /    12 tokens (   24.40 ms per token,    40.99 tokens per second)
llama_print_timings:        eval time =   49830.24 ms /   691 runs   (   72.11 ms per token,    13.87 tokens per second)
llama_print_timings:       total time =   50735.73 ms /   703 tokens

A6000 + A4000. Again there's a regression of around 0.30 t/s, and load time is also longer on this PR.
taskset -ac 0 ./main -ngl 99 -m /mnt/40TB/AI/MiquMaid-v2-70B-DPO/ggml-model-Q4_K_M.gguf -p "Write a long story on the reason why the sky is green but make it spicy."

Current HEAD
llama_print_timings:        load time =   15157.09 ms
llama_print_timings:      sample time =     341.39 ms /   587 runs   (    0.58 ms per token,  1719.45 tokens per second)
llama_print_timings: prompt eval time =     554.58 ms /    19 tokens (   29.19 ms per token,    34.26 tokens per second)
llama_print_timings:        eval time =   47648.32 ms /   586 runs   (   81.31 ms per token,    12.30 tokens per second)
llama_print_timings:       total time =   48739.75 ms /   605 tokens

PR
llama_print_timings:        load time =   16780.64 ms
llama_print_timings:      sample time =     477.61 ms /   827 runs   (    0.58 ms per token,  1731.53 tokens per second)
llama_print_timings: prompt eval time =     558.27 ms /    19 tokens (   29.38 ms per token,    34.03 tokens per second)
llama_print_timings:        eval time =   68706.55 ms /   826 runs   (   83.18 ms per token,    12.02 tokens per second)
llama_print_timings:       total time =   70025.41 ms /   845 tokens

@slaren (Collaborator, Author) commented Mar 15, 2024

@USBhost should be fixed now.

Interestingly, this was caused by an increase to GGML_SCHED_MAX_SPLITS. Increasing this constant also used to increase the size of a hash table, which needs to be cleared on every evaluation. That added enough overhead to be measurable.
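
As a toy illustration of the effect (the constants and struct below are made up for the example, not the real ggml_backend_sched internals): if the scheduler's hash table is sized from GGML_SCHED_MAX_SPLITS, the reset done before every graph evaluation costs time proportional to that constant, even when only a handful of splits are actually used.

#include <string.h>

#define SCHED_MAX_SPLITS 2048                   /* hypothetical value */
#define SCHED_HASH_SIZE  (SCHED_MAX_SPLITS * 8) /* hypothetical sizing rule */

struct sched_hash_set {
    const void * keys[SCHED_HASH_SIZE];
};

/* called before every evaluation: O(SCHED_HASH_SIZE) work regardless of how
   many splits the graph actually produces */
static void sched_hash_set_reset(struct sched_hash_set * set) {
    memset(set->keys, 0, sizeof(set->keys));
}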

@tbocek commented Mar 15, 2024

I just tried this PR. I'm not sure what fixed it, but I don't get the error reported here (#5701) with benchmark-matmult. It now completes with ROCm on a 7900 XTX. With master I still see the same abort error; with this PR it works fine.

main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            202256;     57.07
        1;       1; 11008;  4096;   128;    11542724608;            202198;     57.09
        2;       1; 11008;  4096;   128;    11542724608;            201140;     57.39
        3;       1; 11008;  4096;   128;    11542724608;            200890;     57.46
        4;       1; 11008;  4096;   128;    11542724608;            201915;     57.17
        5;       1; 11008;  4096;   128;    11542724608;            202146;     57.10
        6;       1; 11008;  4096;   128;    11542724608;            201838;     57.19
        7;       1; 11008;  4096;   128;    11542724608;            202511;     57.00
        8;       1; 11008;  4096;   128;    11542724608;            202692;     56.95
        9;       1; 11008;  4096;   128;    11542724608;            202369;     57.04

Average                                                                         57.14
=====================================================================================

@slaren (Collaborator, Author) commented Mar 15, 2024

@tbocek unfortunately that has not really been fixed. benchmark-matmult depends on the ability of the CUDA/HIP backend to offload large matrix multiplications automatically, but that is no longer done; it now requires using ggml-backend with ggml_backend_sched. So what you are measuring there is just the CPU performance.

@USBhost commented Mar 15, 2024

@USBhost should be fixed now.

Interestingly, this was caused by an increase to GGML_SCHED_MAX_SPLITS. Increasing this constant also used to increase the size of a hash table, which needs to be cleared on every evaluation. That added enough overhead to be measurable.

Yeah, that fixed it, thanks. It also feels just a tad faster than master, but that load time is still looking sus...
A6000 only.

PR
llama_print_timings:        load time =   16623.04 ms
llama_print_timings:      sample time =     236.75 ms /   419 runs   (    0.57 ms per token,  1769.81 tokens per second)
llama_print_timings: prompt eval time =     456.24 ms /    19 tokens (   24.01 ms per token,    41.64 tokens per second)
llama_print_timings:        eval time =   29054.86 ms /   418 runs   (   69.51 ms per token,    14.39 tokens per second)
llama_print_timings:       total time =   29870.07 ms /   437 tokens

llama_print_timings:        load time =   16588.26 ms
llama_print_timings:      sample time =     470.08 ms /   842 runs   (    0.56 ms per token,  1791.17 tokens per second)
llama_print_timings: prompt eval time =     455.79 ms /    19 tokens (   23.99 ms per token,    41.69 tokens per second)
llama_print_timings:        eval time =   58816.35 ms /   841 runs   (   69.94 ms per token,    14.30 tokens per second)
llama_print_timings:       total time =   59983.48 ms /   860 tokens

llama_print_timings:        load time =   16525.14 ms
llama_print_timings:      sample time =     293.29 ms /   532 runs   (    0.55 ms per token,  1813.88 tokens per second)
llama_print_timings: prompt eval time =     454.93 ms /    19 tokens (   23.94 ms per token,    41.76 tokens per second)
llama_print_timings:        eval time =   36999.50 ms /   531 runs   (   69.68 ms per token,    14.35 tokens per second)
llama_print_timings:       total time =   37902.38 ms /   550 tokens

@Dampfinchen commented:

For some reason, my computer really doesn't like this PR though. After text generation, the terminal doesn't accept any input anymore and I can't start browsers. I have to restart the machine, which takes much longer than usual. I'm using Linux Pop!_OS 22.04 LTS.

@slaren (Collaborator, Author) commented Mar 15, 2024

@Dampfinchen try setting the environment variable GGML_CUDA_NO_PINNED.

@Dampfinchen commented:

@Dampfinchen try setting the environment variable GGML_CUDA_NO_PINNED.

Yep, that fixes it! Thanks!

@slaren (Collaborator, Author) commented Mar 15, 2024

You can also try --no-mmap; it will cause less memory to be pinned, but it will still maintain the same performance.

@fgdfgfthgr-fox commented:

Using a Radeon VII, I can confirm this does offer a major speedup on prompt processing, although it does seem to reduce the token generation speed by just a bit.

@MaggotHATE (Contributor) commented:

Tested with Vulkan, partial offload (7 layers, 7B model, Q6_K version, 478 tokens of prompt). On my low-end GPU (1060 3gb) there seems to be almost no difference:
Eval speed: 34.766766 (main) vs 34.025253 (PR)
Gen speed: 2.813251 vs 2.816204
Tokens generated: 1828 vs 1505 (just for additional context).

Looks like this PR would only help with more layers offloaded (and on better hardware) - but it works so far without problems.

@slaren (Collaborator, Author) commented Mar 16, 2024

Vulkan supports offloading large batches automatically, but it has its own implementation. Only the CUDA backend supports the functionality added by this PR. Other backends will need to implement a (very simple) offload_op function to choose the operations that the backend wants to handle. This is the offload_op of the CUDA backend:

llama.cpp/ggml-cuda.cu

Lines 11391 to 11401 in dc93f5a

GGML_CALL static bool ggml_backend_cuda_offload_op(ggml_backend_t backend, const ggml_tensor * op) {
    const ggml_tensor * dst = op;
    const int min_batch_size = 32;
    if (dst->ne[1] > min_batch_size && dst->op != GGML_OP_GET_ROWS) {
        return true;
    }
    return false;
}

However, for this to work properly, backends need to be able to execute many graphs with little overhead, since this results in a very large number of graph splits (hundreds, at least one for each weight).

@MaggotHATE (Contributor) commented:

Ok, thanks for explaining - I saw

Currently, only CUDA and Vulkan support this.

and decided to test just in case.

@8XXD8 commented Mar 16, 2024

I'm not seeing any meaningful difference in prompt processing, but with -sm row I can't load a 13b_Q8 model across 3x Radeon Pro VIIs; only -sm layer works. Error:

llama_model_load: error loading model: failed to allocate buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/user/text-generation-webui/models/llama2-13b-tiefighter.Q8_0.gguf'
main: error: unable to load model

Results:

single GPU Master
llama_print_timings:        load time =    3288.75 ms
llama_print_timings:      sample time =      47.17 ms /   256 runs   (    0.18 ms per token,  5427.06 tokens per second)
llama_print_timings: prompt eval time =    1857.60 ms /   563 tokens (    3.30 ms per token,   303.08 tokens per second)
llama_print_timings:        eval time =    7460.75 ms /   255 runs   (   29.26 ms per token,    34.18 tokens per second)
llama_print_timings:       total time =    9417.34 ms /   818 tokens


single GPU PR
llama_print_timings:        load time =    4753.63 ms
llama_print_timings:      sample time =      45.71 ms /   256 runs   (    0.18 ms per token,  5600.04 tokens per second)
llama_print_timings: prompt eval time =    1863.55 ms /   563 tokens (    3.31 ms per token,   302.11 tokens per second)
llama_print_timings:        eval time =    7405.82 ms /   255 runs   (   29.04 ms per token,    34.43 tokens per second)
llama_print_timings:       total time =    9364.84 ms /   818 tokens

3X RVII Master layer split
llama_print_timings:        load time =    4752.12 ms
llama_print_timings:      sample time =      53.87 ms /   256 runs   (    0.21 ms per token,  4752.36 tokens per second)
llama_print_timings: prompt eval time =    1629.90 ms /   563 tokens (    2.90 ms per token,   345.42 tokens per second)
llama_print_timings:        eval time =    7447.81 ms /   255 runs   (   29.21 ms per token,    34.24 tokens per second)
llama_print_timings:       total time =    9193.20 ms /   818 tokens

3X RVII PR layer split
llama_print_timings:        load time =    6615.16 ms
llama_print_timings:      sample time =      59.28 ms /   256 runs   (    0.23 ms per token,  4318.20 tokens per second)
llama_print_timings: prompt eval time =    1632.83 ms /   563 tokens (    2.90 ms per token,   344.80 tokens per second)
llama_print_timings:        eval time =    7435.52 ms /   255 runs   (   29.16 ms per token,    34.29 tokens per second)
llama_print_timings:       total time =    9195.99 ms /   818 tokens

@slaren (Collaborator, Author) commented Mar 16, 2024

@8XXD8 this only affects prompt processing with partial offloading. Full offloading is unchanged. The issue with -sm row should be fixed now.

@Artefact2 (Collaborator) commented:

I think this PR breaks imatrix when partially offloading; I am getting smaller imatrix files with lots of missing info for some tensors.

Review comments were left on ggml-backend-impl.h, ggml-backend.c, ggml-cuda.cu, and ggml-cuda.h (all resolved).
@slaren (Collaborator, Author) commented Mar 16, 2024

I think this PR breaks imatrix when partially offloading

Yes, it does. When the weights are copied to the GPU, the name of the tensor is different (for example, blk.0.attn_k.weight becomes CUDA0#blk.0.attn_k.weight#0), and imatrix fails to recognize them. Not sure what the best way to fix that is yet.
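
For context, a hedged sketch of the kind of name normalization this would need, assuming the BACKEND#name#copy affix format from the example above (one of the commits in this PR, "imatrix : remove sched affix from weight names", addresses this; the code below is only illustrative, not the merged fix):

#include <string>

// Illustrative only: map "CUDA0#blk.0.attn_k.weight#0" back to "blk.0.attn_k.weight".
static std::string strip_sched_affix(std::string name) {
    // drop a trailing "#<digits>" copy index, if present
    std::string::size_type pos = name.rfind('#');
    if (pos != std::string::npos && pos + 1 < name.size() &&
        name.find_first_not_of("0123456789", pos + 1) == std::string::npos) {
        name.erase(pos);
    }
    // drop a leading "<backend name>#" prefix, if present
    pos = name.find('#');
    if (pos != std::string::npos) {
        name.erase(0, pos + 1);
    }
    return name;
}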

slaren marked this pull request as ready for review on March 17, 2024, 13:56
@slaren (Collaborator, Author) commented Mar 17, 2024

I think the ggml-ci cuda-v100 runner has some issue; the logs say no CUDA-capable device is detected. It is also failing on master.

@ggerganov (Owner) commented:

I think I fixed the drivers and restarted the job. Will review the PR tomorrow

ggerganov added the high priority (Very important issue) label on Mar 18, 2024
slaren merged commit 2bf8d0f into master on Mar 18, 2024 (63 of 69 checks passed)
slaren deleted the sl/sched-auto-offload branch on March 18, 2024, 10:03
@slaren (Collaborator, Author) commented Mar 18, 2024

@0cc4m it should be possible to adapt the Vulkan backend now to use this and remove ggml_vk_free_cpu_assist and the related code in ggml.c.

@JohannesGaessler (Collaborator) commented:

Are there plans to also implement pre-loading the data for the next layer while the current one is being processed? Since prompt processing is compute-bound, it should theoretically be possible to achieve ~100% GPU speed even at 0 GPU layers. The tradeoff would be that VRAM usage goes up, so you would be able to offload fewer layers, which in turn makes generation slower.

@slaren (Collaborator, Author) commented Mar 21, 2024

We should implement that for sure. With a large enough batch size we could get close to the batch performance of full offload, which could have a significant impact. It's not an immediate priority for me right now, but I will work on it eventually if nobody does it first.

@JohannesGaessler (Collaborator) commented:

Regarding my previous comment: some profiling data suggests that it won't be quite as simple:

[screenshot: Screenshot_20240321_014856]

With a Ryzen 5950X, 3200 MHz dual channel RAM, and an RTX 3090 the amount of time spent on memory transfers currently seems to be significantly larger than the amount of time spent on compute. Also there are still significant gaps where the GPU is idling and the CPU seems to be doing some work.

@slaren (Collaborator, Author) commented Mar 21, 2024

I don't know what batch size you are using, but with a large enough batch size, I can already see over 50% utilization with -ngl 0. The CPU work that you are seeing may be the perplexity tool computing the perplexity from the logits; when testing pipeline parallelism it was easy to get this to take over 50% of the total time.

@JohannesGaessler (Collaborator) commented Mar 21, 2024

I don't know what batch size you are using, but with a large enough batch size, I can already see over 50% utilization with -ngl 0.

I was using a batch size of 512 for the perplexity binary.

The CPU work that you are seeing may be perplexity calculating the perplexity from the logits, when testing pipeline parallelism it was easy to get this to take over 50% of the total time.

No, the area that I was showing was from the middle of the calculation. Also, I am seeing the same gaps with llama-bench. Against my initial expectation I am also seeing that llama-bench pp scales with the number of threads:

model size params backend ngl threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1 pp 512 791.24 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 2 pp 512 883.43 ± 2.58
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 3 pp 512 929.52 ± 2.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 4 pp 512 946.93 ± 1.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 5 pp 512 959.62 ± 1.45
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 6 pp 512 963.22 ± 0.65
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 7 pp 512 967.96 ± 0.75
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 8 pp 512 968.98 ± 1.42
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 9 pp 512 970.06 ± 1.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 10 pp 512 968.56 ± 1.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 11 pp 512 965.29 ± 0.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 12 pp 512 960.36 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 13 pp 512 957.10 ± 1.20
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 14 pp 512 954.39 ± 0.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 15 pp 512 950.37 ± 0.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 16 pp 512 944.65 ± 0.43

gprof for ./llama-bench --model models/opt/${model_name}-${quantization}.gguf -r 100 -ngl 0 -n 0 -t 1 suggests that the culprit is some tensor duplication:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 93.56     17.28    17.28     6464     2.67     2.81  ggml_compute_forward_dup_f32
  4.82     18.17     0.89 52953088     0.00     0.00  ggml_fp32_to_fp16_row
  0.81     18.32     0.15    51712     0.00     0.00  dequantize_row_q4_0
  0.43     18.40     0.08      101     0.79   182.53  llama_decode
  0.05     18.41     0.01  1127987     0.00     0.00  ggml_blck_size
  0.05     18.42     0.01   112846     0.00     0.00  ggml_new_tensor
  0.05     18.43     0.01    42622     0.00     0.00  ggml_backend_sched_get_tensor_backend
  0.05     18.44     0.01      101     0.10     0.10  ggml_gallocr_alloc_graph
  0.05     18.45     0.01       18     0.56     0.56  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&)
  0.05     18.46     0.01                             ggml_cuda_mul(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*)
  0.05     18.47     0.01                             ggml_backend_cuda_set_tensor_async(ggml_backend*, ggml_tensor*, void const*, unsigned long, unsigned long)
  0.00     18.47     0.00  1690128     0.00     0.00  ggml_hash_find

The total runtime was 67.02 s so ggml_compute_forward_dup_f32 took up ~25% of the total runtime.

@slaren (Collaborator, Author) commented Mar 21, 2024

It's the ggml_cpy that stores the new blocks in the KV cache. An observation: it would be possible to do the conversion to F16 on the GPU, which would reduce the amount of data that needs to be copied to the CPU and reduce the overhead of the ggml_cpy. I am surprised that you get better performance with more than 1 thread; with batch size 512 the time is probably dominated by the transfer, so the overhead of launching the threads becomes less significant.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model size params backend ngl threads n_ubatch test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1 4096 pp 4096 1694.34 ± 7.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 4096 pp 4096 3653.03 ± 3.42
diff --git a/llama.cpp b/llama.cpp
index cd7a7b8d..bd0847bb 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5428,6 +5428,10 @@ static void llm_build_kv_store(
     cb(v_cache_view, "v_cache_view", il);

     // important: storing RoPE-ed version of K in the KV cache!
+    k_cur = ggml_cast(ctx, k_cur, k_cache_view->type);
+    v_cur_t = ggml_cast(ctx, v_cur_t, v_cache_view->type);
+    ggml_build_forward_expand(graph, k_cur);
+    ggml_build_forward_expand(graph, v_cur_t);
     ggml_build_forward_expand(graph, ggml_cpy(ctx, k_cur,   k_cache_view));
     ggml_build_forward_expand(graph, ggml_cpy(ctx, v_cur_t, v_cache_view));
 }

slaren mentioned this pull request on Mar 23, 2024
probable1333 pushed a commit to probable1333/koboldcpp-rocm that referenced this pull request Mar 25, 2024
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024