Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache #6183

Merged
ikawrakow merged 7 commits into master from ik/k_cache_q5 on Mar 21, 2024

Conversation

@ikawrakow (Contributor)

Tested to work on CUDA and Metal (CPU worked before this PR).

The table shows PPL for LLaMA-v2-7B quantized with Q4_K_S, using each of the available quantization types for the K cache. The "K cache size" column is for a context of 4096 tokens:

| K cache type | K cache size | PPL |
|---|---|---|
| fp16 | 1024 MB | 5.8671 |
| Q8_0 | 544 MB | 5.8681 |
| Q5_1 | 384 MB | 5.8802 |
| Q5_0 | 352 MB | 5.8920 |
| Q4_1 | 320 MB | 5.9233 |
| IQ4_NL | 288 MB | 5.9323 |
| Q4_0 | 288 MB | 5.9790 |
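
For reference, the K cache type is selected at run time with the `-ctk` flag (the same flag used in the perplexity command quoted further down in this thread); an invocation along these lines, with an illustrative model path, should reproduce one row of the table:

```sh
# Illustrative command; the model path is a placeholder.
./perplexity -m llama-2-7b.Q4_K_S.gguf -f wiki.test.raw -c 4096 -ngl 999 -ctk q5_1
```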

ggerganov requested a review from slaren on March 20, 2024 at 18:20
@sorasoras

What is the performance of these quants?

@ikawrakow (Contributor, Author)

> What is the performance of these quants?

For LLaMA-v2-7B quantized with Q4_K_S I get:

• CUDA (running on RTX-4080)

| model | backend | ngl | threads | type_k | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_K - Small | CUDA | 100 | 1 | f16 | tg 128 | 127.57 ± 0.59 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q8_0 | tg 128 | 76.93 ± 0.26 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_1 | tg 128 | 76.66 ± 0.00 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_0 | tg 128 | 76.77 ± 0.02 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_1 | tg 128 | 77.35 ± 0.01 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | iq4_nl | tg 128 | 75.07 ± 0.36 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_0 | tg 128 | 76.86 ± 0.00 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | f16 | pp 512 | 5675.11 ± 10.20 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q8_0 | pp 512 | 5209.73 ± 4.39 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_1 | pp 512 | 5203.32 ± 6.13 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_0 | pp 512 | 5200.40 ± 4.07 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_1 | pp 512 | 5197.36 ± 3.78 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | iq4_nl | pp 512 | 4727.66 ± 0.86 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_0 | pp 512 | 5199.24 ± 1.48 |
• Metal (running on 30-core M2 Max GPU)

| model | backend | ngl | threads | type_k | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_K - Small | Metal | 100 | 4 | f16 | tg 128 | 54.13 ± 0.19 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q8_0 | tg 128 | 53.05 ± 0.20 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_1 | tg 128 | 52.96 ± 0.10 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_0 | tg 128 | 53.24 ± 0.05 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_1 | tg 128 | 53.45 ± 0.07 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | iq4_nl | tg 128 | 51.53 ± 0.08 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_0 | tg 128 | 53.41 ± 0.11 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | f16 | pp 512 | 474.85 ± 3.24 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q8_0 | pp 512 | 474.59 ± 3.27 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_1 | pp 512 | 473.34 ± 3.52 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_0 | pp 512 | 472.48 ± 3.57 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_1 | pp 512 | 475.29 ± 2.37 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | iq4_nl | pp 512 | 468.29 ± 3.38 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_0 | pp 512 | 475.45 ± 2.10 |
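
The tables above are in llama-bench format; a sweep along these lines (model path illustrative, thread count as in the Metal run) should produce a comparable table:

```sh
# Illustrative llama-bench sweep over the supported K-cache types.
./llama-bench -m llama-2-7b.Q4_K_S.gguf -ngl 100 -t 4 \
  -ctk f16,q8_0,q5_1,q5_0,q4_1,iq4_nl,q4_0 -p 512 -n 128
```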

The IQ4_NL quantization used for the K cache is simpler than the IQ4_NL used for weight quantization, in order to keep the performance drop acceptable.

So, in short, performance is OK-ish on Metal, where we need K-cache quantization less, but quite horrible on CUDA, where we need it more.

ikawrakow merged commit 76aa30a into master on Mar 21, 2024
56 checks passed
ikawrakow deleted the ik/k_cache_q5 branch on March 21, 2024 at 07:28
@ggerganov (Owner) commented Mar 21, 2024

> The IQ4_NL quantization used for the K cache is simpler than the IQ4_NL used for weight quantization, in order to keep the performance drop acceptable.

This makes the test-backend-ops fail because the results of the CPY operation between the CPU and GPU differ:

  CPY(type_src=f32,type_dst=iq4_nl,ne=[256,4,4,4]): [CPY] NMSE = 0.004881965 > 0.000000100 FAIL
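
For context, the NMSE reported by test-backend-ops is, to my understanding, the squared difference between the two backends' outputs normalized by the energy of the reference output; a minimal sketch of the metric:

```c
#include <stddef.h>

// Normalized mean squared error between a test vector a and a reference vector b:
// NMSE = sum((a[i] - b[i])^2) / sum(b[i]^2).
// A value above the 1e-7 threshold shown above marks the CPY result as failing.
static double nmse(const float * a, const float * b, size_t n) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)a[i] - (double)b[i];
        err += d * d;
        ref += (double)b[i] * (double)b[i];
    }
    return ref > 0.0 ? err / ref : 0.0;
}
```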

@ikawrakow How much is the performance drop with the original implementation?

@ikawrakow (Contributor, Author)

> How much is the performance drop with the original implementation?

On CUDA we go from 76 t/s to 50 t/s for TG-128. Haven't measured Metal.

I guess, if we want the CPU and GPU to produce the same results for the CPY use case, it would be better to make a faster CPU version that does the same thing as CUDA/Metal. Do you agree?

@ggerganov (Owner) commented Mar 21, 2024

Yes - it would be a new quantization type in order to not break backwards compatibility, correct?

@ikawrakow (Contributor, Author)

No, simply change quantize_row_iq4_nl(), which is not used during quantization but is the from_float type trait for IQ4_NL, to do the same thing as CUDA/Metal.
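
Concretely, "the same thing as CUDA/Metal" amounts, roughly, to a per-block nearest-codebook quantization with a simple amax-derived scale, instead of the exhaustive scale search used when quantizing weights. A minimal sketch, with an illustrative codebook (the values may not match the actual kvalues_iq4nl table) and without the 4-bit packing:

```c
#include <math.h>
#include <stdint.h>

// Illustrative non-uniform 4-bit codebook; the real table (kvalues_iq4nl) lives
// in ggml and these values are only meant to show the shape of the algorithm.
static const int8_t codebook[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

// Quantize one block of 32 floats: derive the scale from the largest-magnitude
// value and map every element to the nearest codebook entry.  No scale search.
static void quantize_block_iq4_nl_sketch(const float * x, float * scale, uint8_t * q) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }
    *scale = amax > 0.0f ? max / codebook[0] : 0.0f;   // extreme value lands on the first code
    const float id = *scale != 0.0f ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float v = x[i] * id;
        int best = 0;
        float best_d = fabsf(v - (float)codebook[0]);
        for (int j = 1; j < 16; ++j) {
            const float d = fabsf(v - (float)codebook[j]);
            if (d < best_d) { best_d = d; best = j; }
        }
        q[i] = (uint8_t)best;   // 4-bit index; packing two indices per byte is omitted
    }
}
```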

@ggerganov (Owner)

Yes, good idea

@Artefact2 (Collaborator)

Q{4,5}_{0,1} seem broken on AMD. perplexity returns nan.

```
./perplexity --chunks 1 -f wiki.test.raw -c 4096 -ngl 999 -ctk q4_0 -m LLaMA2-13B-Q4_0.gguf
perplexity: tokenizing the input ..
perplexity: tokenization took 482.161 ms
perplexity: calculating perplexity over 1 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 11.25 seconds per pass - ETA 0.18 minutes
[1]nan,
Unexpected negative standard deviation of log(prob)
```

@ikawrakow (Contributor, Author)

> Q{4,5}_{0,1} seem broken on AMD. perplexity returns nan.

Did I break it, or was it broken before I added Q5_0/1 and IQ4_NL?

@Artefact2 (Collaborator) commented Mar 21, 2024

q4_0 was already broken before.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (ggerganov#6183)

* k_cache: be able to use Q5_0

* k_cache: be able to use Q5_1 on CUDA

* k_cache: be able to use Q5_0 on Metal

* k_cache: be able to use Q5_1 on Metal

* k_cache: be able to use IQ4_NL - just CUDA for now

* k_cache: be able to use IQ4_NL on Metal

* k_cache: add newly added supported types to llama-bench and CUDA supports_op

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024