Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache #6183

Merged
ikawrakow merged 7 commits into master from ik/k_cache_q5 on Mar 21, 2024

Conversation

@ikawrakow (Contributor)

Tested to work on CUDA and Metal (CPU worked before this PR).

The table shows PPL for LLaMA-v2-7B quantized with Q4_K_S, using each of the available quantization types for the K cache. The "K cache size" column is for a context of 4096 tokens:

| K cache type | K cache size | PPL |
|---|---|---|
| fp16 | 1024 MB | 5.8671 |
| Q8_0 | 544 MB | 5.8681 |
| Q5_1 | 384 MB | 5.8802 |
| Q5_0 | 352 MB | 5.8920 |
| Q4_1 | 320 MB | 5.9233 |
| IQ4_NL | 288 MB | 5.9323 |
| Q4_0 | 288 MB | 5.9790 |
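
For reference, the K cache type is selected at run time with the `-ctk` flag (the same flag used in the perplexity command quoted further down in this thread); an invocation along these lines, with an illustrative model path, should reproduce one row of the table:

```sh
# Illustrative command; the model path is a placeholder.
./perplexity -m llama-2-7b.Q4_K_S.gguf -f wiki.test.raw -c 4096 -ngl 999 -ctk q5_1
```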

ggerganov requested a review from slaren on March 20, 2024 at 18:20
@sorasoras

What is the performance of these quants?

@ikawrakow (Contributor, Author)

> What is the performance of these quants?

For LLaMA-v2-7B quantized with Q4_K_S I get:

• CUDA (running on RTX-4080)

| model | backend | ngl | threads | type_k | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_K - Small | CUDA | 100 | 1 | f16 | tg 128 | 127.57 ± 0.59 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q8_0 | tg 128 | 76.93 ± 0.26 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_1 | tg 128 | 76.66 ± 0.00 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_0 | tg 128 | 76.77 ± 0.02 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_1 | tg 128 | 77.35 ± 0.01 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | iq4_nl | tg 128 | 75.07 ± 0.36 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_0 | tg 128 | 76.86 ± 0.00 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | f16 | pp 512 | 5675.11 ± 10.20 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q8_0 | pp 512 | 5209.73 ± 4.39 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_1 | pp 512 | 5203.32 ± 6.13 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q5_0 | pp 512 | 5200.40 ± 4.07 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_1 | pp 512 | 5197.36 ± 3.78 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | iq4_nl | pp 512 | 4727.66 ± 0.86 |
| llama 7B Q4_K - Small | CUDA | 100 | 1 | q4_0 | pp 512 | 5199.24 ± 1.48 |
• Metal (running on 30-core M2 Max GPU)

| model | backend | ngl | threads | type_k | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_K - Small | Metal | 100 | 4 | f16 | tg 128 | 54.13 ± 0.19 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q8_0 | tg 128 | 53.05 ± 0.20 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_1 | tg 128 | 52.96 ± 0.10 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_0 | tg 128 | 53.24 ± 0.05 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_1 | tg 128 | 53.45 ± 0.07 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | iq4_nl | tg 128 | 51.53 ± 0.08 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_0 | tg 128 | 53.41 ± 0.11 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | f16 | pp 512 | 474.85 ± 3.24 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q8_0 | pp 512 | 474.59 ± 3.27 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_1 | pp 512 | 473.34 ± 3.52 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q5_0 | pp 512 | 472.48 ± 3.57 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_1 | pp 512 | 475.29 ± 2.37 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | iq4_nl | pp 512 | 468.29 ± 3.38 |
| llama 7B Q4_K - Small | Metal | 100 | 4 | q4_0 | pp 512 | 475.45 ± 2.10 |
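
The tables above are in llama-bench format; a sweep along these lines (model path illustrative, thread count as in the Metal run) should produce a comparable table:

```sh
# Illustrative llama-bench sweep over the supported K-cache types.
./llama-bench -m llama-2-7b.Q4_K_S.gguf -ngl 100 -t 4 \
  -ctk f16,q8_0,q5_1,q5_0,q4_1,iq4_nl,q4_0 -p 512 -n 128
```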

The IQ4_NL quantization used for the K cache is simpler than the IQ4_NL used for weight quantization, in order to keep the performance drop acceptable.

So, in short, performance is OK-ish on Metal, where we need K-cache quantization less, but quite horrible on CUDA, where we need it more.

ikawrakow merged commit 76aa30a into master on Mar 21, 2024
56 checks passed
ikawrakow deleted the ik/k_cache_q5 branch on March 21, 2024 at 07:28
@ggerganov (Owner) commented Mar 21, 2024

> The IQ4_NL quantization used for the K cache is simpler than the IQ4_NL used for weight quantization, in order to keep the performance drop acceptable.

This makes the test-backend-ops fail because the results of the CPY operation between the CPU and GPU differ:

  CPY(type_src=f32,type_dst=iq4_nl,ne=[256,4,4,4]): [CPY] NMSE = 0.004881965 > 0.000000100 FAIL
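
For context, the NMSE reported by test-backend-ops is, to my understanding, the squared difference between the two backends' outputs normalized by the energy of the reference output; a minimal sketch of the metric:

```c
#include <stddef.h>

// Normalized mean squared error between a test vector a and a reference vector b:
// NMSE = sum((a[i] - b[i])^2) / sum(b[i]^2).
// A value above the 1e-7 threshold shown above marks the CPY result as failing.
static double nmse(const float * a, const float * b, size_t n) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)a[i] - (double)b[i];
        err += d * d;
        ref += (double)b[i] * (double)b[i];
    }
    return ref > 0.0 ? err / ref : 0.0;
}
```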

@ikawrakow How much is the performance drop with the original implementation?

@ikawrakow (Contributor, Author)

> How much is the performance drop with the original implementation?

On CUDA we go from 76 t/s to 50 t/s for TG-128. Haven't measured Metal.

I guess, if we want the CPU and GPU to produce the same results for the CPY use case, it would be better to make a faster CPU version that does the same thing as CUDA/Metal. Do you agree?

@ggerganov (Owner) commented Mar 21, 2024

Yes - it would be a new quantization type in order to not break backwards compatibility, correct?

@ikawrakow (Contributor, Author)

No, simply change quantize_row_iq4_nl(), which is not used during quantization but is the from_float type trait for IQ4_NL, to do the same thing as CUDA/Metal.
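
Concretely, "the same thing as CUDA/Metal" amounts, roughly, to a per-block nearest-codebook quantization with a simple amax-derived scale, instead of the exhaustive scale search used when quantizing weights. A minimal sketch, with an illustrative codebook (the values may not match the actual kvalues_iq4nl table) and without the 4-bit packing:

```c
#include <math.h>
#include <stdint.h>

// Illustrative non-uniform 4-bit codebook; the real table (kvalues_iq4nl) lives
// in ggml and these values are only meant to show the shape of the algorithm.
static const int8_t codebook[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

// Quantize one block of 32 floats: derive the scale from the largest-magnitude
// value and map every element to the nearest codebook entry.  No scale search.
static void quantize_block_iq4_nl_sketch(const float * x, float * scale, uint8_t * q) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }
    *scale = amax > 0.0f ? max / codebook[0] : 0.0f;   // extreme value lands on the first code
    const float id = *scale != 0.0f ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float v = x[i] * id;
        int best = 0;
        float best_d = fabsf(v - (float)codebook[0]);
        for (int j = 1; j < 16; ++j) {
            const float d = fabsf(v - (float)codebook[j]);
            if (d < best_d) { best_d = d; best = j; }
        }
        q[i] = (uint8_t)best;   // 4-bit index; packing two indices per byte is omitted
    }
}
```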

@ggerganov (Owner)

Yes, good idea

@Artefact2 (Collaborator)

Q{4,5}_{0,1} seem broken on AMD. perplexity returns nan.

```
./perplexity --chunks 1 -f wiki.test.raw -c 4096 -ngl 999 -ctk q4_0 -m LLaMA2-13B-Q4_0.gguf
perplexity: tokenizing the input ..
perplexity: tokenization took 482.161 ms
perplexity: calculating perplexity over 1 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 11.25 seconds per pass - ETA 0.18 minutes
[1]nan,
Unexpected negative standard deviation of log(prob)
```

@ikawrakow (Contributor, Author)

> Q{4,5}_{0,1} seem broken on AMD. perplexity returns nan.

Did I break it, or was it broken before I added Q5_0/1 and IQ4_NL?

@Artefact2 (Collaborator) commented Mar 21, 2024

q4_0 was already broken before.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (ggerganov#6183)

* k_cache: be able to use Q5_0

* k_cache: be able to use Q5_1 on CUDA

* k_cache: be able to use Q5_0 on Metal

* k_cache: be able to use Q5_1 on Metal

* k_cache: be able to use IQ4_NL - just CUDA for now

* k_cache: be able to use IQ4_NL on Metal

* k_cache: add newly added supported types to llama-bench and CUDA supports_op

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024