Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache #6183
Conversation
What is the performance of these quants?
For LLaMA-v2-7B quantized with
So, in short, performance is OK-ish on Metal, where we need K-cache quantization less, but quite horrible on CUDA, where we need it more.
This makes the
@ikawrakow How much is the performance drop with the original implementation?
On CUDA we go from 76 t/s to 50 t/s for TG-128. Haven't measured Metal. I guess, if we want the CPU and GPU to be the same in the use case of
Yes - it would be a new quantization type in order to not break backwards compatibility, correct?
No, simply change
Yes, good idea |
Q{4,5}_{0,1} seem broken on AMD.
Did I break it, or was it broken before I added
q4_0 was already broken before. |
…ganov#6183)
* k_cache: be able to use Q5_0
* k_cache: be able to use Q5_1 on CUDA
* k_cache: be able to use Q5_0 on Metal
* k_cache: be able to use Q5_1 on Metal
* k_cache: be able to use IQ4_NL - just CUDA for now
* k_cache: be able to use IQ4_NL on Metal
* k_cache: add newly added supported types to llama-bench and CUDA supports_op
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Tested to work on CUDA and Metal (CPU worked before this PR).
Table shows PPL for LLaMA-v2-7B quantized with Q4_K_S using the available quantization types for the K cache. The "K cache size" column is for a context of 4096 tokens:
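As a rough cross-check of the "K cache size" column, the size of the K cache follows directly from the bits-per-element of each block format. Below is a minimal sketch, assuming a LLaMA-v2-7B-like shape (32 layers, 4096-wide K rows, no grouped-query attention) and the ggml block layouts with 32 elements per block; the model-shape constants are illustrative assumptions, not values stated in the PR.

```python
# Sketch: approximate K-cache size at 4096-token context per quantization
# type. Bits per element = block bytes * 8 / 32 (ggml uses 32-element blocks).

N_LAYER = 32    # assumed number of transformer layers (LLaMA-v2-7B)
N_EMBD  = 4096  # assumed K-cache row width (no GQA)
N_CTX   = 4096  # context length used in the PR's table

BITS_PER_ELEMENT = {
    "f16":    16.0,
    "q8_0":   8.5,   # 34-byte block / 32 elements
    "q5_1":   6.0,   # 24-byte block
    "q5_0":   5.5,   # 22-byte block
    "q4_1":   5.0,   # 20-byte block
    "q4_0":   4.5,   # 18-byte block
    "iq4_nl": 4.5,   # 18-byte block
}

def k_cache_mib(type_name: str) -> float:
    """Approximate K-cache size in MiB for the assumed model shape."""
    elems = N_LAYER * N_CTX * N_EMBD
    return elems * BITS_PER_ELEMENT[type_name] / 8 / (1024 ** 2)

for t in ("f16", "q8_0", "q5_1", "q5_0", "iq4_nl"):
    print(f"{t:7s} {k_cache_mib(t):7.1f} MiB")
```

Under these assumptions f16 comes out to 1024 MiB, so the 4.5-bpw types (Q4_0, IQ4_NL) cut the K cache to a bit over a quarter of that, while Q5_0/Q5_1 land in between.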