
llama-bench : use random tokens to improve accuracy with mixtral #6069

Merged: 1 commit merged into master from sl/bench-random-tokens on Mar 15, 2024

Conversation

@slaren (Collaborator) commented Mar 14, 2024

llama-bench currently does not produce accurate results with mixtral because it fills the entire prompt with the same token (BOS). This causes the same experts to be chosen repeatedly, which does not reflect real usage. With this change, llama-bench uses random tokens instead.
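
The core of the change is conceptually small: instead of filling the benchmark prompt buffer with the BOS token id, it is filled with random token ids so the MoE router sees varied inputs. A minimal standalone sketch of the idea follows; the token type, vocab size, and function name here are illustrative, not the actual llama-bench code:

```cpp
#include <cstdint>
#include <random>
#include <vector>

using token_id = int32_t;  // stand-in for llama_token

// Fill a benchmark prompt with random token ids instead of repeating BOS.
// In the real code, n_vocab and bos_id come from the loaded model.
std::vector<token_id> make_bench_prompt(int n_prompt, int n_vocab, token_id bos_id) {
    std::vector<token_id> tokens(n_prompt);
    std::mt19937 rng(1234);  // fixed seed keeps repeated runs comparable
    std::uniform_int_distribution<token_id> dist(0, n_vocab - 1);

    tokens[0] = bos_id;                 // keep BOS at the start, as a real prompt would
    for (int i = 1; i < n_prompt; ++i) {
        tokens[i] = dist(rng);          // varied ids -> varied expert routing in MoE models
    }
    return tokens;
}
```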

Current llama-bench results in master:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 512 | 189.62 ± 0.97 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 1024 | 182.17 ± 0.54 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 512 | 613.36 ± 0.81 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 1024 | 607.84 ± 0.48 |

build: 4755afd (2431)

Using main with a large, representative prompt (extracted from the text of the book Frankenstein) produces these values instead:

With -ngl 0:

llama_print_timings: prompt eval time =    8695.65 ms /   512 tokens (   16.98 ms per token,    58.88 tokens per second)
llama_print_timings: prompt eval time =   17340.24 ms /  1024 tokens (   16.93 ms per token,    59.05 tokens per second)

With -ngl 99:

llama_print_timings: prompt eval time =    1411.63 ms /   512 tokens (    2.76 ms per token,   362.70 tokens per second)
llama_print_timings: prompt eval time =    2811.67 ms /  1024 tokens (    2.75 ms per token,   364.20 tokens per second)

llama-bench after this PR:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 512 | 61.87 ± 0.26 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 1024 | 61.56 ± 0.11 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 512 | 378.69 ± 0.87 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 1024 | 377.54 ± 1.76 |

The small remaining difference compared to main is probably due to the warmup run performed by llama-bench.

Why this is important: a future change will cause all experts to be copied to VRAM during prompt processing, regardless of whether they are actually used, whereas currently only the experts that are used are copied. This change is important for understanding the performance impact of doing that.
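
To illustrate why the prompt contents matter for an MoE model like mixtral: with a top-2 router, a prompt made of one repeated token exercises the same two experts over and over, while random tokens spread the work across nearly all experts, which is closer to what a real prompt does. The sketch below uses a toy hash-based router purely for illustration; it is not Mixtral's gating network:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

// Illustrative only: count how many distinct "experts" a top-2 router touches
// when every prompt token is identical vs. when tokens are random.
constexpr int kNumExperts = 8;

std::array<int, 2> route_top2(int32_t token) {
    // Deterministic toy routing: identical tokens always map to the same expert pair.
    int e0 = int((token * 2654435761u) % kNumExperts);
    int e1 = (e0 + 1 + token % (kNumExperts - 1)) % kNumExperts;
    return {e0, e1};
}

int count_experts_used(const std::vector<int32_t> & tokens) {
    std::set<int> used;
    for (int32_t t : tokens) {
        for (int e : route_top2(t)) used.insert(e);
    }
    return (int) used.size();
}

int main() {
    std::vector<int32_t> same(512, 1);          // all tokens identical (BOS-style prompt)
    std::vector<int32_t> varied(512);           // random token ids
    std::mt19937 rng(0);
    std::uniform_int_distribution<int32_t> dist(0, 31999);
    for (auto & t : varied) t = dist(rng);

    // Identical tokens exercise only 2 of 8 toy experts; random tokens hit all 8.
    printf("identical tokens: %d experts used\n", count_experts_used(same));
    printf("random tokens:    %d experts used\n", count_experts_used(varied));
    return 0;
}
```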

@ggerganov (Owner) left a comment:

M2 Ultra

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 512 | 302.32 ± 0.54 |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 1024 | 301.49 ± 0.12 |

build: 4755afd (2431)

PR

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 512 | 275.43 ± 1.19 |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 1024 | 279.04 ± 0.67 |

build: 8281389 (2432)

@ggerganov merged commit b0bc9f4 into master on Mar 15, 2024 (58 of 63 checks passed).
@slaren deleted the sl/bench-random-tokens branch on March 15, 2024 at 10:46.
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024