lookup: complement data from context with general text statistics #5479

Merged
merged 5 commits into ggerganov:master from lookup-static-5 on Mar 23, 2024

Conversation

@JohannesGaessler (Collaborator) commented Feb 13, 2024

The goal of this PR is a proper implementation of #5398. It builds on top of #5462. The change relative to #5462 is that lookup decoding now uses not only information from the context but also a static text corpus.

More specifically, the code first tries to find a suitable draft token in the current context. Draft candidates are ranked by the product of the frequencies with which the token appears as a continuation in the context and in the text corpus (see the sketch below). If no suitable token can be drafted from the context, the code instead tries to draft a token from the text corpus only.
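
As a rough sketch (illustrative names only, not the literal code in this PR; the filters are simplified), the ranking boils down to something like this:

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

// token -> how often it was seen as the continuation of the current n-gram
typedef std::unordered_map<llama_token, int32_t> continuation_counts;

// Rank candidate draft tokens by the product of their continuation frequency in
// the current context and in the static text corpus; returns -1 if no candidate
// qualifies. The real code additionally filters by sample size and probability.
llama_token pick_draft_token(const continuation_counts & ctx_counts,
                             const continuation_counts & corpus_counts) {
    llama_token best_token = -1;
    int64_t     best_score = 0;

    for (const auto & kv : ctx_counts) {
        const llama_token token     = kv.first;
        const int64_t     n_context = kv.second;

        const auto it = corpus_counts.find(token);
        const int64_t n_corpus = it != corpus_counts.end() ? it->second : 0;

        const int64_t score = n_context * n_corpus;
        if (score > best_score) {
            best_score = score;
            best_token = token;
        }
    }
    return best_token; // on -1 the caller falls back to drafting from the corpus alone
}
```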

I use commands such as this for testing:

```sh
export model_name=miqu-70b && export quantization=q5_k_m
export nd=3
./lookup --model models/nvme/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --split-mode row --n-predict 1024 --seed 1337 --temp 0 --ignore-eos --draft $nd --color --prompt "[INST] Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of emotional and dramatic language. [/INST]"
```

On my system with 3x P40 I currently get the following results for Miqu q5_K_M:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 146 | 246 | 345 | 468 | 570 | 672 | 777 |
| n_drafted PR | 0 | 292 | 413 | 492 | 616 | 724 | 818 | 927 |
| n_accept master | 0 | 67 | 90 | 98 | 96 | 99 | 101 | 101 |
| n_accept PR | 0 | 163 | 252 | 270 | 180 | 181 | 184 | 184 |
| accept % master | - | 45.890% | 36.585% | 28.406% | 20.513% | 17.368% | 15.030% | 12.999% |
| accept % PR | - | 55.822% | 61.017% | 54.878% | 29.221% | 25.000% | 22.494% | 19.849% |
| t/s master | 8.511 | 8.032 | 8.261 | 8.143 | 7.075 | 6.891 | 6.699 | 6.484 |
| t/s PR | 8.509 | 8.640 | 9.352 | 9.475 | 8.048 | 7.898 | 7.793 | 7.645 |

Compared to master more tokens are drafted and the acceptance rate is also higher. The end-to-end speedup compared to regular decoding is up to ~1.11x. The drop when going from --draft 3 to --draft 4 comes from the generation output changing (if I remember correctly, rejection sampling does not guarantee bit-for-bit identical results). In any case, due to the low amount of compute on P40s they scale relatively poorly with batch size; I expect the speedup on e.g. 3x RTX 3090 to be much larger. I currently don't have a good instruct model for 1x RTX 3090; I'll post numbers once I do.

To use this PR, you need a lookup.bin file.
On Linux it can be generated with these commands:

```sh
./scripts/get-wikitext-103.sh
./lookup-create -m <PATH_TO_MODEL>
```

I've also uploaded the file that I generated here. Put the file into the llama.cpp root directory and run the lookup example as normal. The lookup.bin file is based on 541 MiB of text and is 352 MiB in size. However, this scaling is not linear. With a smaller 11 MiB input file lookup.bin is 20 MiB in size.

What still needs to be figured out in this PR is the best way to integrate this feature with the rest of the llama.cpp code. Currently the static lookup information is simply stored as a binary file (because I'm not familiar with how GGUF works). The creation of the file also takes a long time and is not really scalable. For the user-facing interface I imagine something like this:

```sh
./lookup-create -m <PATH_TO_MODEL> --lookup-file-static lookup_1.bin
./lookup-create -m <PATH_TO_MODEL> --lookup-file-static lookup_2.bin
./lookup-merge lookup_1.bin lookup_2.bin lookup.bin
./lookup -m <PATH_TO_MODEL> --lookup-file-static lookup.bin
```

I'm not sure about the llama.cpp API. Currently the data structures are C++ hashmaps. My understanding is that the API in llama.h is supposed to be C compatible, so I don't think they can simply be exposed. The hashmaps can be serialized to uint64_t and int32_t (which is how they're stored in the binary) but the conversion does take some time. Each n-gram size needs a separate hashmap. The context hashmaps exist for n-gram lengths 1-4; the static hashmap only exists for n-gram length 2 (1 is too unspecific, and there is not enough input data for 3). The methods needed would essentially be saving/loading the hashmaps to/from files as well as methods to initialize a hashmap and to fill it up with tokens. Feedback would be very welcome.
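
To make this concrete, the data structures and the on-disk layout look roughly like this (a sketch of the idea, not the literal code):

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

// An n-gram of up to 4 tokens packed into a uint64_t (16 bits per token),
// which makes it usable as an unordered_map key without a custom hash.
typedef uint64_t packed_ngram;

// For each n-gram: which tokens followed it and how often.
typedef std::unordered_map<llama_token, int32_t>              continuation_counts;
typedef std::unordered_map<packed_ngram, continuation_counts> ngram_cache;

// The binary file is essentially a flat dump of these maps, e.g. per entry:
//   [packed_ngram (uint64_t)] [number of continuations (int32_t)]
//   followed by [token (int32_t)] [count (int32_t)] pairs.
// Loading means rebuilding the hashmaps entry by entry, which is where the
// conversion time mentioned above goes.
```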

@kalomaze (Contributor) commented Feb 13, 2024

I think that, intuitively speaking, drafting should focus on longer phrases, sequences, etc. that are deterministically predictable in most cases. For example, “state-of-the-art” being split across multiple tokens is pretty redundant, but once that pattern is established it's trivial to predict the last half in contexts where it appears. I wonder if you can explicitly target continuations like this by avoiding drafting when the next token isn't a continuation of a word/sequence and is instead the start of a new sentence (less predictable), for example?

@JohannesGaessler (Collaborator, Author)

I tried explicitly checking for word boundaries but the results were not good. But of course it's always possible that I just did it wrong. In any case, it would also be useful to have a generic implementation since you can (in principle) use transformers for more than just text.

Comment on lines 284 to 288
```cpp
const int static_ngram_start = inp_size-2 + draft.size()-1;
uint64_t static_ngram = get_token(inp, draft, static_ngram_start);
for (int j = static_ngram_start; j < static_ngram_start + 2; ++j) {
    const uint64_t ngram_part = get_token(inp, draft, j);
    static_ngram <<= 16;
```
@ggerganov (Owner) commented Feb 13, 2024

This seems incorrect - it's calling get_token(inp, draft, static_ngram_start); twice: once before the loop on line 285 and once in the first loop iteration on line 287

Also, printing the resulting static_ngram, the major 16-bits are always 0

```
static_ngram: 0000000000000000 0010010110101010 0010010110101010 0111010011000100
 their
static_ngram: 0000000000000000 0111010011000100 0111010011000100 0000001111110001
 love
static_ngram: 0000000000000000 0000001111110001 0000001111110001 0001010011110000
 for
static_ngram: 0000000000000000 0001010011110000 0001010011110000 0000000101101011
 each
static_ngram: 0000000000000000 0000000101101011 0000000101101011 0000010011110101
 other
static_ngram: 0000000000000000 0000010011110101 0000010011110101 0000001110010100
```

@JohannesGaessler (Collaborator, Author)

You see, the trick is that I did it wrong, but because I did it consistently wrong the code still works. The static n-grams always have a size of 2 (1 is underfitting, 3 is overfitting with wikitext-103), so in actuality the major 32 bits are supposed to always be 0. The bug should only affect n-grams of size 4, but those are not very useful anyway because size 3 already overfits.

@ggerganov (Owner)

Ok I see

> The impact of the bug should only affect n-grams of size 4 but those are not very useful anyways because 3 already overfits.

My intuitive understanding is that the larger the n-gram that we can match in the corpus - the better. For example, for code completion use cases this is most certainly true due to the repetitive nature of code

@JohannesGaessler (Collaborator, Author)

For an infinitely large corpus this is definitely true. But you essentially have a bias <-> variance tradeoff. Smaller n-grams have more bias because they cannot model the text distribution as well as larger n-grams. But larger n-grams have more variance because the number of possible token combinations increases exponentially with n-gram size. Since the tokens for a continuation follow a multinomial distribution, you can estimate that for a vocabulary size of 32000 an increase of 1 in the n-gram size increases the error on the empirical token probabilities by a factor of $\sqrt{32000} \approx 178$. To keep the error constant you would need 32000 times as much data. A static n-gram size of 2 works for ~500 MB of text, so you can assume that a static n-gram size of 3 would work for ~16 TB of text (but maybe less).
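
Spelled out (a back-of-the-envelope estimate that assumes the corpus is spread roughly evenly over the $V^n$ possible n-grams, which is of course a simplification):

$$
\text{samples per } n\text{-gram} \approx \frac{N}{V^{n}}, \qquad
\sigma_{\hat p} \propto \frac{1}{\sqrt{\text{samples per } n\text{-gram}}}
\quad\Longrightarrow\quad
\frac{\sigma_{\hat p}(n+1)}{\sigma_{\hat p}(n)} \approx \sqrt{V} = \sqrt{32000} \approx 178,
$$

where $N$ is the corpus size in tokens and $V$ is the vocabulary size. Keeping the error constant therefore requires roughly $V$ times more data per unit increase in n-gram size.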

When it comes to the lookup from the context 3 vs. 4 also makes little difference because there would need to be token sequences where three tokens are the same but those immediately before and afterwards are different. And I think there simply aren't that many cases where this happens.

@ggerganov (Owner)

When it comes to generating code or highly structured data (e.g. JSON), or to summarizing/extracting information, large n-gram matches are much more frequent compared to free-form text completion. My main desire here is that the n-gram cache implementation we eventually add to llama is general enough to not have limitations on the n-gram size.

@JohannesGaessler (Collaborator, Author)

I mean, the main reason I'm encoding the tokens as 64 bit integers is that uint64_t is a hashable type. If we were to write a custom hashing function we could go up to whatever length we want.

@JohannesGaessler (Collaborator, Author) commented Feb 13, 2024

I fixed the off-by-one issue with n-gram size and uploaded a fixed lookup.bin. I also fixed a bug where wikitext-2 instead of wikitext-103 was used as the hard-coded input file. I am aware of the bug where the progress prints for wikitext-103 are incorrect.

These are my results for the fixed version:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 146 | 246 | 345 | 468 | 570 | 672 | 777 |
| n_drafted PR | 0 | 188 | 220 | 246 | 266 | 294 | 316 | 343 |
| n_accept master | 0 | 67 | 90 | 98 | 96 | 99 | 101 | 101 |
| n_accept PR | 0 | 111 | 115 | 121 | 115 | 115 | 116 | 116 |
| accept % master | - | 45.890% | 36.585% | 28.406% | 20.513% | 17.368% | 15.030% | 12.999% |
| accept % PR | - | 59.043% | 52.273% | 49.187% | 43.233% | 39.116% | 36.709% | 33.819% |
| t/s master | 8.511 | 8.032 | 8.261 | 8.143 | 7.075 | 6.891 | 6.699 | 6.484 |
| t/s PR | 8.509 | 8.826 | 8.819 | 8.865 | 8.674 | 8.627 | 8.605 | 8.555 |

I would say they are worse than the results I got before, but because the output has changed and I only tested a single prompt/seed the comparison is inconclusive. I think it would be useful to have something like the perplexity binary for the various drafting methods, where drafts are generated for a fixed input text and the program simply collects statistics on how well the drafts align with the text. It would allow you to get much more precise results regarding how well specific drafting methods or parameters work.

@JohannesGaessler (Collaborator, Author)

I tested this PR on my RTX 3090 using Yi 34b 200k RPMerge q4_K_M. The tokenizer is different, so a different lookup.bin file is needed (I uploaded it). Also, because the model tends towards shorter responses, I had to modify the prompt a little to get it to produce responses longer than 1024 tokens:

```
SYSTEM: You always use emotional and dramatic language. You describe everything in extreme detail and take your time as you slowly drive things towards its climax in a creative yet logical way.
USER: Write the first few chapters of an extremely long and elaborate romance novel about two stars. Include a lot of plot twists and a type Ia supernova. It is very important for my career that the novel is at least 100 pages long. If it's even a little shorter I will lose my job and become homeless.
ASSISTANT:
```

I get these results:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 151 | 254 | 450 | 405 | 528 | 895 | 1015 |
| n_drafted PR | 0 | 144 | 160 | 172 | 147 | 171 | 206 | 176 |
| n_accept master | 0 | 22 | 24 | 70 | 37 | 32 | 53 | 38 |
| n_accept PR | 0 | 55 | 48 | 58 | 47 | 38 | 50 | 40 |
| accept % master | - | 14.570% | 9.843% | 15.556% | 9.136% | 6.061% | 5.922% | 3.744% |
| accept % PR | - | 38.194% | 30.000% | 33.721% | 31.973% | 22.222% | 24.272% | 22.727% |
| t/s master | 29.809 | 30.057 | 29.010 | 29.036 | 28.481 | 27.726 | 25.736 | 24.794 |
| t/s PR | 29.912 | 29.579 | 29.858 | 30.891 | 30.605 | 30.392 | 29.726 | 28.734 |

Compared to Miqu the generations had fewer repetitions which is good in terms of output quality but of course also means that there is less potential for speedup from lookup decoding. So even though the t/s as a function of batch size scales much better on an RTX 3090 compared to a P40 the speedup is still low. But again, with a sample size of 1 it is very difficult to make conclusive statements.

@JohannesGaessler (Collaborator, Author)

I've been thinking, maybe you could synthetically generate the data to use for the static lookup. Basically, just generate a bunch of outputs for different seeds and use them as input for lookup-create. The issue would be getting enough compute, though. With my machine with 3x P40 running for 24 hours I could generate ~1.4 MiB of text with Miqu q5_K_M or ~4.6 MiB with Mixtral Instruct q8_0. So it would take months to generate as much text as is in Wikitext-103 (541 MiB). But of course with better hardware or smaller models you would be able to do this much faster. For dense models batched generation would also make sense. And this is also something that you could efficiently scale up or crowd-source.

I would also want to try just using a lot more data for the static lookup in order to make an n-gram size of 3 usable. Unfortunately The Pile was taken down due to copyright issues. But maybe there are other publicly available datasets that are suitable; I'll investigate.

@JohannesGaessler (Collaborator, Author)

I ran some synthetic data generation as a test and at a large sample size the rate of data generation for Mixtral q8_0 with 3xP40 is ~7.9 MiB / 24 hours. It would still take months to get even a single gigabyte of data but it seems my earlier estimate was too pessimistic.

@slaren (Collaborator) commented Feb 14, 2024

It should be possible to generate multiple sequences simultaneously with the batch API, that should be a lot faster.

@JohannesGaessler (Collaborator, Author)

Only for dense models. For MoE models the amount of data that you need to load increases by a factor of 4 for batched generation. Or if you can fit the entire model onto a single GPU you could use job scheduling software like SLURM to more efficiently parallelize the generation. Or just use better hardware than P40s.

@JohannesGaessler (Collaborator, Author)

I moved the n-gram cache code to common with a C++-compliant interface. I added a CLI argument --lookup-cache-static to specify the file that lookup-create writes to and lookup reads from. lookup-create takes the prompt as input. I also added a binary lookup-stats that evaluates the sampling on the prompt without any model evaluations (a model is still needed to specify a tokenizer).
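
Roughly, usage looks like this (hypothetical file names; -f here stands in for however the corpus/prompt text is supplied, so the exact flags may differ):

```sh
# build a static cache from a text corpus, which is passed as the "prompt"
./lookup-create -m models/model.gguf --lookup-cache-static lookup.bin -f corpus.txt

# collect drafting statistics on a test text without generating anything
./lookup-stats -m models/model.gguf --lookup-cache-static lookup.bin -f test.txt
```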

common/common.h Outdated
Comment on lines 283 to 285
```cpp
void llama_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    std::vector<llama_ngram_cache> & ncs_t1, int ngram_min, llama_ngram_cache & nc_t2);
```
@ggerganov (Owner)

What is the purpose of ncs_t1 and nc_t2?

@JohannesGaessler (Collaborator, Author)

I forgot to add a comment; "t1" means tier 1 and "t2" means tier 2. With lookup decoding there is a tradeoff between relevance (e.g. tokens in context) and sample size (e.g. Wikitext). So I think the best approach will be to first try to use tokens with high relevance (tier 1) and to then broaden the search to sources with less relevance but higher likelihood of containing the n-gram in question (tier 2, tier 3). I think the best solution (for regular text) would be to use the context as tier 1, all past user generations as tier 2, and a general text corpus as tier 3.

@JohannesGaessler (Collaborator, Author)

I generated ~8 MiB of text and used it to create a lookup cache. The results seem to be improving as more data is added but very slowly. With a month of compute the results could become usable but I don't really want to wait that long. I'm currently downloading Project Gutenberg which will hopefully yield better results.

@JohannesGaessler (Collaborator, Author)

I tried scaling up the amount of data for the static ngram cache with Project Gutenberg. The ngram size is 2. The data was obtained with lookup-stats on the Wikitext test set with a context size of 4096. The number of drafted and accepted tokens seems to scale only very weakly with the amount of input data, if at all:

[plot: lookup_n_drafted_n_accept]

The acceptance rate improves with any static ngram cache but there seems to be no indication that for 2-grams there is scaling beyond ~700 MiB of input data:

[plot: lookup_accept_percent]

Ngram caches obtained from the Wikitext training set perform better than caches from Project Gutenberg, but this may be the result of overfitting on formatting: Wikitext contains formatting that a Wikitext cache can predict much better, and Project Gutenberg contains repeating text that I may not have filtered out sufficiently.

@JohannesGaessler (Collaborator, Author)

When subjectively evaluating lookup performance on actual model outputs, the trend seems to be model outputs > Wikitext > Project Gutenberg. One possible reason is that (I think) Wikitext only contains English, whereas Project Gutenberg also includes other languages. Or I just did the data preparation wrong.

@JohannesGaessler (Collaborator, Author)

I've done some more testing. Data generated from LLaMA 2 q8_0 also seems to work for ngram caches:

[plot: lookup_n_drafted_n_accept]

[plot: lookup_accept_percent]

This type of data production is much more scalable than with a big model like Mixtral, especially because I can now use as many GPUs as I want without the GPUs even needing to run the same backend or to be on the same machine. It would also be possible to generate the data with a batch size >1 since LLaMA 2 7b is a dense model.

@JohannesGaessler (Collaborator, Author)

Today at LIPS I talked with Francois Charton from Meta AI. He suggested that, in order to get more information per generated token, I should not just generate text and take the frequencies of the tokens that were actually generated, but instead build the ngram cache based on the logits or token probabilities put out by the model. The idea is sound and should make synthetic data generation more efficient.
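
Roughly, the idea is the following (a sketch of what I might try; fractional counts would require a float-valued cache, which is an assumption here, not something the current code has):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

typedef int32_t llama_token;

// float-valued variant of the cache so that probability mass can be accumulated
typedef std::unordered_map<llama_token, float>     soft_counts;
typedef std::unordered_map<uint64_t, soft_counts>  soft_ngram_cache;

// Instead of incrementing the count of the single sampled token by 1, distribute
// one unit of count over the model's output distribution for this position.
// probs[token] is the model's probability for `token` following `ngram`.
void update_cache_soft(soft_ngram_cache & cache, uint64_t ngram, const std::vector<float> & probs) {
    soft_counts & counts = cache[ngram];
    for (llama_token token = 0; token < (llama_token) probs.size(); ++token) {
        if (probs[token] > 0.0f) {
            counts[token] += probs[token];
        }
    }
}
```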

@JohannesGaessler (Collaborator, Author)

I added a CLI argument for a dynamic lookup cache that gets updated by user generations; It seems to help a bit. The idea of using token probability distributions instead of the actually generated tokens did not seem to result in an improvement (but made the code slower and more complicated).

@JohannesGaessler (Collaborator, Author)

This is the algorithm that I have settled on (a rough code sketch follows the list):

  1. Try to find a suitable n-gram in the current context. Rank candidates by multiplying their frequency in the context with the frequency in the static lookup cache. If the top candidate passes filters for sample size and empirical probability, draft the token. Otherwise
  2. Do the same thing except for search in the dynamic lookup cache instead of the current context. The dynamic lookup cache is simply the merged caches of previous generations. If this fails
  3. Search for a suitable n-gram in the static lookup cache that is built from a huge number of tokens with little relevance to the current context but high statistical accuracy.
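
A rough sketch of this fallback order (illustrative only; `draft_from` stands for the per-tier lookup including the sample-size and probability filters and is not the actual function in this PR):

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t  llama_token;
typedef uint64_t packed_ngram; // n-gram packed into 64 bits, as elsewhere in this PR

typedef std::unordered_map<llama_token, int32_t>              continuation_counts;
typedef std::unordered_map<packed_ngram, continuation_counts> ngram_cache;

// Per-tier lookup: rank the continuations of `ngram` found in `tier`, weighted by
// the static cache, and apply the sample-size/probability filters; -1 = no draft.
llama_token draft_from(const ngram_cache & tier, const ngram_cache & nc_static, packed_ngram ngram);

// The tiered fallback order described above.
llama_token draft_next_token(packed_ngram ngram,
                             const ngram_cache & nc_context,   // tier 1: current context
                             const ngram_cache & nc_dynamic,   // tier 2: previous generations
                             const ngram_cache & nc_static) {  // tier 3: static text corpus
    llama_token token = draft_from(nc_context, nc_static, ngram);
    if (token == -1) {
        token = draft_from(nc_dynamic, nc_static, ngram);
    }
    if (token == -1) {
        token = draft_from(nc_static, nc_static, ngram); // corpus statistics only
    }
    return token; // -1 means no token is drafted this step
}
```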

For the last week I've been generating more data for static lookup caches using 1x RTX 3090, 3x P40, and 1x RX 6800. I analyzed the scaling for up to 51.2 million tokens with --draft 3. It seems that even with this amount of data 2-gram caches are not yet saturated:

[plot: lookup_scaling_accept_per_drafted]

The acceptance rate for 2-gram and 3-gram caches decreases at first due to overfitting, and both still show scaling. 2-gram caches should be able to reach at least the same acceptance rate as 1-gram caches, so the scaling should continue for at least ~10x more data. The rate of accepted tokens per generated token is also still scaling:

[plot: lookup_scaling_accept_per_generated]

Roughly 1 in 6 tokens can be drafted correctly. With the current code and the biggest 2-gram cache I also looked at the performance when repeatedly generating tokens for the same prompt:

[plot: lookup_dynamic_scaling_rtx3090_f16]

[plot: lookup_scaling_accept_per_generated]

The command that I used is something like this:

```sh
export model_name=mistral_instruct_0.2-7b && export quantization=f16
./lookup --model models/opt/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --n-predict 1024 --ignore-eos --draft 3 --color --prompt "[INST] Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of emotional and dramatic language. [/INST]" -lcs /mnt/A2SDi/RAID1/Machine_Learning/lookup_cache/merged/mistral_0.1-7b-q8_0-100000x512.bin -lcd lookup-dynamic.bin
```

Lookup from the context, a static text corpus, and previous generations all improve the overall generation speed. The top speedup is ~1.4x. I will now focus on getting this PR into a state where it can be merged.

@JohannesGaessler marked this pull request as ready for review on March 17, 2024 at 19:52.
@JohannesGaessler (Collaborator, Author)

I would now consider this PR ready for review; I am still generating data, but from my end this PR should be fine to merge.

I also noticed that I did the tests for 1-grams and 3-grams incorrectly (but this did not affect the conclusions). These are the updated plots:

[plot: lookup_scaling_accept_per_drafted]

[plot: lookup_scaling_accept_per_generated]

Comment on lines 7 to 23
```cmake
set(TARGET lookup-create)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-merge)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-stats)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
```
Reviewer (Collaborator)

All of these have the same source lookup.cpp.

@JohannesGaessler (Collaborator, Author)

Thanks, at some point I refactored this and forgot to adjust those lines.

common/common.h Outdated
Comment on lines 271 to 272
```cpp
typedef uint64_t llama_ngram; // Each of the 4 16 bit sections represents a token id.
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part; // token -> number of times token has been seen
```
Reviewer (Collaborator)

If this is an array of 4 16-bit tokens, why not define it as such instead of using bit manipulation? However, 16 bits do not seem to be enough to store a token id; we already have models with n_vocab far above 65535, e.g. baichuan2 has 125696.

@JohannesGaessler (Collaborator, Author)

The issue is that an array is not a hashable type. Because of this you cannot use it as the key in std::unordered_map without also defining e.g. a corresponding hashing function. I would have preferred not to do this since it adds complexity but if we already have models with larger vocabularies it cannot be helped.
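
For illustration, a custom hash along these lines would remove both the 16-bit and the length-4 limitation (a sketch with made-up names, not a proposal for the final API):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

template <size_t N>
struct token_ngram {
    std::array<llama_token, N> tokens;

    bool operator==(const token_ngram & other) const {
        return tokens == other.tokens;
    }
};

// FNV-1a style hash over the raw token ids so that token_ngram<N> can be used
// as an unordered_map key for any n-gram length and any vocabulary size.
template <size_t N>
struct token_ngram_hash {
    size_t operator()(const token_ngram<N> & ngram) const {
        uint64_t h = 0xcbf29ce484222325ull;  // FNV offset basis
        for (size_t i = 0; i < N; ++i) {
            h ^= (uint64_t)(uint32_t) ngram.tokens[i];
            h *= 0x100000001b3ull;           // FNV prime
        }
        return (size_t) h;
    }
};

template <size_t N>
using ngram_cache_n = std::unordered_map<token_ngram<N>, std::unordered_map<llama_token, int32_t>, token_ngram_hash<N>>;
```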

@ggerganov (Owner) left a review comment

I like the implementation. Regarding the C-style API - we will figure this out later. For now we can use the proposed C++ interface and try to find wider applications of the speculative sampling approach.

One minor thing to consider is to move the llama_ngram_cache functions into separate common/ngram-cache.h/.cpp files (similar to common/sampling.h/.cpp).

@JohannesGaessler (Collaborator, Author)

I rebased on top of master and moved the new code to dedicated files. Since several things have been changed and I have generated additional data I will re-run my workflow for the static ngram cache soon. If I encounter no issues I will then merge.

@JohannesGaessler force-pushed the lookup-static-5 branch 2 times, most recently from e7d1e38 to 45cb565 on March 22, 2024 at 22:15.
@JohannesGaessler merged commit 50ccaf5 into ggerganov:master on Mar 23, 2024.
56 checks passed