lookup: complement data from context with general text statistics #5479

Merged
merged 5 commits into ggerganov:master from lookup-static-5 on Mar 23, 2024

Conversation

@JohannesGaessler (Collaborator) commented Feb 13, 2024

The goal of this PR is a proper implementation of #5398. It builds on top of #5462. The change relative to #5462 is that lookup decoding now uses not only information from the context but also a static text corpus.

More specifically, the code first tries to find a suitable draft token in the current context. Draft candidates are ranked by the product of the frequencies with which the token appears as a continuation in the context and in the text corpus (see the sketch below). If no suitable token can be drafted from the context, the code instead tries to draft a token from the text corpus only.
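
As a rough sketch (illustrative names only, not the literal code in this PR; the filters are simplified), the ranking boils down to something like this:

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

// token -> how often it was seen as the continuation of the current n-gram
typedef std::unordered_map<llama_token, int32_t> continuation_counts;

// Rank candidate draft tokens by the product of their continuation frequency in
// the current context and in the static text corpus; returns -1 if no candidate
// qualifies. The real code additionally filters by sample size and probability.
llama_token pick_draft_token(const continuation_counts & ctx_counts,
                             const continuation_counts & corpus_counts) {
    llama_token best_token = -1;
    int64_t     best_score = 0;

    for (const auto & kv : ctx_counts) {
        const llama_token token     = kv.first;
        const int64_t     n_context = kv.second;

        const auto it = corpus_counts.find(token);
        const int64_t n_corpus = it != corpus_counts.end() ? it->second : 0;

        const int64_t score = n_context * n_corpus;
        if (score > best_score) {
            best_score = score;
            best_token = token;
        }
    }
    return best_token; // on -1 the caller falls back to drafting from the corpus alone
}
```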

I use commands such as this for testing:

```sh
export model_name=miqu-70b && export quantization=q5_k_m
export nd=3
./lookup --model models/nvme/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --split-mode row --n-predict 1024 --seed 1337 --temp 0 --ignore-eos --draft $nd --color --prompt "[INST] Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of emotional and dramatic language. [/INST]"
```

On my system with 3x P40 I currently get the following results for Miqu q5_K_M:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 146 | 246 | 345 | 468 | 570 | 672 | 777 |
| n_drafted PR | 0 | 292 | 413 | 492 | 616 | 724 | 818 | 927 |
| n_accept master | 0 | 67 | 90 | 98 | 96 | 99 | 101 | 101 |
| n_accept PR | 0 | 163 | 252 | 270 | 180 | 181 | 184 | 184 |
| accept % master | - | 45.890% | 36.585% | 28.406% | 20.513% | 17.368% | 15.030% | 12.999% |
| accept % PR | - | 55.822% | 61.017% | 54.878% | 29.221% | 25.000% | 22.494% | 19.849% |
| t/s master | 8.511 | 8.032 | 8.261 | 8.143 | 7.075 | 6.891 | 6.699 | 6.484 |
| t/s PR | 8.509 | 8.640 | 9.352 | 9.475 | 8.048 | 7.898 | 7.793 | 7.645 |

Compared to master more tokens are drafted and the acceptance rate is also higher. The end-to-end speedup compared to regular decoding is up to ~1.11x. The drop when going from --draft 3 to --draft 4 comes from the generation output changing (if I remember correctly, rejection sampling does not guarantee bit-for-bit identical results). In any case, due to the low amount of compute on P40s they scale relatively poorly with batch size; I expect the speedup on e.g. 3x RTX 3090 to be much larger. I currently don't have a good instruct model for 1x RTX 3090; I'll post numbers once I do.

To use this PR, you need a lookup.bin file.
On Linux it can be generated with these commands:

```sh
./scripts/get-wikitext-103.sh
./lookup-create -m <PATH_TO_MODEL>
```

I've also uploaded the file that I generated here. Put the file into the llama.cpp root directory and run the lookup example as normal. The lookup.bin file is based on 541 MiB of text and is 352 MiB in size. However, this scaling is not linear. With a smaller 11 MiB input file lookup.bin is 20 MiB in size.

What still needs to be figured out in this PR is the best way to integrate this feature with the rest of the llama.cpp code. Currently the static lookup information is simply stored as a binary file (because I'm not familiar with how GGUF works). The creation of the file also takes a long time and is not really scalable. For the user-facing interface I imagine something like this:

```sh
./lookup-create -m <PATH_TO_MODEL> --lookup-file-static lookup_1.bin
./lookup-create -m <PATH_TO_MODEL> --lookup-file-static lookup_2.bin
./lookup-merge lookup_1.bin lookup_2.bin lookup.bin
./lookup -m <PATH_TO_MODEL> --lookup-file-static lookup.bin
```

I'm not sure about the llama.cpp API. Currently the data structures are C++ hashmaps. My understanding is that the API in llama.h is supposed to be C compatible, so I don't think they can simply be exposed. The hashmaps can be serialized to uint64_t and int32_t (which is how they're stored in the binary) but the conversion does take some time. Each n-gram size needs a separate hashmap. The context hashmaps exist for n-gram lengths 1-4; the static hashmap only exists for n-gram length 2 (1 is too unspecific, and there is not enough input data for 3). The methods needed would essentially be saving/loading the hashmaps to/from files as well as methods to initialize a hashmap and to fill it up with tokens. Feedback would be very welcome.
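
To make this concrete, the data structures and the on-disk layout look roughly like this (a sketch of the idea, not the literal code):

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

// An n-gram of up to 4 tokens packed into a uint64_t (16 bits per token),
// which makes it usable as an unordered_map key without a custom hash.
typedef uint64_t packed_ngram;

// For each n-gram: which tokens followed it and how often.
typedef std::unordered_map<llama_token, int32_t>              continuation_counts;
typedef std::unordered_map<packed_ngram, continuation_counts> ngram_cache;

// The binary file is essentially a flat dump of these maps, e.g. per entry:
//   [packed_ngram (uint64_t)] [number of continuations (int32_t)]
//   followed by [token (int32_t)] [count (int32_t)] pairs.
// Loading means rebuilding the hashmaps entry by entry, which is where the
// conversion time mentioned above goes.
```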

@kalomaze (Contributor) commented Feb 13, 2024

I think that, intuitively speaking, drafting should focus on longer phrases, sequences, etc. that are deterministically predictable in most cases. For example, “state-of-the-art” being split across multiple tokens is pretty redundant, but once that pattern is established it's trivial to predict the last half in contexts where it appears. I wonder if you can explicitly target continuations like this by avoiding drafting when the next token isn't a continuation of a word/sequence and is instead the start of a new sentence (less predictable), for example?

@JohannesGaessler (Collaborator, Author)

I tried explicitly checking for word boundaries but the results were not good. But of course it's always possible that I just did it wrong. In any case, it would also be useful to have a generic implementation since you can (in principle) use transformers for more than just text.

Comment on lines 284 to 288
```cpp
const int static_ngram_start = inp_size-2 + draft.size()-1;
uint64_t static_ngram = get_token(inp, draft, static_ngram_start);
for (int j = static_ngram_start; j < static_ngram_start + 2; ++j) {
    const uint64_t ngram_part = get_token(inp, draft, j);
    static_ngram <<= 16;
```
@ggerganov (Owner) commented Feb 13, 2024

This seems incorrect - it's calling get_token(inp, draft, static_ngram_start); twice: once before the loop on line 285 and once in the first loop iteration on line 287

Also, printing the resulting static_ngram, the major 16-bits are always 0

```
static_ngram: 0000000000000000 0010010110101010 0010010110101010 0111010011000100
 their
static_ngram: 0000000000000000 0111010011000100 0111010011000100 0000001111110001
 love
static_ngram: 0000000000000000 0000001111110001 0000001111110001 0001010011110000
 for
static_ngram: 0000000000000000 0001010011110000 0001010011110000 0000000101101011
 each
static_ngram: 0000000000000000 0000000101101011 0000000101101011 0000010011110101
 other
static_ngram: 0000000000000000 0000010011110101 0000010011110101 0000001110010100
```

@JohannesGaessler (Collaborator, Author)

You see, the trick is that I did it wrong, but because I did it consistently wrong the code still works. The static n-grams always have a size of 2 (1 is underfitting, 3 is overfitting with wikitext-103), so in actuality the major 32 bits are supposed to always be 0. The bug should only affect n-grams of size 4, but those are not very useful anyway because size 3 already overfits.

@ggerganov (Owner)

Ok I see

> The impact of the bug should only affect n-grams of size 4 but those are not very useful anyways because 3 already overfits.

My intuitive understanding is that the larger the n-gram that we can match in the corpus - the better. For example, for code completion use cases this is most certainly true due to the repetitive nature of code

@JohannesGaessler (Collaborator, Author)

For an infinitely large corpus this is definitely true. But you essentially have a bias <-> variance tradeoff. Smaller n-grams have more bias because they cannot model the text distribution as well as larger n-grams. But larger n-grams have more variance because the number of possible token combinations increases exponentially with n-gram size. Since the tokens for a continuation follow a multinomial distribution, you can estimate that for a vocabulary size of 32000 an increase of 1 in the n-gram size increases the error on the empirical token probabilities by a factor of $\sqrt{32000} \approx 178$. To keep the error constant you would need 32000 times as much data. A static n-gram size of 2 works for ~500 MB of text, so you can assume that a static n-gram size of 3 would work for ~16 TB of text (but maybe less).
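
Spelled out (a back-of-the-envelope estimate that assumes the corpus is spread roughly evenly over the $V^n$ possible n-grams, which is of course a simplification):

$$
\text{samples per } n\text{-gram} \approx \frac{N}{V^{n}}, \qquad
\sigma_{\hat p} \propto \frac{1}{\sqrt{\text{samples per } n\text{-gram}}}
\quad\Longrightarrow\quad
\frac{\sigma_{\hat p}(n+1)}{\sigma_{\hat p}(n)} \approx \sqrt{V} = \sqrt{32000} \approx 178,
$$

where $N$ is the corpus size in tokens and $V$ is the vocabulary size. Keeping the error constant therefore requires roughly $V$ times more data per unit increase in n-gram size.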

When it comes to the lookup from the context 3 vs. 4 also makes little difference because there would need to be token sequences where three tokens are the same but those immediately before and afterwards are different. And I think there simply aren't that many cases where this happens.

@ggerganov (Owner)

When it comes to generating code or highly structured data (e.g. JSON), or to summarizing/extracting information, large n-gram matches are much more frequent compared to free-form text completion. My main desire here is that the n-gram cache implementation we eventually add to llama is general enough to not have limitations on the n-gram size.

@JohannesGaessler (Collaborator, Author)

I mean, the main reason I'm encoding the tokens as 64 bit integers is that uint64_t is a hashable type. If we were to write a custom hashing function we could go up to whatever length we want.

@JohannesGaessler (Collaborator, Author) commented Feb 13, 2024

I fixed the off-by-one issue with n-gram size and uploaded a fixed lookup.bin. I also fixed a bug where wikitext-2 instead of wikitext-103 was used as the hard-coded input file. I am aware of the bug where the progress prints for wikitext-103 are incorrect.

These are my results for the fixed version:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 146 | 246 | 345 | 468 | 570 | 672 | 777 |
| n_drafted PR | 0 | 188 | 220 | 246 | 266 | 294 | 316 | 343 |
| n_accept master | 0 | 67 | 90 | 98 | 96 | 99 | 101 | 101 |
| n_accept PR | 0 | 111 | 115 | 121 | 115 | 115 | 116 | 116 |
| accept % master | - | 45.890% | 36.585% | 28.406% | 20.513% | 17.368% | 15.030% | 12.999% |
| accept % PR | - | 59.043% | 52.273% | 49.187% | 43.233% | 39.116% | 36.709% | 33.819% |
| t/s master | 8.511 | 8.032 | 8.261 | 8.143 | 7.075 | 6.891 | 6.699 | 6.484 |
| t/s PR | 8.509 | 8.826 | 8.819 | 8.865 | 8.674 | 8.627 | 8.605 | 8.555 |

I would say they are worse than the results I got before, but because the output has changed and I only tested a single prompt/seed the comparison is inconclusive. I think it would be useful to have something like the perplexity binary for the various drafting methods, where drafts are generated for a fixed input text and the program simply collects statistics on how well the drafts align with the text. It would allow you to get much more precise results regarding how well specific drafting methods or parameters work.

@JohannesGaessler (Collaborator, Author)

I tested this PR on my RTX 3090 using Yi 34b 200k RPMerge q4_K_M. The tokenizer is different, so a different lookup.bin file is needed (I uploaded it). Also, because the model tends towards shorter responses, I had to modify the prompt a little to get it to produce responses longer than 1024 tokens:

```
SYSTEM: You always use emotional and dramatic language. You describe everything in extreme detail and take your time as you slowly drive things towards its climax in a creative yet logical way.
USER: Write the first few chapters of an extremely long and elaborate romance novel about two stars. Include a lot of plot twists and a type Ia supernova. It is very important for my career that the novel is at least 100 pages long. If it's even a little shorter I will lose my job and become homeless.
ASSISTANT:
```

I get these results:

| --draft | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_drafted master | 0 | 151 | 254 | 450 | 405 | 528 | 895 | 1015 |
| n_drafted PR | 0 | 144 | 160 | 172 | 147 | 171 | 206 | 176 |
| n_accept master | 0 | 22 | 24 | 70 | 37 | 32 | 53 | 38 |
| n_accept PR | 0 | 55 | 48 | 58 | 47 | 38 | 50 | 40 |
| accept % master | - | 14.570% | 9.843% | 15.556% | 9.136% | 6.061% | 5.922% | 3.744% |
| accept % PR | - | 38.194% | 30.000% | 33.721% | 31.973% | 22.222% | 24.272% | 22.727% |
| t/s master | 29.809 | 30.057 | 29.010 | 29.036 | 28.481 | 27.726 | 25.736 | 24.794 |
| t/s PR | 29.912 | 29.579 | 29.858 | 30.891 | 30.605 | 30.392 | 29.726 | 28.734 |

Compared to Miqu the generations had fewer repetitions which is good in terms of output quality but of course also means that there is less potential for speedup from lookup decoding. So even though the t/s as a function of batch size scales much better on an RTX 3090 compared to a P40 the speedup is still low. But again, with a sample size of 1 it is very difficult to make conclusive statements.

@JohannesGaessler (Collaborator, Author)

I've been thinking, maybe you could synthetically generate the data to use for the static lookup. Basically, just generate a bunch of outputs for different seeds and use them as input for lookup-create. The issue would be getting enough compute, though. With my machine with 3x P40 running for 24 hours I could generate ~1.4 MiB of text with Miqu q5_K_M or ~4.6 MiB with Mixtral Instruct q8_0. So it would take months to generate as much text as is in Wikitext-103 (541 MiB). But of course with better hardware or smaller models you would be able to do this much faster. For dense models batched generation would also make sense. And this is also something that you could efficiently scale up or crowd-source.

I would also want to try just using a lot more data for the static lookup in order to make an n-gram size of 3 usable. Unfortunately The Pile was taken down due to copyright issues. But maybe there are other publicly available datasets that are suitable; I'll investigate.

@JohannesGaessler (Collaborator, Author)

I ran some synthetic data generation as a test and at a large sample size the rate of data generation for Mixtral q8_0 with 3xP40 is ~7.9 MiB / 24 hours. It would still take months to get even a single gigabyte of data but it seems my earlier estimate was too pessimistic.

@slaren (Collaborator) commented Feb 14, 2024

It should be possible to generate multiple sequences simultaneously with the batch API, that should be a lot faster.

@JohannesGaessler (Collaborator, Author)

Only for dense models. For MoE models the amount of data that you need to load increases by a factor of 4 for batched generation. Or if you can fit the entire model onto a single GPU you could use job scheduling software like SLURM to more efficiently parallelize the generation. Or just use better hardware than P40s.

@JohannesGaessler (Collaborator, Author)

I moved the n-gram cache code to common with a C++-compliant interface. I added a CLI argument --lookup-cache-static to specify the file that lookup-create writes to and lookup reads from. lookup-create takes the prompt as input. I also added a binary lookup-stats that evaluates the sampling on the prompt without any model evaluations (a model is still needed to specify a tokenizer).
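
Roughly, usage looks like this (hypothetical file names; -f here stands in for however the corpus/prompt text is supplied, so the exact flags may differ):

```sh
# build a static cache from a text corpus, which is passed as the "prompt"
./lookup-create -m models/model.gguf --lookup-cache-static lookup.bin -f corpus.txt

# collect drafting statistics on a test text without generating anything
./lookup-stats -m models/model.gguf --lookup-cache-static lookup.bin -f test.txt
```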

common/common.h Outdated
Comment on lines 283 to 285
```cpp
void llama_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    std::vector<llama_ngram_cache> & ncs_t1, int ngram_min, llama_ngram_cache & nc_t2);
```
@ggerganov (Owner)

What is the purpose of ncs_t1 and nc_t2?

@JohannesGaessler (Collaborator, Author)

I forgot to add a comment; "t1" means tier 1 and "t2" means tier 2. With lookup decoding there is a tradeoff between relevance (e.g. tokens in context) and sample size (e.g. Wikitext). So I think the best approach will be to first try to use tokens with high relevance (tier 1) and to then broaden the search to sources with less relevance but higher likelihood of containing the n-gram in question (tier 2, tier 3). I think the best solution (for regular text) would be to use the context as tier 1, all past user generations as tier 2, and a general text corpus as tier 3.

@JohannesGaessler (Collaborator, Author)

I generated ~8 MiB of text and used it to create a lookup cache. The results seem to be improving as more data is added but very slowly. With a month of compute the results could become usable but I don't really want to wait that long. I'm currently downloading Project Gutenberg which will hopefully yield better results.

@JohannesGaessler (Collaborator, Author)

I tried scaling up the amount of data for the static ngram cache with Project Gutenberg. The ngram size is 2. The data was obtained with lookup-stats on the Wikitext test set with a context size of 4096. The number of drafted and accepted tokens seems to scale only very weakly with the amount of input data, if at all:

[plot: lookup_n_drafted_n_accept]

The acceptance rate improves with any static ngram cache but there seems to be no indication that for 2-grams there is scaling beyond ~700 MiB of input data:

[plot: lookup_accept_percent]

Ngram caches obtained from the Wikitext training set perform better than caches from Project Gutenberg, but this may be the result of overfitting on formatting: Wikitext contains formatting that a Wikitext cache can predict much better, and Project Gutenberg contains repeating text that I may not have filtered out sufficiently.

@JohannesGaessler (Collaborator, Author)

When subjectively evaluating lookup performance on actual model outputs, the trend seems to be model outputs > Wikitext > Project Gutenberg. One possible reason is that (I think) Wikitext only contains English, whereas Project Gutenberg also includes other languages. Or I just did the data preparation wrong.

@JohannesGaessler (Collaborator, Author)

I've done some more testing. Data generated from LLaMA 2 q8_0 also seems to work for ngram caches:

[plot: lookup_n_drafted_n_accept]

[plot: lookup_accept_percent]

This type of data production is much more scalable than with a big model like Mixtral, especially because I can now use as many GPUs as I want without the GPUs even needing to run the same backend or to be on the same machine. It would also be possible to generate the data with a batch size >1 since LLaMA 2 7b is a dense model.

@JohannesGaessler (Collaborator, Author)

Today at LIPS I talked with Francois Charton from Meta AI. He suggested that, in order to get more information per generated token, I should not just generate text and take the frequencies of the tokens that were actually generated, but instead build the ngram cache based on the logits or token probabilities put out by the model. The idea is sound and should make synthetic data generation more efficient.
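
Roughly, the idea is the following (a sketch of what I might try; fractional counts would require a float-valued cache, which is an assumption here, not something the current code has):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

typedef int32_t llama_token;

// float-valued variant of the cache so that probability mass can be accumulated
typedef std::unordered_map<llama_token, float>     soft_counts;
typedef std::unordered_map<uint64_t, soft_counts>  soft_ngram_cache;

// Instead of incrementing the count of the single sampled token by 1, distribute
// one unit of count over the model's output distribution for this position.
// probs[token] is the model's probability for `token` following `ngram`.
void update_cache_soft(soft_ngram_cache & cache, uint64_t ngram, const std::vector<float> & probs) {
    soft_counts & counts = cache[ngram];
    for (llama_token token = 0; token < (llama_token) probs.size(); ++token) {
        if (probs[token] > 0.0f) {
            counts[token] += probs[token];
        }
    }
}
```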

@JohannesGaessler (Collaborator, Author)

I added a CLI argument for a dynamic lookup cache that gets updated by user generations; It seems to help a bit. The idea of using token probability distributions instead of the actually generated tokens did not seem to result in an improvement (but made the code slower and more complicated).

@JohannesGaessler (Collaborator, Author)

This is the algorithm that I have settled on (a rough code sketch follows the list):

  1. Try to find a suitable n-gram in the current context. Rank candidates by multiplying their frequency in the context with the frequency in the static lookup cache. If the top candidate passes filters for sample size and empirical probability, draft the token. Otherwise
  2. Do the same thing except for search in the dynamic lookup cache instead of the current context. The dynamic lookup cache is simply the merged caches of previous generations. If this fails
  3. Search for a suitable n-gram in the static lookup cache that is built from a huge number of tokens with little relevance to the current context but high statistical accuracy.
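
A rough sketch of this fallback order (illustrative only; `draft_from` stands for the per-tier lookup including the sample-size and probability filters and is not the actual function in this PR):

```cpp
#include <cstdint>
#include <unordered_map>

typedef int32_t  llama_token;
typedef uint64_t packed_ngram; // n-gram packed into 64 bits, as elsewhere in this PR

typedef std::unordered_map<llama_token, int32_t>              continuation_counts;
typedef std::unordered_map<packed_ngram, continuation_counts> ngram_cache;

// Per-tier lookup: rank the continuations of `ngram` found in `tier`, weighted by
// the static cache, and apply the sample-size/probability filters; -1 = no draft.
llama_token draft_from(const ngram_cache & tier, const ngram_cache & nc_static, packed_ngram ngram);

// The tiered fallback order described above.
llama_token draft_next_token(packed_ngram ngram,
                             const ngram_cache & nc_context,   // tier 1: current context
                             const ngram_cache & nc_dynamic,   // tier 2: previous generations
                             const ngram_cache & nc_static) {  // tier 3: static text corpus
    llama_token token = draft_from(nc_context, nc_static, ngram);
    if (token == -1) {
        token = draft_from(nc_dynamic, nc_static, ngram);
    }
    if (token == -1) {
        token = draft_from(nc_static, nc_static, ngram); // corpus statistics only
    }
    return token; // -1 means no token is drafted this step
}
```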

For the last week I've been generating more data for static lookup caches using 1x RTX 3090, 3x P40, and 1x RX 6800. I analyzed the scaling for up to 51.2 million tokens with --draft 3. It seems that even with this amount of data 2-gram caches are not yet saturated:

[plot: lookup_scaling_accept_per_drafted]

The acceptance rate for 2-gram and 3-gram caches decreases at first due to overfitting, and both still show scaling. 2-gram caches should be able to reach at least the same acceptance rate as 1-gram caches, so the scaling should continue for at least ~10x more data. The rate of accepted tokens per generated token is also still scaling:

[plot: lookup_scaling_accept_per_generated]

Roughly 1 in 6 tokens can be drafted correctly. With the current code and the biggest 2-gram cache I also looked at the performance when repeatedly generating tokens for the same prompt:

[plot: lookup_dynamic_scaling_rtx3090_f16]

[plot: lookup_scaling_accept_per_generated]

The command that I used is something like this:

```sh
export model_name=mistral_instruct_0.2-7b && export quantization=f16
./lookup --model models/opt/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --n-predict 1024 --ignore-eos --draft 3 --color --prompt "[INST] Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of emotional and dramatic language. [/INST]" -lcs /mnt/A2SDi/RAID1/Machine_Learning/lookup_cache/merged/mistral_0.1-7b-q8_0-100000x512.bin -lcd lookup-dynamic.bin
```

Lookup from the context, a static text corpus, and previous generations all improve the overall generation speed. The top speedup is ~1.4x. I will now focus on getting this PR into a state where it can be merged.

@JohannesGaessler marked this pull request as ready for review on March 17, 2024 at 19:52.
@JohannesGaessler (Collaborator, Author)

I would now consider this PR ready for review; I am still generating data, but from my end this PR should be fine to merge.

I also noticed that I did the tests for 1-grams and 3-grams incorrectly (but this did not affect the conclusions). These are the updated plots:

[plot: lookup_scaling_accept_per_drafted]

[plot: lookup_scaling_accept_per_generated]

Comment on lines 7 to 23
```cmake
set(TARGET lookup-create)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-merge)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

set(TARGET lookup-stats)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
```
Reviewer (Collaborator)

All of these have the same source lookup.cpp.

@JohannesGaessler (Collaborator, Author)

Thanks, at some point I refactored this and forgot to adjust those lines.

common/common.h Outdated
Comment on lines 271 to 272
```cpp
typedef uint64_t llama_ngram; // Each of the 4 16 bit sections represents a token id.
typedef std::unordered_map<llama_token, int32_t> llama_ngram_cache_part; // token -> number of times token has been seen
```
Reviewer (Collaborator)

If this is an array of 4 16-bit tokens, why not define it as such instead of using bit manipulation? However, 16 bits do not seem to be enough to store a token id; we already have models with n_vocab far above 65535, e.g. baichuan2 has 125696.

@JohannesGaessler (Collaborator, Author)

The issue is that an array is not a hashable type. Because of this you cannot use it as the key in std::unordered_map without also defining e.g. a corresponding hashing function. I would have preferred not to do this since it adds complexity but if we already have models with larger vocabularies it cannot be helped.
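
For illustration, a custom hash along these lines would remove both the 16-bit and the length-4 limitation (a sketch with made-up names, not a proposal for the final API):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

typedef int32_t llama_token;

template <size_t N>
struct token_ngram {
    std::array<llama_token, N> tokens;

    bool operator==(const token_ngram & other) const {
        return tokens == other.tokens;
    }
};

// FNV-1a style hash over the raw token ids so that token_ngram<N> can be used
// as an unordered_map key for any n-gram length and any vocabulary size.
template <size_t N>
struct token_ngram_hash {
    size_t operator()(const token_ngram<N> & ngram) const {
        uint64_t h = 0xcbf29ce484222325ull;  // FNV offset basis
        for (size_t i = 0; i < N; ++i) {
            h ^= (uint64_t)(uint32_t) ngram.tokens[i];
            h *= 0x100000001b3ull;           // FNV prime
        }
        return (size_t) h;
    }
};

template <size_t N>
using ngram_cache_n = std::unordered_map<token_ngram<N>, std::unordered_map<llama_token, int32_t>, token_ngram_hash<N>>;
```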

@ggerganov (Owner) left a review comment

I like the implementation. Regarding the C-style API - we will figure this out later. For now we can use the proposed C++ interface and try to find wider applications of the speculative sampling approach.

One minor thing to consider is to move the llama_ngram_cache functions into separate common/ngram-cache.h/.cpp files (similar to common/sampling.h/.cpp).

@JohannesGaessler (Collaborator, Author)

I rebased on top of master and moved the new code to dedicated files. Since several things have been changed and I have generated additional data I will re-run my workflow for the static ngram cache soon. If I encounter no issues I will then merge.

@JohannesGaessler force-pushed the lookup-static-5 branch 2 times, most recently from e7d1e38 to 45cb565 on March 22, 2024 at 22:15.
@JohannesGaessler merged commit 50ccaf5 into ggerganov:master on Mar 23, 2024.
56 checks passed