
llama_model_loader: support multiple split/shard GGUFs #6187

Merged
merged 28 commits into master on Mar 22, 2024

Conversation

@phymbert (Collaborator) commented Mar 20, 2024

Motivation

Since #6135 added the gguf-split CLI, it is a good time to support loading a model whose weights are stored across multiple (potentially distributed) GGUFs. For example, we can expect the Grok-1 weights not to fit easily inside a single GGUF.

This change allows loading a model regardless of whether it is bundled in a single GGUF or in multiple GGUFs generated with gguf-split.

Changes

  • each file is memory-mapped at a distinct address, so tensors are no longer contiguous in memory
  • backends that support mmap, such as CPU and Metal, now use a separate backend buffer for each file
  • introduce llama_split_path and llama_split_prefix so downstream tools can generate their own GGUF splits using the same file name convention: "%s-%05d-of-%05d.gguf" (see the sketch after this list)
  • rename GGUF KV general.split to split.no and general.split_count to split.count, and add split.tensors.count. Splits created with the previous keys will not load here: use gguf-split from d0d5de4 to merge them first, then split again with the master version
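
A minimal sketch of how a downstream tool could build split file names with that convention. The format string is the one above; the helper name and example prefix are illustrative only (the actual helpers added by this PR are llama_split_path and llama_split_prefix in llama.h):

```cpp
#include <cstdio>
#include <string>

// Illustrative helper (not the llama.h API): format the name of shard
// `split_no` out of `split_count` using the "%s-%05d-of-%05d.gguf" convention.
// Shards are numbered starting at 1 in the file name: 00001-of-00006, ...
static std::string make_split_path(const std::string & prefix, int split_no, int split_count) {
    char path[512];
    snprintf(path, sizeof(path), "%s-%05d-of-%05d.gguf", prefix.c_str(), split_no, split_count);
    return path;
}

int main() {
    for (int i = 1; i <= 6; i++) {
        printf("%s\n", make_split_path("ggml-model-q4_0-split", i, 6).c_str());
    }
    // prints ggml-model-q4_0-split-00001-of-00006.gguf ... ggml-model-q4_0-split-00006-of-00006.gguf
    return 0;
}
```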

Tests

  1. Download
cd models
../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
  2. Split
gguf-split --split --split-max-tensors 64 models/ggml-model-q4_0.gguf ggml-model-q4_0-split
  3. Load
main --model models/ggml-model-q4_0-split-00001-of-00006.gguf -ngl 33 --random-prompt

You will notice the new log line: llama_model_loader: additional 6 GGUFs metadata loaded.

  4. Merge it back (no longer necessary for loading)
gguf-split --merge models/ggml-model-q4_0-split-00001-of-00006.gguf models/ggml-model-q4_0-merge.gguf
  5. Confirm a single GGUF still works
main --model models/ggml-model-q4_0-merge.gguf -ngl 33 --random-prompt

References

CI Builds

Tasks

  • works on CPU backend
  • works on CUDA backend with all layers offloaded
  • works on CUDA backend with half of the layers offloaded
  • works on Metal

Special thanks to @slaren and @ngxson for supporting me in this effort.

llama_model_loader: PR feedbacks:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts
@ggerganov (Owner) left a comment

There is something not right when mmap is enabled and -ngl > 0 (at least with Metal). Using LLaMA 7B Q4_0:

llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   933.69 MiB, (  933.75 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   933.69 MiB, ( 1867.44 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   933.69 MiB, ( 2801.12 / 147456.00)
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size =   933.69 MiB
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size =   933.69 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      Metal buffer size =   933.69 MiB
llm_load_tensors:      Metal buffer size =   933.69 MiB
llm_load_tensors:      Metal buffer size =   933.69 MiB
.................................................................llama_model_load: error loading model: vector
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './x-00001-of-00005.gguf'
main: error: unable to load model

phymbert and others added 2 commits March 21, 2024 20:50
…nsor optional

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@phymbert (Collaborator, Author) commented:
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 933.75 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 1867.44 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 2801.12 / 147456.00)
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU

So it fails to allocate 2 out of 5 Metal buffers; I think we should stop there. It then tries to load the mapping into a buffer that does not exist.
@slaren Do you know why ggml_backend_metal_buffer_from_ptr can fail when ggml_backend_cpu_buffer_from_ptr succeeded just before?

@phymbert (Collaborator, Author) commented Mar 21, 2024

So it fails to allocate 2 out of 5 Metal buffers; I think we should stop there. It then tries to load the mapping into a buffer that does not exist. @slaren Do you know why ggml_backend_metal_buffer_from_ptr can fail when ggml_backend_cpu_buffer_from_ptr succeeded just before?

@ggerganov Should we accept the case where we cannot allocate n_split Metal buffers? It means the Metal backend will not have all weights loaded; is that ok?

@slaren (Collaborator) commented Mar 21, 2024

I think there is something wrong. It should only require one CPU buffer, since there is only one tensor allocated on the CPU.

The first-last range is probably wrong, and it is causing a buffer overflow.
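
To make the first-last range concrete: for each split file, the loader computes the byte range that covers the tensors of a given context inside that file's mmap and wraps that range in a backend buffer. Below is a minimal self-contained sketch of that computation (hypothetical struct and variable names, not the actual llama.cpp code); the fix posted in the next comment adds exactly this kind of per-file filter:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified sketch: each weight records which split file it lives in and its
// byte range inside that file's memory mapping.
struct weight_info {
    int    idx;   // index of the split file containing this tensor
    size_t offs;  // offset of the tensor data within that file
    size_t size;  // ggml_nbytes(tensor)
};

// Compute the [first, last) byte range a backend buffer must cover for the
// tensors stored in one split file. Skipping weights that live in another file
// is the key point: their offsets would otherwise inflate `last` beyond the
// end of this file's mapping, causing the overflow described above.
static void get_mapping_range(const std::vector<weight_info> & weights, int file_idx,
                              size_t * first, size_t * last) {
    *first = SIZE_MAX;
    *last  = 0;
    for (const auto & w : weights) {
        if (w.idx != file_idx) {
            continue;
        }
        *first = std::min(*first, w.offs);
        *last  = std::max(*last,  w.offs + w.size);
    }
}

int main() {
    // two tensors in file 0, one in file 1
    std::vector<weight_info> weights = {{0, 0, 1024}, {0, 4096, 2048}, {1, 0, 512}};
    size_t first, last;
    get_mapping_range(weights, 0, &first, &last);
    printf("file 0: [%zu, %zu)\n", first, last);  // [0, 6144)
    get_mapping_range(weights, 1, &first, &last);
    printf("file 1: [%zu, %zu)\n", first, last);  // [0, 512)
    // if no tensor of the context lives in a file, first >= last and the
    // caller should skip creating a buffer for that file
    return 0;
}
```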

@slaren (Collaborator) commented Mar 21, 2024

This should fix it:

diff --git a/llama.cpp b/llama.cpp
index cd20ad7a..2b6a5e9e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3199,6 +3199,9 @@ struct llama_model_loader {
         *addr = mapping->addr;
         for (ggml_tensor * tensor = ggml_get_first_tensor(ctx); tensor; tensor = ggml_get_next_tensor(ctx, tensor)) {
             const auto & w = get_weights(ggml_get_name(tensor));
+            if (w.idx != idx) {
+                continue;
+            }
             *first = std::min(*first, w.offs);
             *last  = std::max(*last, w.offs + ggml_nbytes(tensor));
         }
@@ -5145,6 +5148,9 @@ static bool llm_load_tensors(
                 void * addr = nullptr;
                 size_t first, last;
                 ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+                if (first >= last) {
+                    continue;
+                }
                 ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr((char *)addr + first, last - first);
                 if (buf != nullptr) {
                     bufs.push_back(buf);
@@ -5167,6 +5173,9 @@ static bool llm_load_tensors(
                 void * addr = nullptr;
                 size_t first, last;
                 ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+                if (first >= last) {
+                    continue;
+                }
                 ggml_backend_buffer_t buf = ggml_backend_metal_buffer_from_ptr((char *) addr + first, last - first, max_size);
                 if (buf != nullptr) {
                     bufs.push_back(buf);

Maybe we need to add a dummy NULL buffer in this case so that it does not mess with the indices of the vector?

@phymbert (Collaborator, Author) commented:
@ggerganov can you please pull and retry? I have applied the same logic as before: if the allocation fails, fall back to CPU only.

@slaren (Collaborator) commented Mar 21, 2024

@phymbert the logic is still wrong; it is asking Metal to map a buffer beyond its size. Please check the diff I posted above.

@ggerganov (Owner) commented:

It works with the patch. Will take an extra look at the PR tomorrow. Thank you all for helping out

@phymbert changed the title from "llama_model_loader: support multiple split GGUFs" to "llama_model_loader: support multiple split/shard GGUFs" on Mar 22, 2024
…t dest max len.

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
@ggerganov (Owner) left a comment

We can merge after slaren's approval

@phymbert (Collaborator, Author) commented:

We can merge after slaren's approval

Excellent, I am really proud to have contributed all the way up to llama.h! Thanks all for your help, guidance and co-authoring of this feature 🧑‍🤝‍🧑👩🏽‍💻

@phymbert (Collaborator, Author) commented:

@ggerganov Could we add this line to the Hot topics?

- support loading sharded model split using `gguf-split` cli #6187

@ggerganov (Owner) commented:

Yes, of course

@ngxson (Collaborator) left a comment

LGTM! Thanks for taking the time to implement this functionality.

@phymbert merged commit dba1af6 into master on Mar 22, 2024
58 checks passed
@phymbert deleted the hp/split/load-model branch on March 22, 2024 18:00
// check if dest ends with postfix
int size_prefix = str_split_path.size() - str_postfix.size();
if (size_prefix > 0 && str_split_path.find(str_postfix, size_prefix) != std::string::npos) {
snprintf(dest, std::min((size_t) size_prefix, maxlen), "%s", split_path);
@phymbert (Collaborator, Author) commented Mar 23, 2024

@ngxson It must be snprintf(dest, std::min((size_t) size_prefix + 1, maxlen), "%s", split_path);
I am fixing it in #6192
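
The reason for the + 1: snprintf's size argument counts the terminating null byte, so passing size_prefix copies only size_prefix - 1 characters of the prefix. A standalone sketch of the difference, using an illustrative file name rather than the actual llama.cpp code:

```cpp
#include <cstdio>
#include <string>

int main() {
    const char * split_path = "ggml-model-q4_0-split-00001-of-00006.gguf";
    const std::string postfix = "-00001-of-00006.gguf";
    // length of the prefix "ggml-model-q4_0-split" (21 characters)
    const size_t size_prefix = std::string(split_path).size() - postfix.size();

    char dest[128];
    // wrong: snprintf writes at most size - 1 chars plus '\0', so the last
    // prefix character is truncated ("ggml-model-q4_0-spli")
    snprintf(dest, size_prefix, "%s", split_path);
    printf("without +1: %s\n", dest);

    // right: reserve one extra byte for the terminating '\0'
    snprintf(dest, size_prefix + 1, "%s", split_path);
    printf("with    +1: %s\n", dest);
    return 0;
}
```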

@ngxson (Collaborator) replied:

Yeah, sorry, I was in quite a rush this time; I will be more careful. Thanks!

@arki05 mentioned this pull request on Mar 23, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <slarengh@gmail.com>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: PR feedbacks:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these optional, switch some layer creation tensor optional

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama_model_loader: fail if backend cannot allocate buffer

* fix mmap buffer management

* llama_model_loader: map file to backend buffer if the allocation succeeds only

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast

* llama_model_loader: fail if any of backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <slarengh@gmail.com>

* fix loop over pointer

Co-authored-by: slaren <slarengh@gmail.com>

* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting

* llama_model_loader: ensure mappings vector has the expected size

* llama_model_loader:  use at instead of operator[] if this should never add to the map.

* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.

* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer

* llama_model_loader: fix map -> unordered map

* llama_split_prefix: use a clearer version, not pass split path len but dest max len.

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model shard in hot topic

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <slarengh@gmail.com>

* fix llama_split_prefix

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024