
split: allow --split-max-size option #6343

Merged
merged 7 commits on Mar 29, 2024

Conversation

Collaborator

@ngxson ngxson commented Mar 27, 2024

Closes #6259

I ended up rewriting the split_strategy class. How it works now (a rough sketch of this flow follows the list):

  1. The constructor produces all the ctx_out used by the output files (one ctx_out per split). These ctx_out are saved into a std::vector<struct gguf_context *> ctx_outs. This step only produces a split "plan", not the actual files.
  2. split_strategy.print_info() can be used to print out the "plan" and details for each split (n_tensors, size).
  3. Finally, when you're happy with the split "plan", call split_strategy.write() to actually write out the output files.
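The sketch below is only an illustration of this plan/print/write flow, not the actual examples/gguf-split code: it uses the gguf API from ggml.h, splits on a maximum tensor count only to keep it short, and the constructor parameters and helper names are simplified.

```cpp
// Illustrative sketch (not the merged gguf-split implementation).
#include "ggml.h"

#include <cstdio>
#include <vector>

struct split_strategy_sketch {
    std::vector<struct gguf_context *> ctx_outs; // one context per planned split

    // step 1: build the "plan" - distribute the input tensors over output contexts
    split_strategy_sketch(struct gguf_context * ctx_in, struct ggml_context * ctx_meta, int max_tensors) {
        struct gguf_context * cur = nullptr;
        const int n_tensors = gguf_get_n_tensors(ctx_in);
        for (int i = 0; i < n_tensors; i++) {
            if (cur == nullptr || gguf_get_n_tensors(cur) >= max_tensors) {
                cur = gguf_init_empty();
                gguf_set_kv(cur, ctx_in); // copy the input KV metadata into the new split
                ctx_outs.push_back(cur);
            }
            const char * name = gguf_get_tensor_name(ctx_in, i);
            gguf_add_tensor(cur, ggml_get_tensor(ctx_meta, name));
        }
    }

    // step 2: print the "plan" without writing anything
    void print_info() const {
        for (size_t i = 0; i < ctx_outs.size(); i++) {
            printf("split %zu: n_tensors = %d, meta size = %zu bytes\n",
                   i, (int) gguf_get_n_tensors(ctx_outs[i]), gguf_get_meta_size(ctx_outs[i]));
        }
    }

    // step 3: write() would dump each ctx_out's metadata and then copy the tensor
    // data from the input file into the corresponding output file (omitted here)

    ~split_strategy_sketch() {
        for (auto * ctx : ctx_outs) {
            gguf_free(ctx);
        }
    }
};
```

Keeping the plan as plain gguf contexts is what makes the dry-run / print_info() step cheap: nothing touches the disk until write() is called.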

Collaborator

@phymbert phymbert left a comment


Great that you're looking into this. I made some attempts but gave up. Please reuse the test script I prepared:

https://github.com/phymbert/llama.cpp/blob/hp/split/max-size/examples/gguf-split/tests.sh

@ngxson
Collaborator Author

ngxson commented Mar 27, 2024

@phymbert Thanks for the info. Yeah, I did manage to get it working, but didn't test it very carefully. I'll use the script that you provided.

Meanwhile, would you mind testing the version that I've pushed?

@ngxson
Collaborator Author

ngxson commented Mar 27, 2024

I can confirm that my version works with your test script @phymbert (thanks again for that) - except for test cases 4 and 5, which require the --no-tensor-in-metadata option that this version doesn't have, so I skipped them manually. Still, it's nice to have all these test cases from the beginning.

@ngxson ngxson marked this pull request as ready for review March 27, 2024 14:51
@phymbert
Collaborator

Ah yeah, I forgot about this param. Just remove it. It's not necessary for now, and it's not possible to have no tensors in a gguf AFAIK.

@ngxson ngxson requested review from slaren and phymbert March 27, 2024 14:52
@ngxson
Collaborator Author

ngxson commented Mar 27, 2024

@phymbert With this PR, it may be possible to have a split that does not contain any tensors. I encountered this case (as a bug) when I left out the i_tensor > 0 condition inside should_split. Though, I don't know whether llama_model_loader can handle this case.

fprintf(stderr, "\033[3Ddone\n");
void copy_file_to_file(std::ifstream & f_in, std::ofstream & f_out, const size_t in_offset, const size_t len) {
// TODO: detect OS and use copy_file_range() here for better performance
if (read_buf.size() < len) {
Collaborator Author

@Artefact2 With this refactoring, it will be trivial to add copy_file_range() as you suggested. The only thing I'm not sure about is how to detect whether copy_file_range() can be used (i.e. based on OS or something else?). Do you have any idea on that? Thanks.
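One possible answer to the detection question, sketched rather than taken from the PR: copy_file_range() is Linux-specific (kernel >= 4.5, glibc >= 2.27), so the usual approach is a compile-time OS check with a portable read/write fallback. The sketch below works on raw POSIX file descriptors, not the std::ifstream/std::ofstream used by copy_file_to_file, so it is not a drop-in patch.

```cpp
#if defined(__linux__)
#define _GNU_SOURCE 1 // exposes copy_file_range() in <unistd.h> on glibc >= 2.27
#endif

#include <unistd.h>   // read, write, copy_file_range (POSIX; not available with MSVC)

#include <vector>

static bool copy_range_fd(int fd_in, int fd_out, size_t len) {
#if defined(__linux__)
    // fast path: in-kernel copy, no user-space buffer
    while (len > 0) {
        ssize_t n = copy_file_range(fd_in, nullptr, fd_out, nullptr, len, 0);
        if (n <= 0) {
            break; // e.g. old kernel or cross-filesystem copy: fall through to the fallback
        }
        len -= (size_t) n;
    }
    if (len == 0) {
        return true;
    }
#endif
    // portable fallback: copy through a user-space buffer
    std::vector<char> buf(1u << 20);
    while (len > 0) {
        const size_t  chunk  = len < buf.size() ? len : buf.size();
        const ssize_t n_read = read(fd_in, buf.data(), chunk);
        if (n_read <= 0 || write(fd_out, buf.data(), (size_t) n_read) != n_read) {
            return false;
        }
        len -= (size_t) n_read;
    }
    return true;
}
```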

@phymbert
Collaborator

It triggers a malloc of 0 size in ggml, with a warning.

@ngxson
Collaborator Author

ngxson commented Mar 27, 2024

It triggers a malloc of 0 size in ggml, with a warning.

Oh OK, I see. That's because we allocate one device buffer per file, so a file with no tensors will have a 0-size buffer. A quick hack is to add a dummy tensor with a 1-byte element, but that's not a good solution, so let's consider that later on.
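For illustration, here is roughly what the i_tensor > 0 guard mentioned earlier looks like; the signature and option handling are hypothetical, not necessarily the exact code merged in this PR. The guard ensures a new split is only started once the current one already holds at least one tensor, so no shard (and no device buffer) ends up empty.

```cpp
#include <cstddef>

// Hypothetical sketch of the zero-tensor guard described in the comments above.
static bool should_split(int i_tensor, size_t cur_size, size_t next_tensor_size,
                         int max_tensors, size_t max_size) {
    if (i_tensor == 0) {
        return false; // never start a new split before the current one holds a tensor
    }
    if (max_tensors > 0 && i_tensor >= max_tensors) {
        return true;  // --split-max-tensors reached
    }
    if (max_size > 0 && cur_size + next_tensor_size > max_size) {
        return true;  // adding this tensor would exceed --split-max-size
    }
    return false;
}
```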

@ngxson ngxson merged commit f7fc5f6 into ggerganov:master Mar 29, 2024
56 of 57 checks passed
@phymbert
Collaborator

Was it urgent to merge? Unfortunately you did not add the test file. I think it is important to add it here and in the CI to show the complete process between gguf-split and llama.cpp, as it is easy to break, which we did here: 764c7af

@ngxson
Collaborator Author

ngxson commented Mar 30, 2024

I merged it because it had been open for review for 3 days. I agree that the test is needed, but since I initially had no intention of adding tests in this PR, I thought we would do it in another PR. Would you mind opening a new PR @phymbert? Thanks.

@phymbert
Collaborator

You can always request a review; especially here, since I wrote the gguf-split example module, it feels normal to review changes on the matter.

I think tests are part of the development and must be done in the same PR, by the author; this is standard development best practice. I do not plan to test other people's code BTW.

The shard feature is widely used now on HF and we need to carefully test it and document it, taking into account some previous issues we had:

Is it complicated to add tests.sh?

@ngxson
Collaborator Author

ngxson commented Mar 30, 2024

Sorry, I think there's some misunderstanding here:

  1. I added you as a reviewer for this PR 3 days ago. Since we had exchanged about the test file earlier, I thought that you had looked at the code and had no further comments, so I just merged it - that was my fault for assuming that. Next time I'll explicitly ask you via a comment if needed.
  2. The PR does not have tests because I ran the test locally. I have no idea whether the test file you gave me is the final version or not. I also didn't say in the description or the title that there would be a test, so it's a to-do for another PR.
  3. No blame here; we're working hard, and making errors (in terms of communication) is unavoidable. I hope that next time we can be clearer about the scope of each PR.

I'll open a new PR today when I finish my real-life work. Will let you know when it's done.

@phymbert
Collaborator

Thanks for the clarification and your effort, of course; I understand, thank you. FYI I am uploading grok-1 to ggml-org in Q4_0 and I will open a demo discussion to explain the whole process; I will include the --split-max-size option you introduced here.

@phymbert
Collaborator

It would also be nice to update the README.md.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* split by max size

* clean up arg parse

* split: ok

* add dry run option

* error on 0 tensors

* be positive

* remove next_metadata_size
@phymbert
Collaborator

phymbert commented Apr 2, 2024

@ngxson it looks like split.no is not present in the first file. Take a look at the HF model visualizer: https://huggingface.co/ggml-org/models?show_tensors=grok-1%2Fgrok-1-q4_0-00001-of-00009.gguf

@ngxson
Collaborator Author

ngxson commented Apr 2, 2024

I think there's a UI bug on Hugging Face. I downloaded the file and used gguf-py to inspect it, and I can confirm that split.no is there: https://colab.research.google.com/drive/1nvIVinbUTd8MeFSPz0ZvP-UhTW9MDPmb?usp=sharing

(My internet is quite laggy, so some notebook cells ran twice, but it doesn't change the result anyway)

[screenshot: gguf-py output showing split.no in the metadata]

Edit: the gguf viewer on Hugging Face is quite weird. I thought the metadata table was first decoded on the backend and sent to the frontend, but it turns out they download the whole model into the browser, which caused my browser to crash.

@phymbert
Collaborator

phymbert commented Apr 2, 2024

I think there's a UI bug on Hugging Face. I downloaded the file and used gguf-py to inspect it, and I can confirm that split.no is there

Ok, thanks for double-checking. The code is straightforward indeed:

gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_NO, i_split);

Wait and see
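For completeness, a minimal sketch of reading that key back with the same gguf C API from ggml.h; it is the inverse of the line above and gives the same answer as the gguf-py check earlier. Error handling is kept minimal.

```cpp
#include "ggml.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <shard.gguf>\n", argv[0]);
        return 1;
    }
    // metadata-only load: no tensor data allocation, no ggml context needed
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }
    const int key_id = gguf_find_key(ctx, "split.no");
    if (key_id < 0) {
        printf("split.no: not found\n");
    } else {
        printf("split.no = %u\n", (unsigned) gguf_get_val_u16(ctx, key_id));
    }
    gguf_free(ctx);
    return 0;
}
```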

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
@ggerganov
Owner

But it turns out they download the whole model into the browser, which caused my browser to crash.

AFAIK the browser just fetches the GGUF headers - not the entire file

@julien-c

julien-c commented Apr 3, 2024

I think there's a UI bug on Hugging Face. I downloaded the file and used gguf-py to inspect it, and I can confirm that split.no is there

can you take a look at this @mishig25?

@ngxson
Collaborator Author

ngxson commented Apr 3, 2024

AFAIK the browser just fetches the GGUF headers - not the entire file

It seems to be the case for Firefox - it only downloads 2 MB of the file. But Firefox does not have the required API to decode the file: "Your browser does not support ArrayBuffer.resize. Find more information here."

I tried Chrome; it can read the file, but it does not stop downloading:

[screenshot: Chrome still downloading the file]

@julien-c @mishig25 If you need more information for debugging, please let me know

@julien-c

julien-c commented Apr 3, 2024

@ngxson could I trouble you to open an issue in https://github.com/huggingface/huggingface.js with your comment? 🙏

We'll check ASAP.

@ggerganov
Owner

@phymbert and @ngxson

I remember we discussed somewhere making a split where the first shard is very small and contains primarily the metadata, so that it can be downloaded quickly and the download of the other shards can start without waiting for the first one to finish. I can't find an issue for this - it might be a good idea to track it, and it might be related to: huggingface/huggingface.js#601 (comment)

We can add extra metadata in the first file that describes all tensors in the shards, for example.
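A hypothetical sketch of that idea, just to make it concrete; the key names are invented here and no such format was agreed on in this thread. The first shard would carry an index of all tensors and the shard each one lives in, so a client could plan downloads after fetching only the first (tiny) file.

```cpp
#include "ggml.h"

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical: write a tensor index into the first shard's metadata.
static void add_shard_index(struct gguf_context * ctx_first,
                            const std::vector<std::string> & tensor_names,
                            const std::vector<uint16_t>    & tensor_shard_no) {
    // flat list of all tensor names across every shard
    std::vector<const char *> names;
    for (const auto & n : tensor_names) {
        names.push_back(n.c_str());
    }
    gguf_set_arr_str (ctx_first, "split.index.tensor_names", names.data(), (int) names.size());
    // parallel array: which shard each tensor ends up in
    gguf_set_arr_data(ctx_first, "split.index.tensor_split", GGUF_TYPE_UINT16,
                      tensor_shard_no.data(), (int) tensor_shard_no.size());
}
```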

@phymbert
Collaborator

phymbert commented Apr 3, 2024

I remember we discussed somewhere making a split where the first shard is very small and contains primarily the metadata, so that it can be downloaded quickly and the download of the other shards can start without waiting for the first one to finish.

yes here:

But I did not create an issue because ggml_malloc complains with "WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_malloc!" in this case.

EDIT: issue created:

@phymbert phymbert mentioned this pull request Apr 13, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024