Control vectors in server #6289

Open
wants to merge 17 commits into base: master
Conversation

@trollkotze trollkotze commented Mar 24, 2024

This PR implements options for loading control vectors in server.cpp at start-up, matching what already exists in main.cpp (only adding CLI options here for what was already implemented), as well as hot-swapping control vectors at runtime in server.cpp.

The PR also includes a fix for a memory allocation/freeing bug in llama_load_control_vector. Credit for that goes to anon@example.com for his contributions of commits 181879f, 9914014, and d0304f7.

New CLI options

In addition to the same CLI options that exist for main.cpp, server.cpp gains these options for specifying named file or directory options from which control vectors can be loaded at runtime:

  1. option for specific gguf file:
    --control-vector-option name /path/to/vector.gguf
  2. option for specific directory:
    --control-vector-option name path/to/dir
    --control-vector-option name path/to/dir/

The server will decide based on the specified path string whether it is interpreted as a vector file or a directory. A slash will be automatically added to the name of a directory option.
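The file-vs-directory rule above can be sketched in Python (hypothetical helper names and rule details; the actual logic lives in server.cpp):

```python
# Sketch of the option-registration rule described above. Assumption: a path
# ending in ".gguf" is a file option; anything else is a directory option.
def register_option(options: dict, name: str, path: str) -> str:
    """Register a named control-vector option; return the name as listed."""
    if path.endswith(".gguf"):
        # A specific .gguf file: the option name is used as-is.
        options[name] = path
        return name
    # A directory: a trailing slash is added to the listed option name,
    # and the stored base path is normalized to end with a slash.
    options[name + "/"] = path.rstrip("/") + "/"
    return name + "/"

opts = {}
print(register_option(opts, "victor", "path/to/some/vector.gguf"))  # victor
print(register_option(opts, "vectors", "path/to/dir"))              # vectors/
```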

New endpoints for the server

  1. GET /control-vector-options will return a JSON array of strings.
    For example, --control-vector-option vectors path/to/dir --control-vector-option victor path/to/some/vector.gguf will result in the listed options: ["vectors/", "victor"], meaning the user would only be allowed to load vectors/anything/goes/here (except path traversal) or victor.
  2. GET /control-vectors will return the currently applied control vector combination and layer range parameters (the latter can only be specified on load for now, or will default to full range) like so:
{
  "vectors": [
    {
      "fname": "vectors/olga",
      "strength": 2.0
    },
    {
      "fname": "vectors/some/levels/deeper/oleg",
      "strength": -3.0
    },
    {
      "fname": "victor",
      "strength": 1.0
    }
  ],
  "layer_start": 1,
  "layer_end": 80
}
  3. POST /control-vectors accepts an array of vector parameters in the same format, which will replace the currently applied control vector:
{
  "vectors": [
    {
      "fname": "miqu/happy",
      "strength": 3.1415
    },
    {
      "fname": "miqu/high-on-cocaine.gguf",
      "strength": -2
    }
  ]
}

Note that the .gguf here is optional for vectors inside allowed directories. It will be added to the file name to be loaded if not included. The assumption is that all control vector files end in .gguf.
Path traversal is guarded against to the extent I am aware of.
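The .gguf-appending and traversal-guard rules described above might look roughly like this in Python (a sketch with invented names and a normpath-based check, not the server.cpp code):

```python
import posixpath

def resolve_vector(options: dict, requested: str) -> str:
    """Map a requested vector name to a real file path, or raise."""
    # Exact match: option registered for a single .gguf file.
    if requested in options:
        return options[requested]
    # Directory options end in "/": the request must start with "<name>/".
    for name, base in options.items():
        if name.endswith("/") and requested.startswith(name):
            rel = requested[len(name):]
            if not rel.endswith(".gguf"):
                rel += ".gguf"  # .gguf is assumed and appended if missing
            # Guard against path traversal: normalize, then reject any
            # path that escapes the registered base directory.
            joined = posixpath.normpath(posixpath.join(base, rel))
            if not joined.startswith(base):
                raise ValueError("path traversal rejected: " + requested)
            return joined
    raise ValueError("unknown control-vector option: " + requested)
```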

The original message below here is outdated and just left for historical context:

I want to add control vector support to the server example. However, I know quite little about C++, and something isn't quite right yet that I can't wrap my head around. So I will need some handholding here.

What is implemented: command line parameters just like for main

--control-vector <fname>
--control-vector-scaled <fname> <strength>
--control-vector-layer-range <n_start> <n_end>

This loads the specified control vectors during the initial model load, and it works.

What doesn't work and I don't understand why: applying new vectors later

It would be nice if one could switch out control vectors at runtime.
So I added http endpoints for that:
GET /control-vectors will return the currently applied control vector source and layer range parameters like so:

{
  "vectors": [
    {
      "fname": "vectors/miqu/happy.gguf",
      "strength": 2.0
    },
    {
      "fname": "vectors/miqu/Sentient.gguf",
      "strength": 0.0
    }
  ],
  "layer_start": 1,
  "layer_end": 80
}

POST /control-vectors accepts an array of vector parameters in the same format, which should replace the currently applied control vector:

{
  "vectors": [
    {
      "fname": "path/to/vector.gguf",
      "strength": 3.1415
    },
    {
      "fname": "path/to/another_vector.gguf",
      "strength": -2
    }
  ]
}

and will then return the same output as the GET route.
This does not work. The output becomes incoherent. And after a while I get a segmentation fault.
I tried to take the range parameters from the existing gpt_params. But for some reason that is always (-1, -1), so I just use the full layer range (same as is done in llama_init_from_gpt_params - I don't know why that function doesn't save the layer range as it is supposed to).

Edit: About the gpt_params, the reason was that I used the wrong "params". There's a "params" variable in main, and then the server struct saves a copy of that. One should probably use that copy then.
And about the incoherent behaviour: It seems when calling llama_control_vector_init on every new vector load again, it does work, and the new control vector shows its behaviour. But one has to remove assert guards for that, and surely there's a reason for those. And it segfaults much faster when I do that.

So I'm removing my clumsy attempt at that live vector swap functionality again. But I hope someone might be able to figure out how to do that, because that would be really nice to have.

@trollkotze trollkotze marked this pull request as draft March 24, 2024 21:54
Collaborator

@phymbert phymbert left a comment


I am happy to see control vectors implemented on the server. Thank you for that. But let's keep it simple: no need for the GET and POST /control-vectors routes at runtime.
Also, please implement the control_vectors.feature in the server test framework within the PR.
Then, we need to assess the impact of control vectors on the server performances.

@ngxson
Collaborator

ngxson commented Mar 25, 2024

I haven't had much time to look into the details, but can you try:

  1. Not calling llama_control_vector_init in POST /control-vectors? Because this function allocates device buffers (i.e. GPU memory), it should be called only once (when the program starts)
  2. llama_control_vector_apply does not take into account scale or multiple vectors. Maybe for this first version, we can just replace the current vector; later on we will see how to merge two vectors and control their scales at runtime
  3. llama_control_vector_load should be called when the program starts, because it reads vectors from disk and saves them to RAM. It should not be called inside POST /control-vectors

@trollkotze
Author

trollkotze commented Mar 25, 2024

@ngxson

I haven't had much time to look into the details, but can you try:

1. **Not** calling `llama_control_vector_init` in `POST /control-vectors`? Because this function allocates device buffers (i.e. GPU memory), it should be called only once (when the program starts)

Yeah, that's what I initially tried. Originally the llama_control_vector_init was guarded by asserts, checking that its buffers and tensors are empty to make sure it doesn't get initialized twice. But after loading a control vector that way (without calling init again) the output does not reflect the behaviour of the control vector. Instead, when loading a vector that way it seems as if no control vector is applied anymore, or when doing it again and again, trying different vector combinations, it just becomes increasingly incoherent, with no recognizable relation to the normal control vector behaviour.

So it seems something has to be cleaned up / reinitialized when wanting to apply a different control vector later. But I don't know what exactly. I just don't understand all that lower-level stuff that's going on there. So I simply tried this brutish method in my last commit to see what happens. Removing the assert guards and forcefully calling llama_control_vector_init a second time does seem to show the correct behaviour of the newly applied control vector, but segfaults very soon.

Loading control vectors with the guards in place (no repeated call of llama_control_vector_init) also led to segfaults, but much later down the line. However, as mentioned, with no recognizable relation to the normal control vector behaviour.

2. `llama_control_vector_apply` does not take into account scale or multiple vectors. Maybe for this first version, we can just replace the current vector; later on we will see how to merge two vectors and control their scales at runtime

llama_control_vector_apply applies the combined control vector that has been assembled by llama_control_vector_load from an arbitrary number of vectors with coefficients. This functionality was not added by me, but in the earlier PR from @vgel. And it works just fine.
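That assembly step can be illustrated with a toy Python sketch of the strength-weighted sum (illustrative only; the real llama_control_vector_load operates on gguf tensors, and the names here are invented):

```python
def combine_control_vectors(vectors, n_embd):
    """Combine several (per-layer data, strength) pairs into one vector.

    Each entry maps layer index -> list of n_embd floats; the combined
    result is the strength-weighted sum per layer, analogous to what is
    assembled before a single apply call.
    """
    combined = {}
    for per_layer, strength in vectors:
        for layer, data in per_layer.items():
            acc = combined.setdefault(layer, [0.0] * n_embd)
            for i in range(n_embd):
                acc[i] += strength * data[i]
    return combined

happy = {1: [1.0, 0.0]}
sad   = {1: [0.0, 2.0]}
print(combine_control_vectors([(happy, 2.0), (sad, -0.5)], 2))  # {1: [2.0, -1.0]}
```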

3. `llama_control_vector_load` should be called when the program starts, because it reads vectors from disk and saves them to RAM. It should not be called inside `POST /control-vectors`

But swapping out control vectors live and being able to change them akin to samplers was my main motivation.
It seems to work quite fast. Only problem is that it's broken. :D

@trollkotze
Author

trollkotze commented Mar 25, 2024

@phymbert

Also, please implement the control_vectors.feature in the server test framework within the PR. Then, we need to assess the impact of control vectors on the server performances.

Sorry, I have no idea how that works, and when I try to install the python requirements I get errors thrown at me:

...
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1381, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1354, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1304, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1381, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1354, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1325, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 929, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 994, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/tmp/pip-build-env-49api5t8/overlay/lib/python3.12/site-packages/setuptools/__init__.py", line 10, in <module>
    import distutils.core
ModuleNotFoundError: No module named 'distutils'

trollkotze@schrotthaufen:~/llama/llama.cpp/examples/server/tests$ pip install distutils
ERROR: Could not find a version that satisfies the requirement distutils (from versions: none)
ERROR: No matching distribution found for distutils

Maybe you could take over from here? I'm afraid of snakes.

@trollkotze trollkotze marked this pull request as ready for review March 25, 2024 16:39
@trollkotze
Author

trollkotze commented Mar 25, 2024

@phymbert I just noticed one peculiar behaviour: sometimes the control vector has no effect. I think that happens when a request comes in before the initialization of the model, context, and vector is completed. (The server does not respond in that case, but it affects its behaviour later on when everything is initialized and requests are served: the control vector specified through the command line argument then has no effect.)
Maybe this is a more general issue with some race condition that might also affect other behaviour, and then it would be out of scope for this PR. Not sure. But in any case it is outside the scope of my competence.

@phymbert
Collaborator

Maybe you could take over from here? I'm afraid of snakes.

No sorry, IMHO adding this test is straightforward: add a new step and new server params and copy a completion scenario.

For python, just create a python venv.

@trollkotze trollkotze closed this Mar 25, 2024
@trollkotze
Author

Reopening this PR again.
This PR without the initially attempted hot-reloading functionality is trivial. And I don't really care enough about it to follow up with additional requirements. So I closed it. I don't know about python venv stuff and how to use that test framework. And if someone says that it's straightforward then he should just do it.

But some anon suggested I should simply provide a good argument why this should be merged as is.
Here it is: There are plenty of other command-line parameters like this, which simply use the same logic as in main.cpp with all the logic that is already in place for that, handled in llama_init_from_gpt_params, like e.g. for Lora adapters and whatnot. There are no test cases for most of these.

I see no fundamental difference here for these control vector arguments and why they should require special testing in server.cpp when the functionality is the same in main.cpp.

So take it or leave it. I think it makes sense to just merge this.

@trollkotze trollkotze reopened this Mar 26, 2024
@trollkotze
Author

And about my original intention of adding hot-reloading for control vectors in server.cpp:
Some anon said the reason why my attempt did not work is a memory leak in llama_control_vector_load, and I think someone is already looking into that and maybe following up with a fix. And then maybe that will make it possible to actually add hot-reloading functionality for control vectors.
About the argument of not allowing access to arbitrary file paths for control vector loading on the server, one could add command line parameters to specify a set of named control vector options to choose from. That would be pretty cool.
But that's for another future PR then that hopefully someone will implement when the issues with llama_control_vector_load are fixed.

@phymbert phymbert added the demo Demonstrate some concept or idea, not intended to be merged label Mar 26, 2024
@phymbert
Collaborator

Reopening this PR again. This PR without the initially attempted hot-reloading functionality is trivial. And I don't really care enough about it to follow up with additional requirements. So I closed it. I don't know about python venv stuff and how to use that test framework. And if someone says that it's straightforward then he should just do it.

But some anon suggested I should simply provide a good argument why this should be merged as is. Here it is: There are plenty of other command-line parameters like this, which simply use the same logic as in main.cpp with all the logic that is already in place for that, handled in llama_init_from_gpt_params, like e.g. for Lora adapters and whatnot. There are no test cases for most of these.

I see no fundamental difference here for these control vector arguments and why they should require special testing in server.cpp when the functionality is the same in main.cpp.

So take it or leave it. I think it makes sense to just merge this.

I am sorry you feel this way; adding this feature is great, but we aim to provide a production-ready server, not a playground. A feature is nothing without a proper test implementation to maintain it. You cannot ask people to write tests for you. As you do not "care enough", I am adding the demo label in case other people want to bring this to master properly.

Collaborator

@slaren slaren left a comment


I don't think there is any harm in adding a command line option to the server that already exists everywhere else in llama.cpp.

Collaborator

@phymbert phymbert left a comment


@ggerganov Georgi, AFAIK there is no urgency to support control vectors in the server. I strongly believe that adding features, even a few lines of code, without testing them has an impact on the overall long-term software quality.
If we aim to develop an efficient LLM serving solution, testing is mandatory.

Either the test framework is not suitable and we should consider replacing it with something more appropriate for the community, or we should maintain it.

Also, adding a test scenario will help people understand how to configure the feature and what is expected.

If the author is not willing to write the test, I may have a look the day I need control vectors, although it's not a real passion to test other people's development.

I have added it for traceability as a good first issue:

@ggerganov
Owner

Yes, there is no rush to add this functionality to the server immediately. If you feel strongly about having proper tests before merging, we can do that. It would be a good opportunity for new people to make a contribution.

@phymbert
Collaborator

If you feel strongly about having proper tests before merging, we can do that - would be a good opportunity for new people to make a contribution

Thanks for your understanding. Let's discuss this again if someone really needs this in master one day. Meanwhile, the linked issue is open for contributions.

@trollkotze
Author

Okay. Here is all I've got:

Anon shared a patch here that fixes the memory allocation/freeing bugs in llama_control_vector_load. So based on that, I do care again, because I could include the hot-swap functionality again on top of that patch, and it works now without segfault.

To address the security concern about accessing arbitrary control vector files through that hot-swapping endpoint, it is now necessary to provide predefined options for control vectors that can be loaded:

  1. option for specific gguf file:
    --control-vector-option name /path/to/vector.gguf
  2. option for specific directory:
    --control-vector-option name path/to/dir
    --control-vector-option name path/to/dir/

The server will decide based on the specified path string whether it is interpreted as a vector file or a directory. A slash will be automatically added to the name of a directory option.
There is a new endpoint to query the available vector options: GET /control-vector-options will return a json array of strings.
So e.g. --control-vector-option vectors path/to/dir --control-vector-option victor path/to/some/vector.gguf will result in the listed options: ["vectors/", "victor"]. So then the user would be allowed to load vectors/anything/goes/here (except path traversal) or victor.
Path traversal is checked and guarded against.

Hot-reloading works the same as before, but without segfaults.
And now you specify the given name instead of the file path.
For given directory options you can load anything like givendirectoryname/path/to/file or givendirectoryname/path/to/file.gguf (which are equivalent here).
Path traversal is guarded against to the extent I am aware of.

That is all I'm willing and able to do and what I will be using myself. If someone likes to fix it up with all the necessary compliance bullshit to merge it into master that would of course be very nice. Don't forget to give credit to anon@example.com for his contributions of commits 181879f, 9914014, and d0304f7.

@AsbjornOlling
Contributor

I'm curious, how does this deal with parallel execution?

I assume that the applied control vector affects all "conversations" in a multi-user context?

@trollkotze
Author

I'm curious, how does this deal with parallel execution?

I assume that the applied control vector affects all "conversations" in a multi-user context?

Yes, it does. I guess it would be possible to decouple user settings for control vectors by adding control vector options to the completion endpoint instead of just this "global" control vector endpoint. It looks a bit complicated because of the async stuff with the request queue going on in handle_completions. The appropriate control vector would need to be set when the request is actually executed.

@ngxson
Copy link
Collaborator

ngxson commented Apr 3, 2024

I assume that the applied control vector affects all "conversations" in a multi-user context?

Control vectors are context-independent, because they are applied to the embedding vector between layers, not to the self-attention mechanism.
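That point can be illustrated with a toy sketch: the control direction is simply added to the hidden state (the per-token embedding) at each layer inside the configured range, with no dependence on attention state, which is why one loaded vector affects every conversation the server handles. Names and shapes here are illustrative, not the llama.cpp API:

```python
def apply_control_vector(hidden, control, layer, layer_start, layer_end):
    """Add the control direction for `layer` to the hidden state, if in range.

    hidden:  list of floats (the embedding vector for one token)
    control: dict mapping layer index -> list of floats (same length)
    """
    if layer_start <= layer <= layer_end and layer in control:
        return [h + c for h, c in zip(hidden, control[layer])]
    return hidden

print(apply_control_vector([1.0, 1.0], {3: [0.5, -0.5]}, 3, 1, 80))  # [1.5, 0.5]
```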

@vgel
Contributor

vgel commented Apr 5, 2024

@trollkotze Thanks for picking this up (and catching the memory bugs in my original implementation 😅) I've been on vacation so haven't been checking progress, but I'm experienced w/ Python, would be happy to coordinate to get this over the line for merging! (I'm @vgel on Discord if you use it.)

@trollkotze
Author

trollkotze commented Apr 7, 2024

@vgel Thanks! That would be great if you could help out here and do the python stuff or whatever else there is to do.
Credit for fixing the memory bug goes to Anon. I really still don't understand too much about these details.
I am quite busy at the moment and for the next weeks and am really not well-versed in the llama.cpp code, and even less in python, conda environments and the purpose of these testing frameworks, so I would be most happy if someone could just pick it up from here, either by adding to my PR or by just copying and reworking it, and changing whatever is necessary to get it merged.
I just added you as collaborator in my fork. Feel free to do whatever you want there. If anyone else is interested, let me know. But of course it might be more convenient to just make your own fork and rework it there, and then I don't have to care about rights management in GitHub.
I also added you in Discord (Hotzenplotz), but I don't really use it. Usually GitHub notifications will reach me earlier than that.
There is also another newer branch "control-vector-server-unclean" in my llama.cpp forked repo where I did some other stuff, I forgot exactly what atm. I think mainly just loading all the control vector files in a directory and showing them listed individually under options when specifying a directory instead of a file for a control vector parameter, so that the user can see the files listed in the options instead of having to know what files under that dir exist. But that is only for Linux and I don't have Windows. Alternatively, of course, one could simply add a long list of all the individual control vector files as options that the user can then see, as is already possible with the code in this cleaner branch here.

@trollkotze
Author

@phymbert I also invited you to my fork as collaborator in case you feel motivated and want to fiddle with it there.
But of course, as I said, whoever is interested, may feel free to just copy the code, fork it again, make a new PR. Whatever might be the more convenient way to do it.
