
server: docs: --threads and --threads, --ubatch-size, --log-disable #6254

Merged
phymbert merged 1 commit into master from hp/server/doc on Mar 23, 2024

Conversation

phymbert (Collaborator) commented Mar 23, 2024

Update server/README.md with the actual parameters supported.

```diff
@@ -26,7 +26,8 @@ The project is under active development, and we are [looking for feedback and co
 - `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
 - `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
 - `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
-- `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `512`.
+- `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `2048`.
+- `-ub N`, `--ubatch-size N`: physical maximum batch size. Default: `512`.
```
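To illustrate how the two values interact, here is a minimal sketch, not code from this PR or from llama.cpp, of how a logical batch of `n_batch` tokens can be fed to the backend in physical chunks of at most `n_ubatch` tokens. The function and variable names are illustrative only.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative only: process a logical batch of tokens in physical
// micro-batches of at most n_ubatch tokens each.
static void process_logical_batch(const std::vector<int> & tokens, int n_ubatch) {
    for (size_t i = 0; i < tokens.size(); i += n_ubatch) {
        const size_t n = std::min(tokens.size() - i, (size_t) n_ubatch);
        // a real implementation would build a llama_batch here and call llama_decode
        std::printf("physical batch: tokens [%zu, %zu)\n", i, i + n);
    }
}

int main() {
    const int n_batch  = 2048; // logical batch size  (--batch-size)
    const int n_ubatch = 512;  // physical batch size (--ubatch-size)

    std::vector<int> tokens(n_batch, 0); // dummy token ids
    process_logical_batch(tokens, n_ubatch); // prints 4 physical batches
}
```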
Collaborator

Should we also make it clear that ubatch should be large enough when used for embeddings?

phymbert (Collaborator, Author)

@slaren Could you please confirm the correct combination of `--ubatch-size` and `--batch-size` for a BERT model with `--embedding`?

For embeddings, `--ubatch-size` must be greater than `bert.context_length`, and `--batch-size` must equal `--ubatch-size`.

Collaborator

Yes, there is no advantage to increasing n_batch above n_ubatch for embedding models with pooling, because the entire batch must fit in a physical batch (i.e. n_ubatch). n_batch is always >= n_ubatch.

phymbert (Collaborator, Author)

@ngxson I think it is better to add this check in server.cpp. I will create an issue to track it, and we will implement it later on.
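For reference, a rough sketch of what such a check could look like. This is hypothetical, not the actual server.cpp implementation, and the function and parameter names are made up, but it follows the rule discussed above: for embeddings, keep the whole input within one physical batch and keep --batch-size equal to --ubatch-size.

```cpp
#include <cstdio>

// Hypothetical validation for the embedding case discussed above; not the
// actual server.cpp code, just an illustration of the check.
static void validate_embedding_batch_params(bool embedding, int n_batch, int n_ubatch, int n_ctx_train) {
    if (!embedding) {
        return;
    }
    if (n_ubatch < n_ctx_train) {
        std::fprintf(stderr,
            "warning: --ubatch-size (%d) is smaller than the model's training context (%d); "
            "long inputs cannot be embedded in a single physical batch\n",
            n_ubatch, n_ctx_train);
    }
    if (n_batch != n_ubatch) {
        std::fprintf(stderr,
            "note: with pooled embeddings there is no advantage to --batch-size (%d) != --ubatch-size (%d)\n",
            n_batch, n_ubatch);
    }
}

int main() {
    // defaults from the diff above; the embedding flag and training context are made up for the demo
    validate_embedding_batch_params(/*embedding=*/true, /*n_batch=*/2048, /*n_ubatch=*/512, /*n_ctx_train=*/512);
}
```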

ggerganov (Owner) commented Mar 23, 2024

The embeddings from multiple slots can go in a single batch. For example, with n_batch = 2048 and n_ubatch = 512 we can process 4 full slots in one go.

Collaborator

It could be implemented, but the batch splitting code does not take this into account. llama_decode will just fail if n_tokens > n_ubatch.
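To make the failure mode concrete, here is a hedged sketch of a guard around llama_decode, assuming a caller that already has a populated llama_batch, knows its configured n_ubatch, and builds against llama.h from llama.cpp; the helper itself is illustrative, not part of the library:

```cpp
#include <cstdio>

#include "llama.h"

// Illustrative guard only: with pooled embeddings the whole input has to fit
// into one physical batch, so llama_decode is expected to fail when the batch
// holds more than n_ubatch tokens.
static bool decode_checked(llama_context * ctx, const llama_batch & batch, int n_ubatch) {
    if (batch.n_tokens > n_ubatch) {
        std::fprintf(stderr, "batch of %d tokens exceeds n_ubatch = %d, llama_decode would fail\n",
                     batch.n_tokens, n_ubatch);
        return false;
    }
    return llama_decode(ctx, batch) == 0;
}
```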

ggerganov (Owner)

Ah yes sorry, ignore my comment.

phymbert merged commit 1997577 into master on Mar 23, 2024
23 checks passed
phymbert deleted the hp/server/doc branch on March 23, 2024 at 17:15
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024
Successfully merging this pull request may close these issues:

server: comment --threads option behavior
4 participants