Insights: ggerganov/llama.cpp
Overview
35 Releases published by 1 person
- b2717 (published Apr 24, 2024)
- b2724 (published Apr 24, 2024)
- b2727 (published Apr 25, 2024)
- b2728 (published Apr 25, 2024)
- b2729 (published Apr 25, 2024)
- b2730 (published Apr 25, 2024)
- b2731 (published Apr 25, 2024)
- b2734 (published Apr 25, 2024)
- b2735 (published Apr 25, 2024)
- b2736 (published Apr 25, 2024)
- b2737 (published Apr 25, 2024)
- b2740 (published Apr 26, 2024)
- b2746 (published Apr 26, 2024)
- b2747 (published Apr 26, 2024)
- b2748 (published Apr 26, 2024)
- b2749 (published Apr 26, 2024)
- b2750 (published Apr 27, 2024)
- b2751 (published Apr 27, 2024)
- b2753 (published Apr 28, 2024)
- b2754 (published Apr 28, 2024)
- b2755 (published Apr 29, 2024)
- b2756 (published Apr 29, 2024)
- b2757 (published Apr 29, 2024)
- b2760 (published Apr 29, 2024)
- b2761 (published Apr 29, 2024)
- b2763 (published Apr 29, 2024)
- b2764 (published Apr 29, 2024)
- b2766 (published Apr 30, 2024)
- b2767 (published Apr 30, 2024)
- b2768 (published Apr 30, 2024)
- b2769 (published Apr 30, 2024)
- b2771 (published Apr 30, 2024)
- b2772 (published Apr 30, 2024)
- b2773 (published Apr 30, 2024)
- b2774 (published Apr 30, 2024)
53 Pull requests merged by 26 people
- hardcode error codes on metal (#7010, merged Apr 30, 2024)
- metal : remove deprecated error code (#7008, merged Apr 30, 2024)
- log more info when metal fails (#6987, merged Apr 30, 2024)
- ggml : add Flash Attention (#5021, merged Apr 30, 2024)
- convert : use utf8 encoding (#7000, merged Apr 30, 2024)
- Improve usability of --model-url & related flags (#6930, merged Apr 29, 2024)
- Extending grammar integration tests (#6644, merged Apr 29, 2024)
- main : fix typo in comment in main.cpp (#6985, merged Apr 29, 2024)
- build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964, merged Apr 29, 2024)
- ci : tmp disable gguf-split (#6983, merged Apr 29, 2024)
- ggml : fix __MSC_VER -> _MSC_VER (#6977, merged Apr 29, 2024)
- llama : improve BPE pre-processing + LLaMA 3 and Deepseek support (#6920, merged Apr 29, 2024)
- use std::random_device{}() for default random seed (#6962, merged Apr 29, 2024)
- Fix conversion of some BERT embedding models (#6937, merged Apr 29, 2024)
- make : change GNU make default CXX from g++ to c++ (#6966, merged Apr 29, 2024)
- ci : add building in MSYS2 environments (Windows) (#6967, merged Apr 29, 2024)
- fix typo: LAMMAFILE -> LLAMAFILE (#6974, merged Apr 29, 2024)
- Fix more int overflow during quant (PPL/CUDA). (#6563, merged Apr 28, 2024)
- gguf : enforce that tensor names are unique (#6905, merged Apr 28, 2024)
- [SYCL] add device version in SYCL device list (#6959, merged Apr 28, 2024)
- nix: update flake.lock (#6952, merged Apr 28, 2024)
- Replace "alternative" boolean operator in conditional compilation directive (#6949, merged Apr 27, 2024)
- ci: server: tests python env on github container ubuntu latest / fix n_predict (#6935, merged Apr 27, 2024)
- Reset schedule earlier to allow overlap with ggml graph computation on device (#6933, merged Apr 26, 2024)
- `quantize`: add imatrix and dataset metadata in GGUF (#6658, merged Apr 26, 2024)
- add basic tensor data validation function (#6884, merged Apr 26, 2024)
- gguf : fix mismatch between alloc and free functions (#6929, merged Apr 26, 2024)
- llamafile : use 64-bit integers in sgemm (#6928, merged Apr 26, 2024)
- ci: server: fix python installation (#6925, merged Apr 26, 2024)
- server: stop generation at `n_ctx_train` if `n_predict` is not set (#6638, merged Apr 26, 2024)
- ci: server: fix python installation again (#6922, merged Apr 26, 2024)
- ci: server: fix python installation (#6918, merged Apr 26, 2024)
- ci: fix concurrency for pull_request_target (again) (#6917, merged Apr 26, 2024)
- bench: server add stop word for PHI-2 (#6916, merged Apr 26, 2024)
- add support for moondream vision language model (#6899, merged Apr 25, 2024)
- llama : synchronize before get/set session data (#6911, merged Apr 25, 2024)
- update model list (#6908, merged Apr 25, 2024)
- llama : check that all the tensor data is in the model file (#6885, merged Apr 25, 2024)
- ggml : fix redefinition of vaddvq_f32 for 32-bit ARM (#6906, merged Apr 25, 2024)
- clip : rename lerp function to avoid conflict (#6894, merged Apr 25, 2024)
- ggml : fix MIN / MAX macros (#6904, merged Apr 25, 2024)
- tests : minor bash stuff (#6902, merged Apr 25, 2024)
- Implement '--keep-split' to quantize model into several shards (#6688, merged Apr 25, 2024)
- README: add graphic for matrix multiplication (#6881, merged Apr 24, 2024)
- add llama_get_pooling_type function (#6862, merged Apr 24, 2024)
- Server front-end: do not apply Markdown formatting in code sections (#6850, merged Apr 24, 2024)
- Fix: Revert showing control tokens by default for server OpenAI Chat completions (#6860, merged Apr 24, 2024)
- Server: fix seed for multiple slots (#6835, merged Apr 24, 2024)
- ggml : move 32-bit arm compat in ggml-impl.h (#6865, merged Apr 24, 2024)
- Add phi 3 chat template (#6857, merged Apr 24, 2024)
- add support of codeqwen due to tokenizer (#6707, merged Apr 24, 2024)
- add phi3 support (#6852, merged Apr 24, 2024)
24 Pull requests opened by 22 people
- convert : fix set_vocab_sentencepiece (#6866, opened Apr 24, 2024)
- ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend (#6869, opened Apr 24, 2024)
- Clamp out of range values in K quantizer (#6888, opened Apr 25, 2024)
- AVX Q4_0 and Q8_0 sgemm (#6891, opened Apr 25, 2024)
- Fix CORS for /health endpoint (#6892, opened Apr 25, 2024)
- Properly set `clamp_qkv` value in OLMo conversion (#6910, opened Apr 25, 2024)
- Draft Idea... CPU Inference... This seems to perform better? (#6915, opened Apr 26, 2024)
- support MiniCPM-V-2 (#6919, opened Apr 26, 2024)
- fixed off by one error when context shifting in main.cpp example (#6921, opened Apr 26, 2024)
- main : don't print special tokens with --grammar (#6923, opened Apr 26, 2024)
- Fix clip build on windows + clang (#6934, opened Apr 26, 2024)
- perplexity: more statistics, added documentation (#6936, opened Apr 26, 2024)
- Implemented basic interface for llamacheck and link to weights, adapt… (#6940, opened Apr 27, 2024)
- Updated server_queue to delete tasks from queue when server is shutdown. Feature Request #6421 (#6941, opened Apr 27, 2024)
- Option to split during conversion (#6942, opened Apr 27, 2024)
- Server: add test for num slots, fails on master (#6950, opened Apr 27, 2024)
- move ndk code to a new library (#6951, opened Apr 27, 2024)
- server: avoid breaking KV cache when prompt >= n_ctx (#6958, opened Apr 28, 2024)
- llama3 custom regex split (#6965, opened Apr 28, 2024)
- Attempt at OpenElm (#6986, opened Apr 29, 2024)
- new tokenizer-verifier tool to check gguf tokenizer parameters (#6988, opened Apr 29, 2024)
- add chatglm3-6b model support [help wanted] (#6999, opened Apr 30, 2024)
- Fix flash attention for ROCm (#7011, opened Apr 30, 2024)
- Update Server's README with undocumented options for RoPE, YaRN, and KV cache quantization (#7013, opened Apr 30, 2024)
70 Issues closed by 23 people
- nix build fails on apple silicon (#7009, closed Apr 30, 2024)
- Current state Llama3 & Mixtral 8x22b conversion (#7001, closed Apr 30, 2024)
- Can't offload layers to GPU (#6261, closed Apr 30, 2024)
- llama : revisit using flash attention for prompt processing (a.k.a. prefil) + GPU implementation (#3365, closed Apr 30, 2024)
- quantize.exe Bug(s) --token-embedding-type / --output-tensor-type and - Docu? Advanced Usage Context ? (#6776, closed Apr 30, 2024)
- Custom fine-tuned DeepSeek coder model unable to be quantized to Fp16 (#5234, closed Apr 30, 2024)
- Problem connecting VSCode (Continue) to the server LlamaCpp (#5406, closed Apr 30, 2024)
- Gemma models quantized using llamacpp not working in lm studio (#5706, closed Apr 30, 2024)
- Add support for Vary-toy (#6054, closed Apr 30, 2024)
- How to Modify Hugging Face's Language Models? (#6057, closed Apr 30, 2024)
- GGML_ASSERT: ggml-quants.c:11615: besti1 >= 0 && besti2 >= 0 && best_shift != 0 (#6067, closed Apr 30, 2024)
- KeyError: ('torch.nn.modules.sparse', 'Embedding') (#6071, closed Apr 30, 2024)
- Metal kernel mv_f16_f32_l4 performance issue for long contexts, too many threads (#6089, closed Apr 30, 2024)
- -mu without -m is... tricky (#6887, closed Apr 29, 2024)
- BPE Tokenizer: Multiple newlines doesn't merge into a single token (#6809, closed Apr 29, 2024)
- Garbled output on Windows 11 Arm due to typo in ggml-impl.h file (#6976, closed Apr 29, 2024)
- Help: Batching the same request? (#6978, closed Apr 29, 2024)
- Command R Plus crashed on large context (~40K) with CUDA (#6948, closed Apr 29, 2024)
- gguf : enforce that tensor names are unique (#6836, closed Apr 28, 2024)
- support for openelm apple (#6960, closed Apr 28, 2024)
- [SYCL] fail to load llama.dll compiled by icx with -DBUILD_SHARED_LIBS=on flag on Windows (#6309, closed Apr 28, 2024)
- SYCL Hangs after ggml_backend_sycl_host_buffer_type (#6943, closed Apr 28, 2024)
- Does `"add_bos_token": false` in `tokenizer_config.json` cause no BOS to get output? (#6947, closed Apr 28, 2024)
- GPU NOT used during "normal generation" when ONE LAYER offloaded (But GPU used in prompt evaluation) (#3860, closed Apr 28, 2024)
- Save Chat History into New Prompts (#3985, closed Apr 28, 2024)
- Constrained decoding with BNF grammar fails to work with some tokens (#5599, closed Apr 28, 2024)
- tips for GPU op profile (#5865, closed Apr 28, 2024)
- Quantazation Questions - Odd bits (#6011, closed Apr 28, 2024)
- AVX512 support (#6024, closed Apr 28, 2024)
- Add Ascend NPU as a new backend (#6034, closed Apr 28, 2024)
- [SYCL] Failed when running llama.cpp on ARC770 (#6036, closed Apr 28, 2024)
- -DCMAKE_BUILD_TYPE=Debug Does not work! (#6049, closed Apr 28, 2024)
- MSVC Main exits immediately on model load (#6932, closed Apr 27, 2024)
- Centos9 compilation reports unsupported instruction `vpdpbusd' (#5316, closed Apr 27, 2024)
- Vocab problems converting QWEN 110b with convert.py (#6938, closed Apr 27, 2024)
- GGUF endianness cannot be determined from GGUF itself (#3957, closed Apr 27, 2024)
- Using convert.py with a fine tuned phi-2 (#5009, closed Apr 27, 2024)
- Error when converting safe tensors to gguf (#5559, closed Apr 27, 2024)
- segmentation fault with on mac M3 Pro with llama-7b.Q4_0.gguf (#5983, closed Apr 27, 2024)
- convert.py incompatible with most new models, including salesforce/codegen models (#6030, closed Apr 27, 2024)
- GGUF writer reverses array (tensor) dimensions (#6040, closed Apr 27, 2024)
- `quantize`: add imatrix and dataset metadata in GGUF (#6656, closed Apr 26, 2024)
- Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely? (#6841, closed Apr 26, 2024)
- main: crashing upon loading model since commit 83b72cb0 - Windows MSVC + CUDA (#6931, closed Apr 26, 2024)
- `llama_apply_lora_from_file_internal: bad file magic` when trying to load lora from `finetune` (#6926, closed Apr 26, 2024)
- server: index.html issue (#5788, closed Apr 26, 2024)
- Hope Support Emebdding Model Architectures: JinaBertModel (#6005, closed Apr 26, 2024)
- running clblas (opencl) slow speed on rk3588 (#6008, closed Apr 26, 2024)
- Error while building for hipBLAS on Windows 11 (#6514, closed Apr 25, 2024)
- How to fine tune LLaMA 3 in Google Colab (Pro)? (#6800, closed Apr 25, 2024)
- Truncated model files can cause llama.cpp to crash when using mmap (#6774, closed Apr 25, 2024)
- Re-quantization of a split gguf file produces "invalid split file" (#6548, closed Apr 25, 2024)
- Vulkan generated targets and shader organization (#5356, closed Apr 25, 2024)
- Low performance with Sycl Backend (#5480, closed Apr 25, 2024)
- if use MoE + Ternary, what's happen? (#5870, closed Apr 25, 2024)
- Fill in the token usage information in the usage object, and output it at the 'v1/embeddings' endpoint. (#5987, closed Apr 25, 2024)
- Design2Code (#5989, closed Apr 25, 2024)
- CUDA 12.4 released incompletely. (#5998, closed Apr 25, 2024)
- [Old models] Gibberish text at the end of chat/completion - server (#6847, closed Apr 25, 2024)
- Api llama_tokenize function problem (#6854, closed Apr 24, 2024)
- The model began to add </s > to each main and server response (#6872, closed Apr 24, 2024)
- OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens (#6859, closed Apr 24, 2024)
- server: recieving <|im_end|> in all responses of llama 3 (#6873, closed Apr 24, 2024)
- key file (#5972, closed Apr 24, 2024)
- how to set this chat_template in server? (#5974, closed Apr 24, 2024)
45 Issues opened by 45 people
- Llama 3 - Regression with apostrophes (#7006, opened Apr 30, 2024)
- server: self context extent broken (#7005, opened Apr 30, 2024)
- Pythonic way for quantization (#7003, opened Apr 30, 2024)
- LLamaCpp embedding returns an empty array for long text(While HuggingFaceEmbeddings works fine) (#6996, opened Apr 30, 2024)
- Segmentation fault on finetune with -ngl > 0, Debian 12 stable (#6994, opened Apr 30, 2024)
- About dialogue training mode (#6993, opened Apr 30, 2024)
- Intel(R) Arc(TM) A770M Setting as default instead of Iris Xe Graphics (#6991, opened Apr 29, 2024)
- main Segfault using cmake & -march=armv8.4a flag (#6990, opened Apr 29, 2024)
- [feature] Support inference on raw text input in main and server. (#6982, opened Apr 29, 2024)
- Tokenizers questions and ... proposals? (#6980, opened Apr 29, 2024)
- Fast request make the server stuck (#6979, opened Apr 29, 2024)
- Metal doesn't work in x86 macos (#6975, opened Apr 29, 2024)
- cudaDeviceReset() not working? (#6973, opened Apr 29, 2024)
- Windows cmake failed compile for rocm (#6972, opened Apr 29, 2024)
- llava-cli fails to build on M2 due to symbol(s) not found for architecture arm64 (#6963, opened Apr 28, 2024)
- llama_decode return logbits whose value are all nan (#6957, opened Apr 28, 2024)
- Regression/bug in Windows on ARM64 build between #7593639c and #4dba7e81 (#6954, opened Apr 28, 2024)
- xcrun: error: unable to find utility "metal", not a developer tool or in PATH in B2479 (#6946, opened Apr 27, 2024)
- failed to quantize: ios_base::clear: unspecified iostream_category error (#6945, opened Apr 27, 2024)
- llava-cli outputs gibberish (#6944, opened Apr 27, 2024)
- Help test CPUSet patch for Windows and Linux (#6927, opened Apr 26, 2024)
- Why does every answer end with <|img end|>? (#6924, opened Apr 26, 2024)
- Something might be wrong with either llama.cpp or the Llama 3 GGUFs (#6914, opened Apr 25, 2024)
- ggml : unified CMake build (#6913, opened Apr 25, 2024)
- main exe with deepseek-coder-1.3b-instruct.Q8_0.gguf not stopping correctly (#6912, opened Apr 25, 2024)
- Experiencing 2-3 GB GPU memory use increase compared to llama.cpp version a few weeks ago (#6909, opened Apr 25, 2024)
- ggml.c:2284:43: error: use of undeclared identifier 'cpu_set_t' (#6907, opened Apr 25, 2024)
- server: phi-3 end token not handled? (#6903, opened Apr 25, 2024)
- offload_kqv ONLY supported by python version? (#6900, opened Apr 25, 2024)
- output from server service is not proper and there are many duplicate words (#6895, opened Apr 25, 2024)
- Fix CORS in `/health` endpoint (#6893, opened Apr 25, 2024)
- Add cmake option to build without CUDA VMM (#6889, opened Apr 25, 2024)
- Error Building llama.cpp on Intel MacBook Pro with Metal (#6886, opened Apr 24, 2024)
- [Performance] Llava-cli offloading image encoding to cuda (#6883, opened Apr 24, 2024)
- Generate control vector using llama.cpp (#6880, opened Apr 24, 2024)
- Add support to ArcticForCausalLM (#6877, opened Apr 24, 2024)
- Why Ollama is using VRAM Only insted of VRAM + RAM? (#6876, opened Apr 24, 2024)
- Getting "Bad CPU type in executable" on macos-x64 build (#6875, opened Apr 24, 2024)
- Vulkan: possible NaN propagation on llama-3 8B (more testing required) (#6874, opened Apr 24, 2024)
- Support for OpenELM of Apple (#6868, opened Apr 24, 2024)
- Support for Functionary-v2 chat template (#6867, opened Apr 24, 2024)
- Implement 4-bit quantized KV Cache for faster performance and to enable longer context (#6863, opened Apr 24, 2024)
- crash on llama_new_context_with_model: failed assertion `Buffer Validation (#6861, opened Apr 24, 2024)
102 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- Introduction of CUDA Graphs to LLama.cpp (#6766, commented on Apr 30, 2024; 44 new comments)
- Support for Phi-3 models (#6849, commented on Apr 30, 2024; 27 new comments)
- added implementation of DRY sampler (#6839, commented on Apr 29, 2024; 25 new comments)
- CPUSet support for Windows and Linux (#6832, commented on Apr 29, 2024; 22 new comments)
- ggml : add RPC backend (#6829, commented on Apr 30, 2024; 21 new comments)
- Server: enable lookup decoding (#6828, commented on Apr 29, 2024; 16 new comments)
- grammars: x{min,max} repetition operator (#6640, commented on Apr 30, 2024; 14 new comments)
- `grammars`: cache decoded token codepoints & early exit in candidates rejection (faster sampling) (#6811, commented on Apr 30, 2024; 10 new comments)
- Custom quantization schemes (#6844, commented on Apr 26, 2024; 8 new comments)
- split: include the option in ./convert.py and quantize (#6260, commented on Apr 27, 2024; 7 new comments)
- Added server example themes support with two sample themes and a favicon. (#6848, commented on Apr 29, 2024; 7 new comments)
- llamafile : improve moe prompt eval speed on cpu (#6840, commented on Apr 26, 2024; 7 new comments)
- llama : add Deepseek support #5981 (#6252, commented on Apr 26, 2024; 6 new comments)
- convert.py: add python logging instead of print() (#6511, commented on Apr 30, 2024; 6 new comments)
- main chat using simple json based template which drives in-prefix, in-suffix and reverse-prompt and a generic chat-apply-template helper driven by flags from same json (#6834, commented on Apr 30, 2024; 6 new comments)
- Introduce bfloat16 support (#6412, commented on Apr 29, 2024; 5 new comments)
- feat: add potential to run Jina Embeddings architecture (#6826, commented on Apr 30, 2024; 5 new comments)
- kubernetes example (#6546, commented on Apr 27, 2024; 4 new comments)
- [Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors? (#1499, commented on Apr 25, 2024; 4 new comments)
- Refactor convert.py and add support for Metas official Llama 3 model (#6819, commented on Apr 25, 2024; 4 new comments)
- [Feature Request] Dynamic temperature sampling for better coherence / creativity (#3483, commented on Apr 27, 2024; 3 new comments)
- off topic: linking two Mac Studio together to fit larger models (#6390, commented on Apr 29, 2024; 3 new comments)
- Performance decreated between tag b1500 and b2581 on Windows ARM64 PC (#6417, commented on Apr 29, 2024; 3 new comments)
- Support CoreML like whisper.cpp? (#1714, commented on Apr 25, 2024; 3 new comments)
- When I tried to convert the Qwen-VL-chat model to gguf, an error occurred: `Can not map tensor ‘transformer.visual.positional_embedding’. What is the reason? (#5331, commented on Apr 28, 2024; 3 new comments)
- Refactor chat template API (#6822, commented on Apr 24, 2024; 3 new comments)
- Support speculative decoding in `server` example (#5877, commented on Apr 30, 2024; 3 new comments)
- Implement (properly) different chat templates in main.cpp (#6391, commented on Apr 24, 2024; 3 new comments)
- Subtle Vulkan shader compilation bug when running on Adreno GPUs (Samsung Galaxy S23 Ultra) (#5186, commented on Apr 29, 2024; 2 new comments)
- [CANN] Add Ascend NPU backend (Part 1) (#6035, commented on Apr 29, 2024; 2 new comments)
- Server CUDA Infill Segmentation Fault (#6672, commented on Apr 30, 2024; 2 new comments)
- Python 3.12 support (#6422, commented on Apr 27, 2024; 2 new comments)
- Windows ROCm Build. (#2843, commented on Apr 30, 2024; 2 new comments)
- Server: add function calling API (#5588, commented on Apr 30, 2024; 2 new comments)
- Support for InternVL (#6803, commented on Apr 27, 2024; 2 new comments)
- server: avoid full prompt eval when 'prompt >= ctx' (#6855, commented on Apr 28, 2024; 2 new comments)
- common : fix parallel shard download interleaving output (#6831, commented on Apr 29, 2024; 2 new comments)
- vulkan backend failed to load models vk::Device::createComputePipeline: ErrorUnknown (#6843, commented on Apr 26, 2024; 2 new comments)
- can llama.cpp/convert.py support tokenizer rather than 'spm', 'bpe', 'hfft' (#6690, commented on Apr 25, 2024; 2 new comments)
- MobileVLM convert.py error (#6087, commented on Apr 29, 2024; 1 new comment)
- Running convert fails with BadZipFile (Bad CRC-32) (#4365, commented on Apr 29, 2024; 1 new comment)
- llama : add T5 (encoder-decoder) support (#5763, commented on Apr 30, 2024; 1 new comment)
- llava-cli process_prompt bug (#6823, commented on Apr 24, 2024; 1 new comment)
- Server: possibility of customizable chat template? (#5922, commented on Apr 28, 2024; 1 new comment)
- The tensor shape is different during gemma-2b model conversion, resulting in loading errors during inference. Repeat python convert.py ./models/gemma-2b After multiple conversions, the tensor shape is still different, resulting in loading errors during inference. (#6437, commented on Apr 28, 2024; 1 new comment)
- Add support for OPTForCausalLM (#6473, commented on Apr 24, 2024; 1 new comment)
- error (#6601, commented on Apr 24, 2024; 1 new comment)
- Implement ANPD (3x speedup, lossless) (#6813, commented on Apr 25, 2024; 1 new comment)
- Added dependency needed for numa in numactl mode (#6784, commented on Apr 24, 2024; 1 new comment)
- qwen 1.5 Beta 1.8B output incoherently (#5459, commented on Apr 25, 2024; 1 new comment)
- wrong number of tensors for AdaptLLM/medicine-chat (#6490, commented on Apr 25, 2024; 1 new comment)
- truly opensource model called olmo (#6712, commented on Apr 25, 2024; 1 new comment)
- For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via CUDA_DOCKER_ARCH (#5976, commented on Apr 25, 2024; 1 new comment)
- Support for Alibaba-NLP/gte-large-en-v1.5 Embedding Model (#6821, commented on Apr 27, 2024; 1 new comment)
- How can i get log probs in create_chat_completions in llama-cpp , I'm using logprobs=True as an attribute but still not getting Log Probabilities. (#6423, commented on Apr 26, 2024; 1 new comment)
- Multi-GPU support for AMD? (#3051, commented on Apr 27, 2024; 1 new comment)
- Temperature slider not working (#6676, commented on Apr 26, 2024; 1 new comment)
- New optimization from NVIDIA to use CUDA Graphs in llama.cpp (#6763, commented on Apr 26, 2024; 1 new comment)
- adding support for linux binaries (#5106, commented on Apr 30, 2024; 1 new comment)
- [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) (#6389, commented on Apr 30, 2024; 1 new comment)
- Support for 2-bit Quantized Llama-2-7b-chat-hf_2bitgs8_hqq Model (#6368, commented on Apr 30, 2024; 0 new comments)
- [User] Insert summary of your issue or enhancement.. (#1471, commented on Apr 30, 2024; 0 new comments)
- llama : add phixtral support (#4912, commented on Apr 29, 2024; 0 new comments)
- llama : compute BERT graph with F16 K, V (#5891, commented on Apr 29, 2024; 0 new comments)
- Make tokenize CLI tool have nicer command line arguments. (#6188, commented on Apr 25, 2024; 0 new comments)
- [SYCL] refactor (#6408, commented on Apr 30, 2024; 0 new comments)
- Server: Unix Socket Support (#6413, commented on Apr 23, 2024; 0 new comments)
- The MLX Challenge (#6539, commented on Apr 24, 2024; 0 new comments)
- How to activate BLAS? (#627, commented on Apr 26, 2024; 0 new comments)
- Bring back multimodal support for server (#6168, commented on Apr 26, 2024; 0 new comments)
- Add a new `llama_load_model_from_buffer()` method to compliment `llama_load_model_from_file()` (#6311, commented on Apr 26, 2024; 0 new comments)
- Adding MistralForCausalLM architecture to convert-hf-to-gguf.py (#4463, commented on Apr 25, 2024; 0 new comments)
- Add support for Jais architecture, both Jais-13B and Jais-30B shares the same architecture. (#6227, commented on Apr 25, 2024; 0 new comments)
- I have a specific question regarding qwen1.8b and qwen1.8b-chat, for which I am eagerly seeking your assistance. (#6228, commented on Apr 25, 2024; 0 new comments)
- Phind-CodeLlama-34b-v2 (#6306, commented on Apr 25, 2024; 0 new comments)
- CUDA error: invalid device function when compiling and running for amd gfx 1032 (#4762, commented on Apr 24, 2024; 0 new comments)
- Finetune from text (#5170, commented on Apr 24, 2024; 0 new comments)
- Excessively slow prompt processing time with 70B partially offloaded in SYCL (#5272, commented on Apr 24, 2024; 0 new comments)
- New IQ1_S somehow much worse than previous version (#5996, commented on Apr 24, 2024; 0 new comments)
- Model Request for BAAI/bge-m3 (XLMRoberta-based Multilingual Embedding Model) (#6007, commented on Apr 24, 2024; 0 new comments)
- GGML_ASSERT: ../llama.cpp/ggml-quants.c:10340: grid_index >= 0 (#6018, commented on Apr 24, 2024; 0 new comments)
- GGML_ASSERT: llama.cpp:3817: unicode_cpts_from_utf8(word).size() > 0 (#6132, commented on Apr 24, 2024; 0 new comments)
- MiniCPM Chat Template (#6236, commented on Apr 24, 2024; 0 new comments)
- Need help in extracting logits (token + probabilities)! (#6285, commented on Apr 24, 2024; 0 new comments)
- 1-2 Tesla P40 plus a powerful graphics card, does it make sense? (#6386, commented on Apr 30, 2024; 0 new comments)
- Kompute backend: add support for Vulkan devices that do not have storageBuffer8BitAccess (#6401, commented on Apr 30, 2024; 0 new comments)
- Lack of documentation regarding RoPE scaling (#2402, commented on Apr 29, 2024; 0 new comments)
- Incomplete instruction for https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md#intel-gpu (#6318, commented on Apr 29, 2024; 0 new comments)
- Add full support for OpenCL (#6362, commented on Apr 29, 2024; 0 new comments)
- POST to server takes forever (#2572, commented on Apr 28, 2024; 0 new comments)
- [User] AMD GPU slower than CPU (#3422, commented on Apr 28, 2024; 0 new comments)
- When I used the tool to quantify the chatglm model, the following error was reported (#3808, commented on Apr 28, 2024; 0 new comments)
- corruption on slot context shift (#6002, commented on Apr 28, 2024; 0 new comments)
- Metal failure after early March versions of server startup loading the model (#6020, commented on Apr 28, 2024; 0 new comments)
- Working Fine-Tune Example? (#6361, commented on Apr 28, 2024; 0 new comments)
- May we remove the big loop which runs > 10000 times everytime. (#6375, commented on Apr 28, 2024; 0 new comments)
- error: implicit declaration of function ‘getcpu’ (#5537, commented on Apr 27, 2024; 0 new comments)
- Mixtral 8x7b QLora not able to convert to gguf after training (#5905, commented on Apr 27, 2024; 0 new comments)
- “'token_embd.weight' has wrong shape” when loading deepseek-coder-1.3b-base.Q8_0.gguf (#5910, commented on Apr 27, 2024; 0 new comments)
- Using OpenCL on Adreno & Mali GPUs is slower than CPU (#5965, commented on Apr 27, 2024; 0 new comments)