Today was a Fill-in-the-Middle (FIM) day. llama.cpp is custom built and ready to serve Zed's edit predictions locally. Sweep Next-Edit (based on Qwen2.5 Coder) is quite fast indeed. huggingface.co/sweepai/swee...
Currently, there are many options for hosted FIM autocompletion. So far, I've used Mistral's Codestral, which remains free.
That said, I continue to prepare for independence and privacy. I bet that the free tiers of all platforms will shrink considerably and eventually disappear before long.
llama.cpp
Build
These are the minimum dependencies required to build llama.cpp with two common backends, covering both CPU and GPU: CUDA for NVIDIA cards (e.g., my NVIDIA GeForce RTX 4050 Laptop GPU with 6 GB VRAM), and Vulkan for Linux systems, which also supports NVIDIA cards and CPUs.
sudo apt-get install \
  build-essential \
  ccache \
  cmake \
  glslc \
  libvulkan-dev \
  nvidia-cuda-dev \
  nvidia-cuda-toolkit
llama.cpp can now be compiled. I rarely prefer static builds, but in this case llama.cpp has too many dependencies to be easily portable, IMO.
That's why we'll build one big static binary. GitHub Actions will be used to rebuild the binaries on a regular basis.
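# GGML_CCACHE: build cache
# BUILD_SHARED_LIBS=OFF: no dynamic libraries
# GGML_CUDA / GGML_VULKAN: CUDA and Vulkan backends
# GGML_NATIVE: optimize only for the current platform's hardware
# CMAKE_CUDA_ARCHITECTURES: RTX architectures
# GGML_CUDA_ENABLE_UNIFIED_MEMORY: RAM + VRAM (llama.cpp documents this as a
#   runtime environment variable, so it may have no effect at configure time)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CCACHE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89" \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DLLAMA_BUILD_SERVER=ON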
cmake --build build \
  --target llama-server llama-bench llama-cli llama-quantize \
  --config Release \
  --parallel $(nproc)
cp build/bin/* ~/.local/bin/
cmake --build build --target clean
Run
Now that llama-server is available in $PATH, the API used for FIM can be launched.
Fine-grained device selection requires listing the devices first; llama.cpp has a helper for that.
$ llama-server --list-devices
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 5772 MiB):
Device 0: NVIDIA GeForce RTX 4050 Laptop GPU, compute capability 8.9, VMM: yes, VRAM: 5772 MiB
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (RPL-P) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Available devices:
CUDA0: NVIDIA GeForce RTX 4050 Laptop GPU (5772 MiB, 2523 MiB free)
Vulkan0: Intel(R) Graphics (RPL-P) (23402 MiB, 14217 MiB free)
Vulkan1: NVIDIA GeForce RTX 4050 Laptop GPU (6141 MiB, 2523 MiB free)
Let's go.
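One optional step first: the value passed to --device can be extracted from that listing instead of being typed by hand. This is a minimal sketch, assuming the "Available devices:" layout shown above; `pick_device` is a hypothetical helper name, not part of llama.cpp.

```shell
# Hypothetical helper: print the first device id that starts with the given
# backend prefix (e.g. CUDA or Vulkan). Reads the listing on stdin.
pick_device() {
  awk -v prefix="$1" '
    /^Available devices:/ { listing = 1; next }
    listing && index($1, prefix) == 1 { sub(/:$/, "", $1); print $1; exit }
  '
}

# Usage (assumes llama-server is in $PATH):
#   llama-server --list-devices 2>&1 | pick_device CUDA
```

Matching on a prefix keeps the choice of backend explicit, which matters here because the same NVIDIA card shows up under both CUDA and Vulkan.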
# -ngl 99: offload all layers to the GPU
# --ctx-size 0: take the context size from the model
# --fit off: no automatic adjustment
# --temp 0: no creativity
# --no-webui: disable the built-in web UI (the HTTP API keeps running)
$ llama-server \
  -ngl 99 \
  --ubatch-size 1024 --batch-size 1024 \
  --ctx-size 0 \
  --cache-reuse 256 \
  --fit off \
  --direct-io \
  --mlock --no-mmap \
  --temp 0 \
  --no-webui \
  --kv-unified \
  --predict 512 \
  --device CUDA0 \
  --verbose \
  -hf sweepai/sweep-next-edit-1.5B
If you are curious about the size of the GGUF models on disk, they are stored in the Hugging Face cache by default (~/.cache/huggingface on Linux).
du -hs .cache/huggingface/hub/*
4.1G .cache/huggingface/hub/models--google--codegemma-2b-GGUF
4.0G .cache/huggingface/hub/models--JetBrains--Mellum-4b-sft-all-gguf
1.5G .cache/huggingface/hub/models--sweepai--sweep-next-edit-1.5B
941M .cache/huggingface/hub/models--unsloth--Qwen2.5-Coder-1.5B-Instruct-GGUF
2.0G .cache/huggingface/hub/models--unsloth--Qwen2.5-Coder-3B-Instruct-GGUF
Models
For FIM features, the models must be built for that purpose. They are decoder-only architectures trained on coding-oriented datasets.
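To make "built for that purpose" concrete, here is a small sketch of the prompt layout that Qwen2.5-Coder uses for FIM: the code before the cursor, the code after it, then a marker where the model writes the missing middle. `fim_prompt` is a hypothetical helper, and other model families use their own special tokens, so treat this as an illustration of the idea only.

```shell
# Hypothetical helper: wrap a prefix and a suffix in Qwen2.5-Coder FIM tokens.
# The model is expected to generate the text that belongs between them.
fim_prompt() {
  printf '<|fim_prefix|>%s<|fim_suffix|>%s<|fim_middle|>' "$1" "$2"
}

# Example: the cursor sits right after "return ".
fim_prompt 'def add(a, b):
    return ' '

print(add(1, 2))'
```

A general chat model has never seen these markers during training, which is why a FIM-trained model is needed for this kind of completion.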
Here is my nearly exhaustive selection of collections, with the models I tried locally.
Sweep Next-Edit: https://huggingface.co/collections/sweepai/sweep-next-edit
sweepai/sweep-next-edit-1.5B
Qwen2.5 Coder: https://huggingface.co/collections/Qwen/qwen25-coder
Qwen2.5 Coder (Unsloth): https://huggingface.co/collections/unsloth/qwen-25-coder
unsloth/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M
codegemma: https://huggingface.co/collections/google/codegemma-release
google/codegemma-2b-GGUF
Mellum: https://huggingface.co/collections/JetBrains/mellum
JetBrains/Mellum-4b-sft-all-gguf
NB: you must be authenticated to fetch gated models like codegemma; it's as simple as exporting HF_TOKEN or using llama.cpp's -hft flag.
See https://huggingface.co/settings/tokens.
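A missing token only shows up as a failed download, so a small guard before launching can save a round trip. This is a sketch; `require_hf_token` is a hypothetical helper and the error message is my own wording.

```shell
# Hypothetical guard: fail early when HF_TOKEN is empty or unset.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; gated models will fail to download" >&2
    return 1
  fi
}

# Usage (token value elided on purpose):
#   export HF_TOKEN=...
#   require_hf_token && llama-server -hf google/codegemma-2b-GGUF
```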
Zed
Zed configuration is pretty straightforward: follow the Edit Predictions feature in the GUI settings or use this text-based config.
{
  "edit_predictions": {
    "open_ai_compatible_api": {
      "api_url": "http://localhost:8080/v1/completions",
      "model": "sweepai/sweep-next-edit-1.5B",
      "max_output_tokens": 64,
      "prompt_format": "qwen"
    },
    "disabled_globs": [
      "**/.env"
    ],
    "provider": "open_ai_compatible_api",
    "mode": "eager",
  },
  "edit_predictions_disabled_in": [
    "comment",
    "string",
  ],
  "languages": {
    "JSONC": {
      "show_edit_predictions": false
    },
    "Plain Text": {
      "show_edit_predictions": false
    }
  }
  ...
}
See https://zed.dev/docs/ai/edit-prediction#self-hosted-openai-compatible-servers
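Before pointing Zed at the server, it can help to check that the endpoint answers at all. The sketch below builds a minimal OpenAI-style completion request for the "api_url" from the settings above. `completion_payload` is a hypothetical helper, and it does no JSON escaping, so it only suits simple one-line prompts.

```shell
# Assumed endpoint, matching "api_url" in the Zed settings.
API_URL="http://localhost:8080/v1/completions"

# Hypothetical helper: build a minimal completion request body.
# $1 = prompt (no quotes, backslashes or newlines), $2 = max tokens.
completion_payload() {
  printf '{"prompt": "%s", "max_tokens": %d, "temperature": 0}' "$1" "$2"
}

# Usage (requires a running llama-server):
#   curl -s "$API_URL" -H 'Content-Type: application/json' \
#     -d "$(completion_payload '<|fim_prefix|>x = <|fim_suffix|><|fim_middle|>' 64)"
```

If this returns a JSON response with a completion, any remaining problem is on the editor side rather than in llama-server.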
See you in a bit ✌︎㋡