grumpycat tech stories.

Local-first Fill-in-the-Middle (FIM) with llama.cpp

March 25, 2026

grumpycat (leslie-alexandre d.)

Today was a Fill-in-the-Middle day. llama.cpp is custom built, ready to serve Zed's edit predictions locally. Sweep Next-Edit is quite fast indeed (it is based on Qwen2.5 Coder). huggingface.co/sweepai/swee...

sweepai/sweep-next-edit-1.5B · Hugging Face

https://huggingface.co/sweepai/sweep-next-edit-1.5B

Currently, there are many options for hosted FIM autocompletion. So far, I've used Mistral's Codestral, which remains free.

That said, I continue to prepare for independence and privacy. I bet that the free tiers of all platforms will shrink considerably and eventually disappear before long.

llama.cpp

Build

These are the minimum dependencies required to build llama.cpp with two common GPU backends: CUDA for NVIDIA cards (e.g., my NVIDIA GeForce RTX 4050 Laptop GPU with 6 GB VRAM), and Vulkan, which on Linux also covers NVIDIA cards along with the integrated graphics of my CPU.

sudo apt-get install \
  build-essential \
  ccache \
  cmake \
  glslc \
  libvulkan-dev \
  nvidia-cuda-dev \
  nvidia-cuda-toolkit

llama.cpp can now be compiled. I rarely prefer static builds, but in this case llama.cpp has too many dependencies to be easily portable, IMO.

That's why we'll build a big-bytes-static-binary. GitHub Actions will be used to rebuild the binaries on a regular basis.

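# GGML_CCACHE: build cache
# BUILD_SHARED_LIBS=OFF: no dynamic libraries
# GGML_CUDA / GGML_VULKAN: CUDA and Vulkan backends
# GGML_NATIVE: optimize for the current platform's hardware only
# CMAKE_CUDA_ARCHITECTURES: RTX architectures (86 = Ampere, 89 = Ada)
# GGML_CUDA_ENABLE_UNIFIED_MEMORY: RAM + VRAM (llama.cpp documents this as a
#   runtime environment variable, so it may have no effect at build time)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CCACHE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89" \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DLLAMA_BUILD_SERVER=ON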
  
cmake --build build \
  --target llama-server llama-bench llama-cli llama-quantize \
  --config Release \
  --parallel $(nproc)

cp build/bin/* ~/.local/bin/

cmake --build build --target clean
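
Before moving on, it can be worth confirming that the copied binaries are actually reachable from $PATH. The following is a minimal sketch; check_bins is a hypothetical helper name (not part of llama.cpp), and it only assumes the four targets built above were copied to ~/.local/bin.

```shell
# Report which of the given commands cannot be found in $PATH.
# Returns non-zero if at least one of them is missing.
check_bins() {
  missing=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing: $bin"
      missing=1
    fi
  done
  return "$missing"
}

# The four targets built above.
check_bins llama-server llama-bench llama-cli llama-quantize \
  || echo "is ~/.local/bin in your PATH?"
```

If anything is reported missing, the usual culprit is a shell profile that does not add ~/.local/bin to $PATH.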

Run

Now that llama-server is available in $PATH, the API used for FIM can be launched.

Fine-grained device selection starts with listing the available devices; llama.cpp has a helper for that.

$ llama-server --list-devices

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 5772 MiB):
  Device 0: NVIDIA GeForce RTX 4050 Laptop GPU, compute capability 8.9, VMM: yes, VRAM: 5772 MiB
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (RPL-P) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Available devices:
  CUDA0: NVIDIA GeForce RTX 4050 Laptop GPU (5772 MiB, 2523 MiB free)
  Vulkan0: Intel(R) Graphics (RPL-P) (23402 MiB, 14217 MiB free)
  Vulkan1: NVIDIA GeForce RTX 4050 Laptop GPU (6141 MiB, 2523 MiB free)
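
The identifiers in the "Available devices" section are what the --device flag expects. As a rough sketch, they can be extracted with awk; device_ids is a hypothetical helper name, and the parsing assumes the output layout shown above, which may change between llama.cpp versions.

```shell
# Print the device identifiers found in the "Available devices:" section
# of `llama-server --list-devices` output, read from stdin.
device_ids() {
  awk '
    /^Available devices:/ { in_list = 1; next }   # start of the section
    in_list && /^[ \t]+[^ \t]+:/ {
      id = $1
      sub(/:$/, "", id)                           # "CUDA0:" becomes "CUDA0"
      print id
    }
  '
}
```

Usage would look like `llama-server --list-devices 2>&1 | device_ids`, which on the machine above should list CUDA0, Vulkan0 and Vulkan1, one per line.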

Let's go.

# -ngl 99: offload all layers to the GPU
# --ctx-size 0: use the context size stored in the model
# --fit off: no automatic adjustment
# --temp 0: no creativity
# --no-webui: disable the built-in web UI (the HTTP API stays up)
$ llama-server \
  -ngl 99 \
  --ubatch-size 1024 --batch-size 1024 \
  --ctx-size 0 \
  --cache-reuse 256 \
  --fit off \
  --direct-io \
  --mlock --no-mmap \
  --temp 0 \
  --no-webui \
  --kv-unified \
  --predict 512 \
  --device CUDA0 \
  --verbose \
    -hf sweepai/sweep-next-edit-1.5B
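
Once the server is up, a quick smoke test from another terminal helps before wiring any editor to it. This is a sketch under assumptions: the server listens on llama-server's default address (http://localhost:8080), curl is installed, and the function names are hypothetical. It uses the /health endpoint and the OpenAI-compatible /v1/completions endpoint, with a plain code prefix as the prompt rather than a model-specific template.

```shell
# Assumption: llama-server listens on its default address.
LLAMA_URL="${LLAMA_URL:-http://localhost:8080}"

# Succeeds once the server answers /health with a success status,
# which only happens after the model has finished loading.
server_ready() {
  curl --silent --fail --max-time 2 "$LLAMA_URL/health" >/dev/null
}

# A tiny, fixed completion request body (greedy, 16 tokens at most).
smoke_payload() {
  printf '%s' '{"prompt": "def add(a, b):\n    return ", "max_tokens": 16, "temperature": 0}'
}

# Send the request and print the raw JSON response.
smoke_test() {
  smoke_payload | curl --silent --fail "$LLAMA_URL/v1/completions" \
    -H 'Content-Type: application/json' \
    --data-binary @-
}
```

Running `server_ready && smoke_test` should print a JSON document containing the generated text; a non-zero exit status means the server is not reachable or still loading.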

If you are curious about the size of the models' GGUF files on disk, they are stored in the Hugging Face cache (~/.cache/huggingface by default on Linux).

du -hs .cache/huggingface/hub/*
4.1G    .cache/huggingface/hub/models--google--codegemma-2b-GGUF
4.0G    .cache/huggingface/hub/models--JetBrains--Mellum-4b-sft-all-gguf
1.5G    .cache/huggingface/hub/models--sweepai--sweep-next-edit-1.5B
941M    .cache/huggingface/hub/models--unsloth--Qwen2.5-Coder-1.5B-Instruct-GGUF
2.0G    .cache/huggingface/hub/models--unsloth--Qwen2.5-Coder-3B-Instruct-GGUF

Models

For FIM features, the models must be trained for that purpose. They are decoder-only architectures trained on coding-oriented datasets.
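
To make "trained for that purpose" concrete: FIM-capable models learn special tokens that separate the code before the cursor, the code after it, and the spot where the missing middle should be generated. The sketch below uses the token names documented for Qwen2.5-Coder; other families use different tokens and sometimes a different ordering, and fim_prompt is a hypothetical helper name.

```shell
# Assemble a fill-in-the-middle prompt in the Qwen2.5-Coder token layout:
# prefix (code before the cursor), suffix (code after it), then the marker
# after which the model generates the missing middle.
fim_prompt() {
  printf '<|fim_prefix|>%s<|fim_suffix|>%s<|fim_middle|>' "$1" "$2"
}

# Example: the cursor sits right after "return ".
fim_prompt 'def add(a, b):
    return ' '

print(add(1, 2))'
echo
```

A general-purpose chat model has never seen these tokens in training, which is why it tends to produce poor completions when prompted this way.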

Here is my nearly exhaustive selection of collections, with the models I tried locally listed under each.

  • Sweep Next-Edit: https://huggingface.co/collections/sweepai/sweep-next-edit

    • sweepai/sweep-next-edit-1.5B


  • Qwen2.5 Coder: https://huggingface.co/collections/Qwen/qwen25-coder

  • Qwen2.5 Coder (Unsloth): https://huggingface.co/collections/unsloth/qwen-25-coder

    • unsloth/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M


  • codegemma: https://huggingface.co/collections/google/codegemma-release

    • google/codegemma-2b-GGUF

    • Note:


  • Mellum: https://huggingface.co/collections/JetBrains/mellum

    • JetBrains/Mellum-4b-sft-all-gguf

NB: you must be authenticated to fetch gated models like codegemma; it's as simple as exporting HF_TOKEN or passing the -hft flag to llama.cpp.

See https://huggingface.co/settings/tokens.
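
A small guard avoids a confusing download failure when the token is missing. This is only a sketch: require_hf_token is a hypothetical helper name, and it covers the environment-variable route rather than the -hft flag.

```shell
# Fail early, with a hint, when HF_TOKEN is not set in the environment.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; create a token in your Hugging Face settings" >&2
    return 1
  fi
}
```

With the token exported in the shell profile, a gated model can then be started with `require_hf_token && llama-server -hf google/codegemma-2b-GGUF`.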

Zed

Zed configuration is pretty straightforward: either set up the Edit Predictions feature from the GUI settings or use this text-based config.

{
  "edit_predictions": {
    "open_ai_compatible_api": {
      "api_url": "http://localhost:8080/v1/completions",
      "model": "sweepai/sweep-next-edit-1.5B",
      "max_output_tokens": 64,
      "prompt_format": "qwen"
    },
    "disabled_globs": [
      "**/.env"
    ],
    "provider": "open_ai_compatible_api",
    "mode": "eager",
  },
  "edit_predictions_disabled_in": [
    "comment",
    "string",
  ],
  "languages": {
    "JSONC": {
      "show_edit_predictions": false
    },
    "Plain Text": {
      "show_edit_predictions": false
    }
  }
...
}

See https://zed.dev/docs/ai/edit-prediction#self-hosted-openai-compatible-servers


See you in a bit ✌︎㋡


llm
LocalLLaMA
