AI-Assisted Llama.cpp Optimization

Running large language models locally has become increasingly accessible thanks to projects like llama.cpp and Ollama, which uses llama.cpp under the hood. Out of the box, inference on these engines is pretty good, but getting optimal performance requires careful tuning, especially when working with multiple GPUs of different capabilities like the mismatched pair in my workstation. With the recent release of the Qwen 3 models, I've been exploring their capabilities for locally hosted agents. Unsloth.ai provides awesome guides on running and fine-tuning bleeding-edge open-source models like Qwen 3, so I downloaded their dynamic 4-bit quant of Qwen 3 30B-A3B. Check them out; their work is incredible.

I pulled the model from Hugging Face and it ran decently well following Unsloth's instructions, but the metrics on my hardware told me there was still performance on the table. That led me to a question: how well can a frontier model optimize another AI model's inference engine?

The Dream Team: Claude & I

My role:

  • Execute the suggested llama.cpp commands
  • Provide performance metrics to Claude
  • Make sure the train stays on the rails

Claude 3.7 Sonnet's role and capabilities:

  • Analyze hardware specifications and performance metrics
  • Suggest optimal parameters and explain their purpose
  • Iteratively refine configurations based on observed results
  • Interpret the performance data to identify bottlenecks
  • Use tools such as web search MCP servers to look up real-time information

Hardware & Software Setup

My workstation configuration for this experiment was:

  • GPUs: RTX 3090 (24GB VRAM) + RTX 2060 (6GB VRAM)
  • CPU: AMD Ryzen 9 7950X3D
  • RAM: 128 GiB DDR5-5200
  • Model: Unsloth Qwen3-30B-A3B-UD-Q4_K_XL.gguf
  • Software: Latest main branch of llama.cpp
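
Since "latest main branch" is a moving target, here's roughly what the build looked like on my machine (a sketch, assuming CMake and the CUDA toolkit are already installed; the CUDA flag name has changed over the project's history, so check the repo's build docs if this doesn't match your checkout):

# clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j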

Baseline Performance

The initial server command I used was very similar to one in the Unsloth documentation:

./build/bin/llama-server \
    --model unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    --host 192.168.1.170 \
    --port 10000 \
    --n-gpu-layers 99 \
    --ctx-size 4096 \
    --threads 32

I provided Claude with the output from both nvtop and llama.cpp:

Baseline GPU Utilization (nvtop):

  • RTX 3090: 57% GPU utilization, 14.8GB/24GB VRAM used
  • RTX 2060: 39% GPU utilization, 3.9GB/6GB VRAM used

Baseline Performance (llama.cpp metrics):

  • Prompt processing: 128.71 tokens per second
  • Generation speed: 77.51 tokens per second
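
For reference, the throughput numbers come from the timing summary llama-server logs after each request (the exact wording varies by version), and the GPU numbers come from nvtop. A minimal way to generate a comparable load is a single request against the server's OpenAI-compatible endpoint; the prompt here is just a placeholder:

# terminal 1: watch per-GPU utilization and VRAM
nvtop

# terminal 2: send a test prompt to the server started above
curl -s http://192.168.1.170:10000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Summarize the history of the transistor."}], "max_tokens": 512}'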

Optimization Step 1: Claude's Initial Analysis

After examining my baseline metrics, Claude identified several opportunities for improvement:

  1. Threads were excessively high (32) for GPU-accelerated inference
  2. No tensor split was specified to balance the workload between GPUs
  3. Missing batch size optimization and continuous batching
  4. No memory locking to prevent swapping

Claude suggested this first optimization:

./build/bin/llama-server \
    --model unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    --host 192.168.1.170 \
    --port 10000 \
    --n-gpu-layers 99 \
    --ctx-size 4096 \
    --threads 1 \
    --tensor-split 0.8,0.2 \
    --batch-size 512 \
    --ubatch-size 512 \
    --parallel 4 \
    --cont-batching \
    --mlock \
    --split-mode layer

Claude's reasoning:

  • Reduced threads from 32 to 1 (counter-intuitive but effective for GPU acceleration)
  • Added tensor-split (0.8,0.2) to distribute workload proportionally to GPU capability
  • Set batch-size and ubatch-size for optimal throughput
  • Added continuous batching and memory locking for efficiency

After executing this command, I shared the new metrics with Claude:

  • Prompt processing: 185.00 tokens per second (43.7% improvement)
  • Generation speed: 78.33 tokens per second (1.1% improvement)
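
As an aside, the thread claim is easy to verify in isolation: llama.cpp ships a llama-bench tool whose parameters accept comma-separated lists, so a single run can sweep several thread counts. A minimal sketch:

# compare prompt processing (-p) and generation (-n) speed across thread counts
./build/bin/llama-bench \
    -m unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 \
    -t 1,2,4,8,16,32 \
    -p 512 -n 128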

Optimization Step 2: Claude Recommends Larger Context Window

The Unsloth folks recommend an optimal context window of 32,768 tokens. So I asked Claude to increase the window to that amount and continue optimizing. Based on my request, Claude suggested a new configuration:

./build/bin/llama-server \
    --model unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    --host 192.168.1.170 \
    --port 10000 \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --threads 2 \
    --tensor-split 0.75,0.25 \
    --batch-size 1024 \
    --ubatch-size 512 \
    --parallel 4 \
    --cont-batching \
    --mlock \
    --split-mode row \
    --flash-attn

Claude's key insights:

  • Added Flash Attention, essential for efficient large context handling
  • Increased context size to 32,768 tokens as recommended by Unsloth
  • Adjusted tensor split ratio to try improving load distribution
  • Changed to row-based splitting to test if it works better with Flash Attention

Claude was somewhat surprised by the results:

  • Generation speed dropped to 47.52 tokens per second
  • RTX 2060 was nearly maxed out at 5.4GB/6GB VRAM used

I expected this to happen on the first run after increasing the context window, so I asked Claude to keep going.
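
As a side note, if you want the same VRAM numbers in a scriptable form instead of reading them off nvtop, nvidia-smi's query mode prints one line per GPU:

# index, name, VRAM used/total, and utilization for each card
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv,noheader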

Optimization Step 3: Claude Identifies and Resolves A VRAM Bottleneck

After seeing the performance regression (and a bit of encouragement from me to keep going), Claude immediately identified the issue: the 2060's limited VRAM had become a bottleneck at the larger context size. Without any specific direction from me, Claude suggested a crucial adjustment:

./build/bin/llama-server \
    --model unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    --host 192.168.1.170 \
    --port 10000 \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --threads 2 \
    --tensor-split 0.9,0.1 \
    --batch-size 1024 \
    --ubatch-size 512 \
    --parallel 4 \
    --cont-batching \
    --mlock \
    --split-mode layer \
    --flash-attn

Claude's solution:

  • Changed tensor split from 0.75,0.25 to 0.9,0.1 to relieve VRAM pressure on the 2060
  • Switched back to layer-based splitting which works better for asymmetric setups
  • Kept the 32K context window and other optimizations
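
For intuition on what that ratio actually does: with --split-mode layer, the tensor split decides how many of the model's transformer layers (and their share of the KV cache) land on each card. Qwen3-30B-A3B has 48 layers, so a 0.9,0.1 split works out to roughly the following; llama.cpp's exact assignment may differ by a layer or two, and it's worth confirming the layer count against the GGUF metadata:

# back-of-the-envelope layer assignment for a 0.9,0.1 split across 48 layers
echo "RTX 3090: $(( 48 * 90 / 100 )) layers, RTX 2060: $(( 48 - 48 * 90 / 100 )) layers"
# -> RTX 3090: 43 layers, RTX 2060: 5 layers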

The results were impressive:

  • Generation speed: 82.69 tokens per second (74% improvement from previous step)
  • RTX 2060 VRAM usage dropped to 4.5GB/6GB, resolving the bottleneck

Optimization Step 4: Claude's Final Tuning

After I mentioned there was still memory headroom on the 3090, Claude suggested one final optimization:

./build/bin/llama-server \
    --model unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    --host 192.168.1.170 \
    --port 10000 \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --threads 2 \
    --tensor-split 0.97,0.03 \
    --batch-size 1024 \
    --ubatch-size 512 \
    --parallel 4 \
    --cont-batching \
    --mlock \
    --split-mode layer \
    --flash-attn

Claude made a precise, strategic adjustment:

  • Changed tensor split from 0.9,0.1 to 0.97,0.03 to maximize the 3090's contribution

The final results were exceptional:

  • RTX 3090: 80% GPU utilization, 20.3GB/24GB VRAM used
  • RTX 2060: 14% GPU utilization, 1GB/6GB VRAM used
  • Prompt processing: 252.36 tokens per second
  • Generation speed: 96.68 tokens per second

Final Results

| Configuration | Context Size | Generation Speed | Step Improvement |
|---------------|--------------|------------------|------------------|
| Baseline      | 4,096        | 77.51 tokens/sec | -                |
| Step 1        | 4,096        | 78.33 tokens/sec | +1.1%            |
| Step 2        | 32,768       | 47.52 tokens/sec | -39.3%           |
| Step 3        | 32,768       | 82.69 tokens/sec | +74.0%           |
| Step 4        | 32,768       | 96.68 tokens/sec | +16.9%           |

From our baseline to final configuration, Claude achieved:

  • 24.7% faster generation (77.51 → 96.68 tokens/sec)
  • 8x larger context window (4,096 → 32,768 tokens)
  • 96% faster prompt processing (128.71 → 252.36 tokens/sec)

Key Optimization Insights

I knew a fair amount about this process, but Claude taught me a few things along the way:

  1. Counter-intuitive thread optimization - Claude knew that reducing threads from 32 to 2 would improve performance for GPU-accelerated models, something many human users might not try.

  2. Asymmetric load balancing - Claude calculated the precise tensor-split ratios needed for our asymmetric GPU setup, eventually landing on the 97/3 split that best matched the two cards' capabilities. It took Claude a little while to register just how different these GPUs are, but once it did, it corrected course.

  3. VRAM bottleneck identification - Claude immediately spotted that the 2060's VRAM had become a bottleneck when we increased context size and knew exactly how to resolve it.

  4. Parameter compatibility awareness - When invalid parameters were encountered, Claude quickly adapted and suggested compatible alternatives. Sometimes Claude would hallucinate llama-server arguments, but telling it to use the MCP servers solved that problem.

  5. Flash Attention knowledge - Claude understood the importance of Flash Attention for large context windows and incorporated it at the right time. Flash attention is still a subject I'm learning more about, so this was a great suggestion. I don't think I would have made this change myself.
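
If you're curious how much Flash Attention contributes on its own, llama-bench can toggle it within a single run at a longer prompt length. A sketch (flag spellings occasionally change between llama.cpp versions, so check --help if this errors out):

# compare flash attention off vs. on with a longer prompt
./build/bin/llama-bench \
    -m unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 -t 2 \
    -fa 0,1 \
    -p 4096 -n 256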

What's next: AI-Optimizing-AI Agents

While I still served an important role as a moderator in this experiment, I think the future of this will be agentic. I see no reason why Claude couldn't have an MCP server pulling those real-time performance metrics rather than having me paste them into the chat manually.
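
The raw data such a tool would need is already one command away. A hypothetical polling loop like this (the file name and interval are my own invention) would give an agent the same live feed I was relaying by hand:

# append a timestamped per-GPU snapshot every 5 seconds for an agent or MCP tool to read
while true; do
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
        --format=csv,noheader >> gpu_metrics.csv
    sleep 5
done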

In fact, when I asked Claude what it thought of my role in this experiment, it described me as being "similar to the role of a lab technician conducting experiments designed by a research scientist."

For a while it may still make sense to keep a human in the loop, but this process could be repeated quickly for every model you want to serve. Either way, the end result is a positive one: you get the fastest inference your hardware can provide, and you waste less energy.