llama.cpp Hits 100k Stars — the Engine Behind Local AI
Georgi Gerganov's C++ inference engine reaches 100,000 GitHub stars in under three years, powering the local LLM revolution from Raspberry Pi to multi-GPU servers.
Three years ago, a Bulgarian engineer named Georgi Gerganov wanted to see if Meta's LLaMA weights could run in pure C++ — no Python, no PyTorch, no dependency sprawl. The first version ran a 7B model on a MacBook CPU. This week, the project crossed 100,000 stars on GitHub, making it one of the fastest open-source AI projects to reach that milestone.
What Happened
llama.cpp hit 100k stars around March 30, joining an elite club that includes PyTorch and TensorFlow, both of which took roughly twice as long to get there. The project now counts over 700 contributors and 15,000 forks, and has merged 3,800 pull requests in 2025 alone. That's roughly three times the PR throughput of NVIDIA's fully funded TensorRT-LLM.
Gerganov marked the occasion with a characteristically bold prediction on X: "Now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running via llama.cpp." Hugging Face CTO Julien Chaumond echoed the sentiment, predicting that within 18 months most AI agents will run locally.
The downstream ecosystem tells its own story. Ollama (110k+ stars), LM Studio, GPT4All, and Jan.ai all use llama.cpp as their inference backend. Over 60% of quantized models on Hugging Face ship in llama.cpp's GGUF format — more than GPTQ, AWQ, and EXL2 combined. When people talk about running Qwen 3.5 on a MacBook Air or old AMD MI50 GPUs, it's llama.cpp doing the heavy lifting.
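Part of GGUF's appeal is its simplicity: per the public GGUF spec, a file opens with a 4-byte magic string, a uint32 version, and uint64 tensor and metadata counts, all little-endian, before the metadata key-value pairs and tensor data. A minimal sketch of parsing that header (the sample values — version 3, 291 tensors, 24 KV pairs — are fabricated for illustration):

```python
import struct

# Build a fake GGUF header in memory: magic, version (uint32),
# tensor count (uint64), metadata KV count (uint64), little-endian.
# Real files continue with metadata key-value pairs and tensor info.
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)

magic = header[:4]
version, n_tensors, n_kv = struct.unpack_from("<IQQ", header, 4)

assert magic == b"GGUF", "not a GGUF file"
print(version, n_tensors, n_kv)  # 3 291 24
```

The same few lines of struct-unpacking are essentially all a loader needs to decide whether a file is GGUF at all, which is one reason the format spread so quickly across downstream tools.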
Why This Matters
The economics are stark. Local inference through llama.cpp costs roughly $0.002 per million tokens in electricity. Cloud APIs charge $2.50 to $15.00 for the same volume, a gap of roughly 1,250x to 7,500x. For privacy-sensitive workloads that need to satisfy GDPR, HIPAA, or ITAR requirements, the calculus is even simpler: if data never leaves the device, compliance is inherent.
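The gap follows directly from the stated figures, which are themselves rough estimates; a quick back-of-the-envelope check:

```python
# Cost comparison using the article's per-million-token estimates.
local_cost = 0.002                    # electricity, local llama.cpp inference
cloud_low, cloud_high = 2.50, 15.00   # typical cloud API pricing range

print(f"low-end gap:  {cloud_low / local_cost:,.0f}x")   # 1,250x
print(f"high-end gap: {cloud_high / local_cost:,.0f}x")  # 7,500x
```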
What started as a weekend hack has become critical infrastructure for the local AI movement. The project supports Apple Metal, NVIDIA CUDA, AMD ROCm, Intel SYCL, and Vulkan — everything from a Raspberry Pi to an 8-GPU server. And with developments like TurboQuant and dedicated ASIC cards pushing local inference speeds ever higher, the engine powering it all just proved it has staying power.
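Backend selection happens at build time via CMake flags. A sketch of the common configurations follows; the flag names are believed accurate for recent llama.cpp releases, but they have changed over time, so the project's build documentation is the authority:

```shell
# Clone and configure llama.cpp with a hardware backend enabled.
# Typically exactly one GGML_* backend flag is turned on per build.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON      # NVIDIA CUDA
# cmake -B build -DGGML_METAL=ON   # Apple Metal (default on macOS)
# cmake -B build -DGGML_VULKAN=ON  # Vulkan
# cmake -B build -DGGML_SYCL=ON    # Intel SYCL
# cmake -B build -DGGML_HIP=ON     # AMD ROCm/HIP
cmake --build build --config Release -j
```

A CPU-only build needs no backend flag at all, which is what makes the Raspberry Pi end of the spectrum work.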


