Cloud: pay-per-token APIs. No hardware required. Immediate access to the latest and largest models. Data leaves your environment, which often rules it out for sensitive or regulated content. Best for prototyping, variable workloads, and applications needing the highest capability ceiling.
Local: run on your own hardware using tools like Ollama or llama.cpp. No per-token fees after the hardware purchase, though electricity still costs money. Full data privacy: nothing leaves your machine. Requires upfront hardware investment, typically a GPU. Best for high-volume, privacy-sensitive, or offline use cases.
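As a rough illustration of what "nothing leaves your machine" looks like in practice, the sketch below sends a prompt to an Ollama server running on the same machine through its local HTTP API. It assumes Ollama is listening on its default port (11434) and that a model has already been pulled; the model name `llama3` is only a placeholder, and the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server.

    `model` is a placeholder; use whichever model you have pulled locally.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain quantisation in one sentence.")` returns the model's reply as a string. The request only ever targets `localhost`, so the prompt and the response stay on the machine.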
Cloud wins on raw capability and zero setup. Local wins on privacy and long-run cost. The gap has narrowed: a 70B local model now competes with mid-tier cloud models on many tasks. The decision usually comes down to data sensitivity and volume.
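The volume side of that decision can be estimated with simple break-even arithmetic: divide the one-off hardware cost by the API price per token. The sketch below is a simplification that ignores electricity, maintenance, and any quality difference between models. The figures in the usage note are placeholders, not real prices.

```python
def break_even_tokens(hardware_cost: float, price_per_million_tokens: float) -> float:
    """Tokens at which cumulative API spend equals the one-off hardware cost."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return hardware_cost / price_per_million_tokens * 1_000_000


def months_to_break_even(
    hardware_cost: float,
    price_per_million_tokens: float,
    tokens_per_month: float,
) -> float:
    """Months of steady usage before local hardware pays for itself."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return break_even_tokens(hardware_cost, price_per_million_tokens) / tokens_per_month
```

With a hypothetical 2,000 hardware budget and a hypothetical API price of 5 per million tokens, `break_even_tokens(2000, 5)` gives 400 million tokens. At 50 million tokens a month, `months_to_break_even(2000, 5, 50_000_000)` gives 8 months. Low or bursty volume pushes the break-even point out and favours cloud.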
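RTX 3080, 4070, M2/M3 Mac (16 GB). Runs 7B–8B models at 8-bit quantisation or 13B models at 4-bit. Unquantised FP16 weights for a 7B model need about 14 GB, more than the 10–12 GB of VRAM on these GPUs. Good for development and personal use.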
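RTX 3090, 4090, A5000 (24 GB VRAM). Runs 13B–30B models comfortably when quantised, for example 13B at 8-bit or 30B at 4-bit. Can handle 70B models at 4-bit quantisation if some layers are offloaded to system RAM, at a noticeable cost in speed.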
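A6000, H100, or dual 3090s (48–80 GB VRAM). Runs 70B models at 4-bit quantisation entirely in VRAM. Unquantised FP16 weights for a 70B model need about 140 GB, which takes several of these GPUs. Required for reliable production use of the largest open-source models.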
llama.cpp and Ollama also run on CPU or on Apple Silicon using unified memory. An M2 Ultra (192 GB) can run 70B models well. Slower than a discrete GPU, but model size is limited by total system memory rather than by a separate VRAM pool.
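A rough way to check what fits on a given machine is to estimate weight memory as parameter count times bits per weight, divided by eight. The sketch below does that arithmetic. The 1.2 overhead factor for KV cache and activations is an assumption for illustration, and real quantised formats usually spend slightly more than their nominal bit width, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(
    params_billions: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead: float = 1.2,  # assumed headroom for KV cache and activations
) -> bool:
    """True if the weights, scaled by the assumed overhead, fit in memory_gb."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= memory_gb
```

For example, `weight_memory_gb(70, 16)` is 140 GB and `weight_memory_gb(70, 4)` is 35 GB. So `fits(70, 4, 48)` is true for a 48 GB setup, while `fits(70, 4, 24)` is false for a single 24 GB card without offloading.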