Zing — Industry Resources

Local vs Cloud AI Models

Open-source models you run yourself compared to commercial cloud APIs — context, pricing, capabilities and hardware requirements

☁️ Cloud Models

Pay-per-token APIs. No hardware required. Immediate access to the latest and largest models. Data leaves your environment, so cloud APIs are often unsuitable for sensitive or regulated content unless the provider's terms and your compliance rules allow it. Best for prototyping, variable workloads, and applications needing the highest capability ceiling.
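Pay-per-token pricing is quoted separately for input and output, per 1M tokens. A minimal sketch of the arithmetic follows; the function name and the prices in the example are hypothetical placeholders, not any provider's list prices.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Cost of one request when prices are quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices for illustration only: $3 input, $15 output per 1M tokens.
cost = request_cost_usd(2_000, 500, input_price_per_m=3.00, output_price_per_m=15.00)
print(f"${cost:.4f} per request")  # prints "$0.0135 per request"
```

Multiplying the per-request figure by expected monthly request volume gives a rough monthly bill to set against hardware costs.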

🖥️ Local Models

Run on your own hardware using tools like Ollama or llama.cpp. No per-token cost after the hardware purchase, though electricity and maintenance still apply. Full data privacy: nothing leaves your machine. Requires upfront GPU investment. Best for high-volume, privacy-sensitive, or offline use cases.
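As an illustration of what "nothing leaves your machine" looks like in practice, here is a rough sketch of a request to a locally running Ollama server over its HTTP API. It assumes Ollama is listening on its default port 11434; the model name "llama3" is a placeholder for whatever model you have pulled, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama3", host="http://localhost:11434"):
    """Build (but do not send) a non-streaming request to Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Summarise this paragraph: ...") only works while the Ollama server is running and the named model has been downloaded; the request goes to localhost and never reaches an external service.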

⚖️ Key Trade-off

Cloud wins on raw capability and zero setup. Local wins on privacy and long-run cost. The gap has narrowed — a 70B local model now competes with mid-tier cloud models for most tasks. The decision usually comes down to data sensitivity and volume.
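The long-run cost argument can be made concrete with a break-even estimate. The sketch below is a simplification that assumes a steady monthly workload and ignores hardware resale value and staff time; all figures in the example are hypothetical.

```python
import math


def breakeven_months(hardware_cost, monthly_cloud_cost, monthly_local_running_cost=0.0):
    """Whole months until local hardware pays for itself, or None if it never does."""
    saving = monthly_cloud_cost - monthly_local_running_cost
    if saving <= 0:
        return None  # cloud is cheaper at this volume, so there is no break-even
    return math.ceil(hardware_cost / saving)


# Hypothetical: $2,000 GPU, $250/month cloud bill, $50/month electricity.
print(breakeven_months(2_000, 250, 50))  # prints 10
```

At low volume the function returns None, which matches the point above: below a certain usage level, the cloud bill never catches up with the hardware outlay.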

Model	Type	Context	Input / 1M tokens	Output / 1M tokens	Chat	Coding	Vision	Tool use	Reasoning	Privacy	Min VRAM / RAM

Running Models Locally — Hardware Guide

Consumer GPU (8–12 GB VRAM)

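RTX 3080, 4070, M2/M3 Mac (16 GB). Runs 7B–8B models quantised (4-bit comfortably, 8-bit on 12 GB cards) and 13B models at 4-bit with little headroom. FP16 weights for a 7B model alone take about 14 GB, so full precision does not fit in this tier. Good for development and personal use.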

Prosumer GPU (24 GB VRAM)

RTX 3090, 4090, A5000. Runs 13B–30B models comfortably at 4-bit quantisation. Can handle 70B models at 4-bit only by offloading part of the model to system RAM, which reduces generation speed considerably.

Workstation / Multi-GPU (48–80 GB)

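A6000, H100, or dual 3090s. Runs 70B models at 4-bit quantisation entirely in VRAM, or at 8-bit on an 80 GB card with tight headroom. Full precision (FP16) for a 70B model needs about 140 GB for weights alone, so it requires several such GPUs. This tier is the practical minimum for reliable production use of 70B-class open-source models.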

CPU / Apple Silicon

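llama.cpp and Ollama also run on CPU using system RAM, or on Apple Silicon using unified memory. An M2 Ultra (192 GB) can run 70B models well. Typically slower than a discrete GPU, but there is no separate VRAM limit: the ceiling is total system or unified memory.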

About quantisation: Local model VRAM figures assume 4-bit quantisation (Q4_K_M), the most common trade-off between quality and memory. Full precision (FP16) requires roughly 3 to 4× the VRAM; 8-bit requires roughly 2×. Ollama's library models are typically distributed pre-quantised, and llama.cpp loads pre-quantised GGUF files or can quantise a model with its own tooling. Cloud model costs are list prices as of June 2026; see the AI Price Comparison page for live pricing.
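The memory needed for model weights follows from parameter count and bits per weight. The sketch below is a lower bound only: it uses nominal bit widths (real 4-bit formats such as Q4_K_M store slightly more than 4 bits per weight) and leaves out the KV cache and runtime overhead. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # params_billions * 1e9 weights * bits / 8 bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    int8 = weight_memory_gb(size, 8)
    q4 = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 {fp16:.1f} GB, 8-bit {int8:.1f} GB, 4-bit {q4:.1f} GB")
```

For a 70B model this gives 140 GB at FP16 and 35 GB at a nominal 4 bits, which is why 70B models need quantisation to fit on 48 GB of VRAM.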