Self-hosting an LLM is more than running a model. It includes inference server, API gateway, model registry, observability, security patching, and ongoing ops. This page gives you the reference architecture and a sizing model.
Minimum hardware
- 1× H100 80GB (18B models) or 4× A100 40GB (70B models)
- 128GB RAM, 8TB NVMe
- 25GbE internal fabric
- Hardware HSM for key custody
Reference stack
- Inference: vLLM / TensorRT-LLM / SGLang
- API: OpenAI-compatible gateway (Plugsky or open-source)
- Observability: OpenTelemetry, Prometheus, Grafana
- Vector store: pgvector / Qdrant / Pinecone
- Model registry: MLflow or OCI registry with signed bundles
Sizing model
Use the GPU capacity calculator to size for your workload.