On-Prem LLM

On-prem LLM deployment — full reference architecture

Reference architecture, hardware sizing, and operational model for self-hosted LLMs. From a single H100 to a multi-node cluster.

Start Free → See pricing Read the docs

Self-hosting an LLM is more than running a model. It includes inference server, API gateway, model registry, observability, security patching, and ongoing ops. This page gives you the reference architecture and a sizing model.

Minimum hardware

1× H100 80GB (18B models) or 4× A100 40GB (70B models)
128GB RAM, 8TB NVMe
25GbE internal fabric
Hardware HSM for key custody

Reference stack

Inference: vLLM / TensorRT-LLM / SGLang
API: OpenAI-compatible gateway (Plugsky or open-source)
Observability: OpenTelemetry, Prometheus, Grafana
Vector store: pgvector / Qdrant / Pinecone
Model registry: MLflow or OCI registry with signed bundles

Sizing model

Use the GPU capacity calculator to size for your workload.

Get started

See the full pricing table and start a trial.

Start trial → Enterprise plans