On-Prem LLM

On-prem LLM deployment — full reference architecture

Reference architecture, hardware sizing, and operational model for self-hosted LLMs. From a single H100 to a multi-node cluster.

Self-hosting an LLM is more than running a model. It includes inference server, API gateway, model registry, observability, security patching, and ongoing ops. This page gives you the reference architecture and a sizing model.

Minimum hardware

  • 1× H100 80GB (18B models) or 4× A100 40GB (70B models)
  • 128GB RAM, 8TB NVMe
  • 25GbE internal fabric
  • Hardware HSM for key custody

Reference stack

  • Inference: vLLM / TensorRT-LLM / SGLang
  • API: OpenAI-compatible gateway (Plugsky or open-source)
  • Observability: OpenTelemetry, Prometheus, Grafana
  • Vector store: pgvector / Qdrant / Pinecone
  • Model registry: MLflow or OCI registry with signed bundles

Sizing model

Use the GPU capacity calculator to size for your workload.

Get started

See the full pricing table and start a trial.

Start trial → Enterprise plans