GPUs & Models
Add `gpu` to your deployment and get production LLMs instantly — no CUDA setup, no model downloads, no cold-start waits. You choose the model, Nexlayer launches your environment, and your GPU is already set, with `NXL_INFERENCE_URL` injected into your pod's env before the container starts.
Declarative — set `gpu.model` in YAML; the scheduler handles allocation, weight loading, and warm-up.
Fast — sub-50ms TTFT on small models. Weights pre-pinned. No per-request warm-up.
Priced sanely — $0.50 / $1.25 / $2.50 per hour. Beats hosted APIs at volume.
The gpu block
```yaml
application:
  name: my-chat-backend
  pods:
    - name: api
      path: /api
      image: myorg/chat-backend:v1
      servicePorts: [8000]
      gpu:
        enabled: true
        model: llama-3.3-70b   # catalog slug or "auto" or "custom"
        priority: inference    # interactive | inference | training | batch
      vars:
        DATABASE_URL: postgresql://app:pass@db.pod:5432/chat
```

Fields:

- `gpu.enabled` — must be `true` when the block is present. Omit the block to disable GPU.
- `gpu.model` — catalog slug, literal `auto` (Free plan only; scheduler picks a small shared model), or literal `custom` (Enterprise Mode-1 bring-your-own).
- `gpu.priority` — QoS tier driving preemption order and billing. Defaults to `inference`.
- `gpu.memoryGB` — required only when `model: custom`. Max 96 on the current fleet.
Three modes, one syntax
Mode 3 — Shared pinned
$0.50 / hr. Small and mid models (≤48 GB VRAM). Packed with other tenants. Sub-50ms TTFT.
Triggered by catalog slugs like `llama-3.1-8b`, `qwen-2.5-coder-7b`, `phi-3.5-mini`, `nomic-embed`.
Mode 2 — Large pinned
$1.25 / hr. 70B-class. Dedicated slot on a 96 GB RTX PRO 6000.
Triggered by `llama-3.3-70b`, `deepseek-r1-distill-llama-70b-fp8`, `qwen-2.5-coder-32b-fp8`.
Mode 1 — Dedicated (Enterprise)
$2.50 / hr. Raw card. Your own model server (vLLM, TGI, custom CUDA).
Triggered by `model: custom` + `memoryGB: 96`. Enterprise plan only.
All rates at coefficient 1.0 · 1,000 credits = $1 · billed via the k8s meter every 5 minutes.
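The billing arithmetic above (hourly rate × metered time, 1,000 credits = $1, 5-minute meter) can be sketched as follows. This is a minimal sketch for estimating cost; the round-up-to-the-next-tick behavior is an assumption, not documented metering semantics:

```python
# Sketch of Nexlayer GPU billing math: hourly rate, credits, 5-minute metering.
# Assumption: usage is rounded up to the next 5-minute meter tick.
import math

RATE_PER_HOUR = {"mode3": 0.50, "mode2": 1.25, "mode1": 2.50}  # USD at coefficient 1.0
CREDITS_PER_DOLLAR = 1_000

def cost_credits(mode: str, minutes: float) -> int:
    """Credits charged for `minutes` of GPU time in the given mode."""
    billed_minutes = math.ceil(minutes / 5) * 5          # 5-minute meter ticks
    dollars = RATE_PER_HOUR[mode] * billed_minutes / 60  # pro-rated hourly rate
    return round(dollars * CREDITS_PER_DOLLAR)

# One hour of Mode 2 (70B-class): $1.25 -> 1,250 credits.
print(cost_credits("mode2", 60))   # 1250
# 12 minutes of Mode 3: billed as 15 minutes -> $0.125 -> 125 credits.
print(cost_credits("mode3", 12))   # 125
```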
Using the injected inference endpoint
For Modes 2 and 3 the scheduler injects `NXL_INFERENCE_URL` before your container starts. It speaks the OpenAI protocol:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["NXL_INFERENCE_URL"],
    api_key="not-used",  # URL is already owner-scoped
)
resp = client.chat.completions.create(
    model="llama-3.3-70b",  # ignored; the slug in nexlayer.yaml wins
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

The `model` param in the call is ignored — your pod is already pinned, and the URL only accepts the pinned slug. Passing `model: "auto"` or the exact slug both work; there is no cross-routing.
Bringing your own model (Mode 1)
```yaml
- name: trainer
  image: myorg/custom-trainer:v1
  servicePorts: [8080]
  gpu:
    enabled: true
    model: custom
    priority: training
    memoryGB: 96
```

The scheduler allocates a full RTX PRO 6000 on a node labelled `nexlayer.ai/gpu-mode=dedicated`. Your container gets `NXL_GPU_INDEX=0`, full access to `nvidia.com/gpu: 1`, and no model server is started on your behalf. You bring vLLM, TGI, raw CUDA, a training loop, whatever. Enterprise plan only.
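A Mode 1 entrypoint typically pins its own model server to the allocated card via `NXL_GPU_INDEX`. A minimal sketch — the vLLM invocation, flags, and model path are illustrative assumptions, not Nexlayer-provided tooling:

```python
# Sketch: build an entrypoint command that pins a bring-your-own model server
# to the GPU index Nexlayer injects. The vLLM flags here are illustrative only.
import os

def serving_command(model_path: str) -> tuple[dict, list[str]]:
    """Return (extra env, argv) for launching a model server on the assigned GPU."""
    gpu_index = os.environ.get("NXL_GPU_INDEX", "0")   # injected in Mode 1
    env = {"CUDA_VISIBLE_DEVICES": gpu_index}          # restrict to our card
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--port", "8080",                              # must match servicePorts
    ]
    return env, argv

env, argv = serving_command("/models/my-finetune")
print(env, argv)
```

Pass `env` and `argv` to whatever process launcher your image uses (e.g. `subprocess.run(argv, env={**os.environ, **env})`).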
Plan access matrix
| Plan | GPU access | Included hours/mo |
|---|---|---|
| Free — $0 | gpu.model: auto only | ~10h Mode 3 |
| Pro — $29/mo | Any small Mode 3 slug | 60h Mode 3 |
| Scale — $299/mo | All Mode 2 + Mode 3 models (70B-class unlocked) | 240h Mode 2 / 600h Mode 3 |
| Enterprise — $2,999+/mo | Everything + model: custom Mode 1 | Custom contract |
Full catalog with slugs, use cases, and mode/VRAM footprint lives on the pricing page. Or call nexlayer_list_models from any MCP-connected agent to get the per-plan availability matrix in-context.
Antipatterns
| Wrong | Right |
|---|---|
| model: llama-3.3-70b (at pod root) | gpu: {enabled: true, model: llama-3.3-70b, priority: inference} |
| gpu: {model: llama-3.3-70b} (no enabled) | gpu: {enabled: true, model: llama-3.3-70b} |
| MODEL_NAME env var | gpu.model — scheduler can't see env at admit time |
| useGPU: true (deprecated) | gpu: {enabled: true, model: custom, memoryGB: 96} |
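Migrating off the deprecated `useGPU` flag can be automated. A minimal sketch following the mapping in the table above — the pod dict shape mirrors the YAML, and `migrate_pod` is a hypothetical helper, not a Nexlayer CLI command:

```python
# Sketch: rewrite a pod spec that still uses the deprecated `useGPU` flag into
# the current `gpu` block, following the mapping in the antipatterns table.
def migrate_pod(pod: dict) -> dict:
    pod = dict(pod)  # shallow copy; don't mutate the caller's spec
    if pod.pop("useGPU", False):
        pod["gpu"] = {"enabled": True, "model": "custom", "memoryGB": 96}
    return pod

old = {"name": "trainer", "image": "myorg/custom-trainer:v1", "useGPU": True}
print(migrate_pod(old))
```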
Validation
nexlayer_validate_yaml (and the deploy endpoint) enforces:
- `gpu.enabled: true` required when the block is present.
- `gpu.model` is a known catalog slug, or `auto`, or `custom`.
- `gpu.priority` is one of `interactive`, `inference`, `training`, `batch`.
- `gpu.memoryGB` present iff `model: custom`, and ≤ 96.
- Per-plan entitlement check — Free can only use `auto`, Pro can't pick 70B models, and so on.
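These rules can also be checked client-side before deploying. A minimal sketch — the small `CATALOG` set here is a stand-in (the authoritative list comes from the pricing page or `nexlayer_list_models`), and the per-plan entitlement check is omitted:

```python
# Sketch: client-side pre-flight check mirroring the server-side GPU rules.
# CATALOG is a stand-in; the authoritative list comes from nexlayer_list_models.
CATALOG = {"llama-3.1-8b", "qwen-2.5-coder-7b", "phi-3.5-mini", "nomic-embed",
           "llama-3.3-70b", "deepseek-r1-distill-llama-70b-fp8",
           "qwen-2.5-coder-32b-fp8"}
PRIORITIES = {"interactive", "inference", "training", "batch"}

def validate_gpu_block(gpu: dict) -> list[str]:
    """Return a list of rule violations (empty list == valid)."""
    errors = []
    if gpu.get("enabled") is not True:
        errors.append("gpu.enabled must be true when the block is present")
    model = gpu.get("model")
    if model not in CATALOG and model not in ("auto", "custom"):
        errors.append(f"unknown gpu.model: {model!r}")
    if gpu.get("priority", "inference") not in PRIORITIES:
        errors.append(f"invalid gpu.priority: {gpu.get('priority')!r}")
    has_mem = "memoryGB" in gpu
    if model == "custom" and not has_mem:
        errors.append("gpu.memoryGB is required when model: custom")
    if model != "custom" and has_mem:
        errors.append("gpu.memoryGB is only valid with model: custom")
    if has_mem and gpu["memoryGB"] > 96:
        errors.append("gpu.memoryGB exceeds the 96 GB fleet maximum")
    return errors

print(validate_gpu_block({"enabled": True, "model": "llama-3.3-70b"}))  # []
```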