GPUs & Models

Add gpu to your deployment and get production LLMs instantly — no CUDA setup, no model downloads, no cold-start waits. You choose the model, Nexlayer launches your environment, and your GPU is already provisioned, with NXL_INFERENCE_URL injected into your pod's env before the container starts.

Declarative

Set gpu.model in YAML; the scheduler handles allocation, weight loading, and warm-up.

Fast

Sub-50ms TTFT on small models. Weights pre-pinned. No per-request warm-up.

Priced sanely

$0.50 / $1.25 / $2.50 per hour. Beats hosted APIs at volume.

The gpu block

nexlayer.yaml
application:
  name: my-chat-backend
  pods:
    - name: api
      path: /api
      image: myorg/chat-backend:v1
      servicePorts: [8000]
      gpu:
        enabled: true
        model: llama-3.3-70b    # catalog slug or "auto" or "custom"
        priority: inference     # interactive | inference | training | batch
      vars:
        DATABASE_URL: postgresql://app:pass@db.pod:5432/chat

Fields:

  • gpu.enabled — must be true when the block is present. Omit the block to disable GPU.
  • gpu.model — catalog slug, literal auto (Free plan only, scheduler picks a small shared model), or literal custom (Enterprise Mode-1 bring-your-own).
  • gpu.priority — QoS tier driving preemption order and billing. Defaults to inference.
  • gpu.memoryGB — required only when model: custom. Max 96 on the current fleet.
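Put together, the smallest valid GPU pod — e.g. on the Free plan — needs only enabled and model (pod name, path, and image below are placeholders):

```yaml
application:
  name: quickstart
  pods:
    - name: api                # placeholder pod name
      path: /
      image: myorg/api:v1      # placeholder image
      servicePorts: [8000]
      gpu:
        enabled: true          # must be true whenever the block is present
        model: auto            # Free plan: scheduler picks a small shared model
        # priority omitted — defaults to inference
```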

Three modes, one syntax

Mode 3 — Shared pinned

$0.50 / hr

Small and mid-size models (≤48 GB VRAM). Packed with other tenants. Sub-50ms TTFT.

Triggered by catalog slugs like llama-3.1-8b, qwen-2.5-coder-7b, phi-3.5-mini, nomic-embed.

Mode 2 — Large pinned

$1.25 / hr

70B-class. Dedicated slot on a 96GB RTX PRO 6000.

Triggered by llama-3.3-70b, deepseek-r1-distill-llama-70b-fp8, qwen-2.5-coder-32b-fp8.

Mode 1 — Dedicated (Enterprise)

$2.50 / hr

Raw card. Your own model server (vLLM, TGI, custom CUDA).

Triggered by model: custom + memoryGB: 96. Enterprise plan only.

All rates at coefficient 1.0 · 1,000 credits = $1 · billed via the k8s meter every 5 minutes.
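As a sanity check on the arithmetic above (a sketch only; the real meter runs server-side): at 1,000 credits per dollar and a 5-minute billing tick, the hourly rates convert to credits per tick as follows.

```python
CREDITS_PER_DOLLAR = 1_000
TICKS_PER_HOUR = 60 // 5  # billed every 5 minutes → 12 ticks per hour

def credits_per_tick(dollars_per_hour: float, coefficient: float = 1.0) -> float:
    """Credits charged per 5-minute billing tick at a given hourly rate."""
    return dollars_per_hour * coefficient * CREDITS_PER_DOLLAR / TICKS_PER_HOUR

# The three modes at coefficient 1.0:
for rate in (0.50, 1.25, 2.50):
    print(f"${rate}/hr -> {credits_per_tick(rate):.2f} credits per tick")
```

So a Mode 3 pod draws roughly 41.67 credits every 5 minutes, and a Mode 1 card roughly 208.33.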

Using the injected inference endpoint

For Modes 2 and 3, the scheduler injects NXL_INFERENCE_URL before your container starts. The endpoint speaks the OpenAI protocol:

api/chat.py
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["NXL_INFERENCE_URL"],
    api_key="not-used",  # URL is already owner-scoped
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",   # ignored; the slug in nexlayer.yaml wins
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

The model param in the call is ignored — your pod is already pinned, and the URL only accepts the pinned slug. Passing model: "auto" or the exact slug both work; there is no cross-routing.
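Because the variable is only injected for Modes 2 and 3, a defensive lookup at startup fails fast in misconfigured pods (a sketch; the helper name is ours, not part of the platform):

```python
import os

def inference_url() -> str:
    """Return the injected inference endpoint, failing fast if this pod has
    no pinned model (e.g. a Mode 1 pod, where no model server is started
    on your behalf)."""
    url = os.environ.get("NXL_INFERENCE_URL")
    if not url:
        raise RuntimeError(
            "NXL_INFERENCE_URL not set: is gpu.model a catalog slug (Mode 2/3)?"
        )
    return url
```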

Bringing your own model (Mode 1)

nexlayer.yaml — Mode 1 BYO
- name: trainer
  image: myorg/custom-trainer:v1
  servicePorts: [8080]
  gpu:
    enabled: true
    model: custom
    priority: training
    memoryGB: 96

The scheduler allocates a full RTX PRO 6000 on a node labelled nexlayer.ai/gpu-mode=dedicated. Your container gets NXL_GPU_INDEX=0, full access to nvidia.com/gpu: 1, and no model server is started on your behalf. You bring vLLM, TGI, raw CUDA, a training loop, whatever. Enterprise plan only.
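A typical Mode 1 entrypoint maps the injected index onto CUDA's device mask before launching whatever server you bring (a sketch; the helper is ours, and vllm serve stands in for your own command):

```python
import os

def pin_gpu(env: dict) -> dict:
    """Copy env, mapping Nexlayer's NXL_GPU_INDEX onto CUDA_VISIBLE_DEVICES
    so your model server (vLLM, TGI, raw CUDA) only sees the allocated card."""
    env = dict(env)
    env["CUDA_VISIBLE_DEVICES"] = env.get("NXL_GPU_INDEX", "0")
    return env

# Your entrypoint would then exec its own server with this env, e.g.:
#   os.execvpe("vllm", ["vllm", "serve", "<model>", "--port", "8080"],
#              pin_gpu(dict(os.environ)))
serve_env = pin_gpu({"NXL_GPU_INDEX": "0"})
print(serve_env["CUDA_VISIBLE_DEVICES"])  # → 0
```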

Plan access matrix

Plan                     | GPU access                                      | Included hours/mo
Free — $0                | gpu.model: auto only                            | ~10h Mode 3
Pro — $29/mo             | Any small Mode 3 slug                           | 60h Mode 3
Scale — $299/mo          | All Mode 2 + Mode 3 models (70B-class unlocked) | 240h Mode 2 / 600h Mode 3
Enterprise — $2,999+/mo  | Everything + model: custom Mode 1               | Custom contract

Full catalog with slugs, use cases, and mode/VRAM footprint lives on the pricing page. Or call nexlayer_list_models from any MCP-connected agent to get the per-plan availability matrix in-context.

Antipatterns

Wrong                                     | Right
model: llama-3.3-70b (at pod root)        | gpu: {enabled: true, model: llama-3.3-70b, priority: inference}
gpu: {model: llama-3.3-70b} (no enabled)  | gpu: {enabled: true, model: llama-3.3-70b}
MODEL_NAME env var                        | gpu.model — scheduler can't see env at admit time
useGPU: true (deprecated)                 | gpu: {enabled: true, model: custom, memoryGB: 96}

Validation

nexlayer_validate_yaml (and the deploy endpoint) enforces:

  • gpu.enabled: true required when the block is present.
  • gpu.model is a known catalog slug, or auto, or custom.
  • gpu.priority is one of interactive, inference, training, batch.
  • gpu.memoryGB present iff model: custom, and ≤ 96.
  • Per-plan entitlement check — Free can only use auto, Pro can't pick 70B models, and so on.
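These rules are straightforward to mirror client-side before deploying. A minimal sketch under our own assumptions (slug list abbreviated from the modes above; the per-plan entitlement check is omitted since it needs account data):

```python
CATALOG = {
    "llama-3.1-8b", "qwen-2.5-coder-7b", "phi-3.5-mini", "nomic-embed",
    "llama-3.3-70b", "deepseek-r1-distill-llama-70b-fp8",
    "qwen-2.5-coder-32b-fp8",
}  # abbreviated
PRIORITIES = {"interactive", "inference", "training", "batch"}

def validate_gpu(block: dict) -> list:
    """Return a list of violations of the documented gpu-block rules."""
    errors = []
    if block.get("enabled") is not True:
        errors.append("gpu.enabled must be true when the block is present")
    model = block.get("model")
    if model not in CATALOG and model not in ("auto", "custom"):
        errors.append(f"unknown gpu.model: {model!r}")
    if block.get("priority", "inference") not in PRIORITIES:
        errors.append(f"invalid gpu.priority: {block.get('priority')!r}")
    mem = block.get("memoryGB")
    if model == "custom":
        if mem is None:
            errors.append("gpu.memoryGB required when model: custom")
        elif mem > 96:
            errors.append("gpu.memoryGB exceeds fleet max of 96")
    elif mem is not None:
        errors.append("gpu.memoryGB only allowed with model: custom")
    return errors
```

For example, validate_gpu({"enabled": True, "model": "llama-3.3-70b"}) returns an empty list, while omitting enabled or exceeding 96 GB each yields a violation.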