GPUs & Models
Add `gpu` to your deployment and get production LLMs instantly — no CUDA setup, no model downloads, no cold-start waits. You choose the model, Nexlayer launches your environment, and your GPU is already set, with `NXL_INFERENCE_URL` injected into your pod's env before the container starts.
Declarative — set `gpu.model` in YAML; the scheduler handles allocation, weight loading, and warm-up.
Fast — sub-50ms TTFT on small models. Weights pre-pinned. No per-request warm-up.
Priced sanely — $0.50 / $1.25 / $2.50 per hour. Beats hosted APIs at volume.
The gpu block
```yaml
application:
  name: my-chat-backend
  pods:
    - name: api
      path: /api
      image: myorg/chat-backend:v1
      servicePorts: [8000]
      gpu:
        enabled: true
        model: llama-3.3-70b   # catalog slug or "auto" or "custom"
        priority: inference    # interactive | inference | training | batch
      vars:
        DATABASE_URL: postgresql://app:pass@db.pod:5432/chat
```

Fields:

- `gpu.enabled` — must be `true` when the block is present. Omit the block to disable GPU.
- `gpu.model` — catalog slug, literal `auto` (Free plan only; scheduler picks a small shared model), or literal `custom` (Enterprise Mode-1 bring-your-own).
- `gpu.priority` — QoS tier driving preemption order and billing. Defaults to `inference`.
- `gpu.memoryGB` — required only when `model: custom`. Max 96 on the current fleet.
Three modes, one syntax
Mode 3 — Shared pinned
$0.50 / hr. Small and mid models (≤48 GB VRAM). Packed with other tenants. Sub-50ms TTFT.
Triggered by catalog slugs like `llama-3.1-8b`, `qwen-2.5-coder-7b`, `phi-3.5-mini`, `nomic-embed`.
Mode 2 — Large pinned
$1.25 / hr. 70B-class. Dedicated slot on a 96 GB RTX PRO 6000.
Triggered by `llama-3.3-70b`, `deepseek-r1-distill-llama-70b-fp8`, `qwen-2.5-coder-32b-fp8`.
Mode 1 — Dedicated (Enterprise)
$2.50 / hr. Raw card. Your own model server (vLLM, TGI, custom CUDA).
Triggered by `model: custom` + `memoryGB: 96`. Enterprise plan only.
All rates at coefficient 1.0 · 1,000 credits = $1 · billed via the k8s meter every 5 minutes.
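The billing arithmetic above (hourly rate × metered time, 1,000 credits = $1, 5-minute meter) can be sketched as follows. This is a minimal sketch for estimating cost; the round-up-to-the-next-tick behavior is an assumption, not documented metering semantics:

```python
# Sketch of Nexlayer GPU billing math: hourly rate, credits, 5-minute metering.
# Assumption: usage is rounded up to the next 5-minute meter tick.
import math

RATE_PER_HOUR = {"mode3": 0.50, "mode2": 1.25, "mode1": 2.50}  # USD at coefficient 1.0
CREDITS_PER_DOLLAR = 1_000

def cost_credits(mode: str, minutes: float) -> int:
    """Credits charged for `minutes` of GPU time in the given mode."""
    billed_minutes = math.ceil(minutes / 5) * 5          # 5-minute meter ticks
    dollars = RATE_PER_HOUR[mode] * billed_minutes / 60  # pro-rated hourly rate
    return round(dollars * CREDITS_PER_DOLLAR)

# One hour of Mode 2 (70B-class): $1.25 -> 1,250 credits.
print(cost_credits("mode2", 60))   # 1250
# 12 minutes of Mode 3: billed as 15 minutes -> $0.125 -> 125 credits.
print(cost_credits("mode3", 12))   # 125
```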
Using the injected inference endpoint
For Modes 2 and 3 the scheduler injects `NXL_INFERENCE_URL` before your container starts. It speaks the OpenAI protocol:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["NXL_INFERENCE_URL"],
    api_key="not-used",  # URL is already owner-scoped
)
resp = client.chat.completions.create(
    model="llama-3.3-70b",  # ignored; the slug in nexlayer.yaml wins
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

The `model` param in the call is ignored — your pod is already pinned, and the URL only accepts the pinned slug. Passing `model: "auto"` or the exact slug both work; there is no cross-routing.
Bringing your own model (Mode 1)
```yaml
- name: trainer
  image: myorg/custom-trainer:v1
  servicePorts: [8080]
  gpu:
    enabled: true
    model: custom
    priority: training
    memoryGB: 96
```

The scheduler allocates a full RTX PRO 6000 on a node labelled `nexlayer.ai/gpu-mode=dedicated`. Your container gets `NXL_GPU_INDEX=0`, full access to `nvidia.com/gpu: 1`, and no model server is started on your behalf. You bring vLLM, TGI, raw CUDA, a training loop, whatever. Enterprise plan only.
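A Mode 1 entrypoint typically pins its own model server to the allocated card via `NXL_GPU_INDEX`. A minimal sketch — the vLLM invocation, flags, and model path are illustrative assumptions, not Nexlayer-provided tooling:

```python
# Sketch: build an entrypoint command that pins a bring-your-own model server
# to the GPU index Nexlayer injects. The vLLM flags here are illustrative only.
import os

def serving_command(model_path: str) -> tuple[dict, list[str]]:
    """Return (extra env, argv) for launching a model server on the assigned GPU."""
    gpu_index = os.environ.get("NXL_GPU_INDEX", "0")   # injected in Mode 1
    env = {"CUDA_VISIBLE_DEVICES": gpu_index}          # restrict to our card
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--port", "8080",                              # must match servicePorts
    ]
    return env, argv

env, argv = serving_command("/models/my-finetune")
print(env, argv)
```

Pass `env` and `argv` to whatever process launcher your image uses (e.g. `subprocess.run(argv, env={**os.environ, **env})`).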
Plan access matrix
| Plan | GPU access | Included hours/mo |
|---|---|---|
| Free — $0 | gpu.model: auto only | ~10h Mode 3 |
| Pro — $29/mo | Any small Mode 3 slug | 60h Mode 3 |
| Scale — $299/mo | All Mode 2 + Mode 3 models (70B-class unlocked) | 240h Mode 2 / 600h Mode 3 |
| Enterprise — $2,999+/mo | Everything + model: custom Mode 1 | Custom contract |
Full catalog with slugs, use cases, and mode/VRAM footprint lives on the pricing page. Or call nexlayer_list_models from any MCP-connected agent to get the per-plan availability matrix in-context.
Antipatterns
| Wrong | Right |
|---|---|
| model: llama-3.3-70b (at pod root) | gpu: {enabled: true, model: llama-3.3-70b, priority: inference} |
| gpu: {model: llama-3.3-70b} (no enabled) | gpu: {enabled: true, model: llama-3.3-70b} |
| MODEL_NAME env var | gpu.model — scheduler can't see env at admit time |
| useGPU: true (deprecated) | gpu: {enabled: true, model: custom, memoryGB: 96} |
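Migrating off the deprecated `useGPU` flag can be automated. A minimal sketch following the mapping in the table above — the pod dict shape mirrors the YAML, and `migrate_pod` is a hypothetical helper, not a Nexlayer CLI command:

```python
# Sketch: rewrite a pod spec that still uses the deprecated `useGPU` flag into
# the current `gpu` block, following the mapping in the antipatterns table.
def migrate_pod(pod: dict) -> dict:
    pod = dict(pod)  # shallow copy; don't mutate the caller's spec
    if pod.pop("useGPU", False):
        pod["gpu"] = {"enabled": True, "model": "custom", "memoryGB": 96}
    return pod

old = {"name": "trainer", "image": "myorg/custom-trainer:v1", "useGPU": True}
print(migrate_pod(old))
```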
Validation
nexlayer_validate_yaml (and the deploy endpoint) enforces:
- `gpu.enabled: true` required when the block is present.
- `gpu.model` is a known catalog slug, or `auto`, or `custom`.
- `gpu.priority` is one of `interactive`, `inference`, `training`, `batch`.
- `gpu.memoryGB` present iff `model: custom`, and ≤ 96.
- Per-plan entitlement check — Free can only use `auto`, Pro can't pick 70B models, and so on.
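These rules can also be checked client-side before deploying. A minimal sketch — the small `CATALOG` set here is a stand-in (the authoritative list comes from the pricing page or `nexlayer_list_models`), and the per-plan entitlement check is omitted:

```python
# Sketch: client-side pre-flight check mirroring the server-side GPU rules.
# CATALOG is a stand-in; the authoritative list comes from nexlayer_list_models.
CATALOG = {"llama-3.1-8b", "qwen-2.5-coder-7b", "phi-3.5-mini", "nomic-embed",
           "llama-3.3-70b", "deepseek-r1-distill-llama-70b-fp8",
           "qwen-2.5-coder-32b-fp8"}
PRIORITIES = {"interactive", "inference", "training", "batch"}

def validate_gpu_block(gpu: dict) -> list[str]:
    """Return a list of rule violations (empty list == valid)."""
    errors = []
    if gpu.get("enabled") is not True:
        errors.append("gpu.enabled must be true when the block is present")
    model = gpu.get("model")
    if model not in CATALOG and model not in ("auto", "custom"):
        errors.append(f"unknown gpu.model: {model!r}")
    if gpu.get("priority", "inference") not in PRIORITIES:
        errors.append(f"invalid gpu.priority: {gpu.get('priority')!r}")
    has_mem = "memoryGB" in gpu
    if model == "custom" and not has_mem:
        errors.append("gpu.memoryGB is required when model: custom")
    if model != "custom" and has_mem:
        errors.append("gpu.memoryGB is only valid with model: custom")
    if has_mem and gpu["memoryGB"] > 96:
        errors.append("gpu.memoryGB exceeds the 96 GB fleet maximum")
    return errors

print(validate_gpu_block({"enabled": True, "model": "llama-3.3-70b"}))  # []
```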