General · custom_container · apache-2.0 · Added May 26, 2026

Qwen3 30B A3B Instruct 2507 FP8

FP8 vLLM 0.21.0 CUDA 13· Default BF16 vLLM 0.21.0 CUDA 13· Testing

Qwen3 30B A3B Instruct 2507 FP8 is a public Apache-2.0 non-thinking MoE chat model for general instruction following, coding, multilingual knowledge, long-context understanding, and tool-use style prompts.

text→text

llm · chat · instruction · agentic · +11

Recommended target: H200 in eu-north2. Picked from the fastest verified GPU/region for this model version. Playground and API docs links are pinned to this route.

Context window

32768

VRAM needed

40.3 GB

Observed working set on a supported GPU.

FP8 vLLM 0.21.0 CUDA 13 · Testing · Default version
Qwen3 30B A3B Instruct 2507 FP8 is served by the official vLLM OpenAI-compatible CUDA 13 Docker image as a hidden non-default Forge onboarding candidate. A partial matrix run at 2026-05-26T11Z validated H200, L40S, and RTX6000. The 2026-05-26T12Z probe marked B300 as supported after disabling FlashInfer FP8 MoE and forcing DeepGEMM. B200 remains disabled: the retained diagnostic from 2026-05-26T13Z isolated the terminal failure to Torch/vLLM CUTLASS scaled_mm dispatch during the startup profile_run. Forge uses this version when requests omit an explicit model version.

API route: POST /v1/chat/completions
Weights dtype: FP8
Pulled image size: 8.1 GB

After the last request a backend stays warm on its GPU for about 15 minutes, then frees the GPU. The next request triggers a fresh cold start.
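
The warm window described above can be reasoned about on the client side. This is a minimal sketch, assuming the roughly 15-minute figure holds and that you record the time of your own last request. `WARM_WINDOW` and `likely_cold` are illustrative names, not part of the Forge API, and traffic from other clients may keep the backend warm longer than this guess suggests.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the backend frees its GPU about 15 minutes after the last request.
WARM_WINDOW = timedelta(minutes=15)


def likely_cold(last_request_at: datetime, now: datetime) -> bool:
    """Return True when the next request would probably trigger a cold start."""
    return now - last_request_at >= WARM_WINDOW


last = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
print(likely_cold(last, last + timedelta(minutes=5)))   # False
print(likely_cold(last, last + timedelta(minutes=20)))  # True
```

The runtime status URL further down the page remains the authoritative check; this only helps decide whether to budget for a cold start.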

Status

cold

Not running.

API target

qwen-qwen3-30b-a3b-instruct-2507 version fp8-vllm-0-21-0-cuda13 on H200 in eu-north2

Use the model field with the OpenAI-compatible SDK or API docs curl snippets.

API route

/v1/chat/completions

HTTP method

POST

Model field

qwen-qwen3-30b-a3b-instruct-2507

Version field

model_version: fp8-vllm-0-21-0-cuda13

GPU field

gpu_type: H200

Region field

region: eu-north2

1. Verify target: Runs the auth guard and the selected endpoint/model/routing check.
2. Validate target: Confirm the selected GPU or region is still verified, or print copyable best-target exports.
3. Estimate run: Validate warm and first-cold request cost before prewarming or first traffic.
4. Check runtime: Confirm whether the selected version is warm or starting.
5. Prewarm target: Start the selected version on its pinned GPU or region before latency-sensitive traffic.
6. Open docs: Use the selected target snippets for the first request.

One-block API check

Terminal-ready smoke test for this selected target.

set -euo pipefail
# Forge API smoke test
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
case "${FORGE_API_KEY:-}" in
  ""|replace-with-your-forge-api-key)
    echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
    exit 1
    ;;
esac
forge_api_url() {
  endpoint="$1"
  base="${FORGE_API_BASE%/}"
  case "$base:$endpoint" in
    */v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
    *) printf '%s%s\n' "$base" "$endpoint" ;;
  esac
}
python3 - <<'PY' |
import json
import os

payload = {
    "model": os.environ["MODEL_OR_FAMILY_SLUG"],
    "messages": [
        {"role": "user", "content": "Write a one sentence status update."},
    ],
}
model_version = os.environ.get("FORGE_MODEL_VERSION")
if model_version:
    payload["model_version"] = model_version
gpu_type = os.environ.get("FORGE_GPU_TYPE", "H200")
if gpu_type:
    payload["gpu_type"] = gpu_type
region = os.environ.get("FORGE_REGION", "eu-north2")
if region:
    payload["region"] = region
print(json.dumps(payload))
PY
curl -sS --fail-with-body "$(forge_api_url '/v1/chat/completions')" \
  --max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
  -X POST \
  -H "Authorization: Bearer ${FORGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @- | \
python3 -c 'import json, sys

data = json.load(sys.stdin)
message = (((data.get("choices") or [{}])[0].get("message") or {}).get("content"))
if message:
    print(message)
else:
    print(json.dumps(data, indent=2))'

Client fit

OpenAI SDK route · Best for chat, completions, embeddings, and clients that already support an OpenAI-compatible base URL.

Routing pinned

Copied snippets include gpu_type and region, so the first request targets this verified GPU and region. Remove those fields to let Forge choose another compatible target.
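
The pinning behaviour described above comes down to which optional fields the request body carries. The sketch below is an illustration rather than Forge's own client code: `build_chat_payload` is a hypothetical helper, and the field names `model_version`, `gpu_type`, and `region` are taken from the smoke-test snippet on this page.

```python
import json


def build_chat_payload(model, prompt, model_version=None, gpu_type=None, region=None):
    """Build a chat-completions body; routing fields that are unset are left out."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    optional = {"model_version": model_version, "gpu_type": gpu_type, "region": region}
    # Only include fields that have a value, so the scheduler can choose the rest.
    payload.update({key: value for key, value in optional.items() if value})
    return payload


MODEL = "qwen-qwen3-30b-a3b-instruct-2507"
PROMPT = "Write a one sentence status update."

# Pinned to the verified target shown on this page.
pinned = build_chat_payload(MODEL, PROMPT, "fp8-vllm-0-21-0-cuda13", "H200", "eu-north2")
# No gpu_type or region: Forge picks a compatible target.
unpinned = build_chat_payload(MODEL, PROMPT, "fp8-vllm-0-21-0-cuda13")

print(json.dumps(pinned, indent=2))
print(sorted(unpinned))
```

With the OpenAI SDK, the same optional fields would go in `extra_body`, as the SDK example on this page does for `model_version`.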

Target availability

8 free GPUs · Live capacity for H200 in eu-north2.

Request URL

https://YOUR_FORGE_HOST/v1/chat/completions

Authentication

Client auth: Set FORGE_API_KEY to a real Forge API key before running copied curl, fetch, or SDK snippets. Browser SSO only authenticates this web session.

Authorization: Bearer $FORGE_API_KEY

Pinned setup

export FORGE_API_BASE='https://YOUR_FORGE_HOST'
export FORGE_API_KEY="${FORGE_API_KEY:-replace-with-your-forge-api-key}"
export FORGE_REQUEST_TIMEOUT_SECONDS="${FORGE_REQUEST_TIMEOUT_SECONDS:-600}"
export FORGE_API_ROUTE='/v1/chat/completions'
export FORGE_OPENAI_BASE_URL='https://YOUR_FORGE_HOST/v1'
export MODEL_OR_FAMILY_SLUG='qwen-qwen3-30b-a3b-instruct-2507'
export FORGE_MODEL_VERSION='fp8-vllm-0-21-0-cuda13'
export FORGE_GPU_TYPE='H200'
export FORGE_REGION='eu-north2'
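
FORGE_API_BASE and FORGE_OPENAI_BASE_URL differ only by the /v1 suffix, and the shell helper `forge_api_url` in the command snippets on this page avoids doubling that suffix when the base already ends in /v1. The following is a rough Python port of that helper for app code. It is a sketch that mirrors the shell `case` patterns, not an official client function.

```python
def forge_api_url(base: str, endpoint: str) -> str:
    """Join a base URL and an endpoint without producing '/v1/v1'."""
    # Drop a single trailing slash, like ${FORGE_API_BASE%/} in the shell helper.
    if base.endswith("/"):
        base = base[:-1]
    starts_with_v1 = endpoint == "/v1" or endpoint.startswith(("/v1/", "/v1?"))
    if base.endswith("/v1") and starts_with_v1:
        # The base already carries /v1, so strip it from the endpoint.
        endpoint = endpoint[len("/v1"):]
    return base + endpoint


print(forge_api_url("https://YOUR_FORGE_HOST", "/v1/chat/completions"))
print(forge_api_url("https://YOUR_FORGE_HOST/v1", "/v1/chat/completions"))
```

Both calls print the same request URL, so either environment variable can be used as the base.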

Project .env

Copy these values into a local .env file when moving the selected target into an app or SDK client.

# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE="https://YOUR_FORGE_HOST"
FORGE_API_ROUTE="/v1/chat/completions"
FORGE_OPENAI_BASE_URL="https://YOUR_FORGE_HOST/v1"
FORGE_API_KEY="replace-with-your-forge-api-key"
FORGE_REQUEST_TIMEOUT_SECONDS="600"
MODEL_OR_FAMILY_SLUG="qwen-qwen3-30b-a3b-instruct-2507"
FORGE_MODEL_VERSION="fp8-vllm-0-21-0-cuda13"
FORGE_GPU_TYPE="H200"
FORGE_REGION="eu-north2"
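
If the app does not already use a dotenv library, the values above can be read with a few lines of standard-library Python. This is a minimal sketch that assumes the simple `KEY="value"` layout shown here; `parse_env` is a hypothetical helper and does not handle multi-line values, escapes, or `export` prefixes.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


SAMPLE = '''# Forge selected target
FORGE_API_BASE="https://YOUR_FORGE_HOST"
FORGE_REQUEST_TIMEOUT_SECONDS="600"
FORGE_GPU_TYPE="H200"
FORGE_REGION="eu-north2"
'''

settings = parse_env(SAMPLE)
print(settings["FORGE_GPU_TYPE"], settings["FORGE_REGION"])
print(float(settings["FORGE_REQUEST_TIMEOUT_SECONDS"]))
```

In an app, the text would come from reading the local .env file instead of `SAMPLE`.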

Project .gitignore

Add these rules before replacing the placeholder API key so local Forge secrets stay out of commits while .env.example can remain tracked.

# Forge local API secrets
.env
.env.*
!.env.example

Preflight URLs and commands

Run estimate URL

https://YOUR_FORGE_HOST/v1/models/qwen-qwen3-30b-a3b-instruct-2507/run-estimate?model_version=fp8-vllm-0-21-0-cuda13&gpu_type=H200&region=eu-north2
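
The run estimate URL above follows a simple pattern: the model slug in the path and the optional routing fields as query parameters. Here is a small sketch of building it programmatically. `run_estimate_url` is a hypothetical helper, and the response schema is not shown on this page, so the sketch only constructs the URL.

```python
from urllib.parse import quote, urlencode


def run_estimate_url(base, model, model_version=None, gpu_type=None, region=None):
    """Build the run-estimate URL, adding only the routing fields that are set."""
    fields = (("model_version", model_version), ("gpu_type", gpu_type), ("region", region))
    params = {key: value for key, value in fields if value}
    url = base.rstrip("/") + "/v1/models/" + quote(model, safe="") + "/run-estimate"
    if params:
        url += "?" + urlencode(params)
    return url


url = run_estimate_url(
    "https://YOUR_FORGE_HOST",
    "qwen-qwen3-30b-a3b-instruct-2507",
    model_version="fp8-vllm-0-21-0-cuda13",
    gpu_type="H200",
    region="eu-north2",
)
print(url)
```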

Selected target reliability

set -euo pipefail
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
case "${FORGE_API_KEY:-}" in
  ""|replace-with-your-forge-api-key)
    echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
    exit 1
    ;;
esac
forge_api_url() {
  endpoint="$1"
  base="${FORGE_API_BASE%/}"
  case "$base:$endpoint" in
    */v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
    *) printf '%s%s\n' "$base" "$endpoint" ;;
  esac
}
reliability_path="$(python3 -c 'import os
from urllib.parse import quote, urlencode

model = os.environ.get("MODEL_OR_FAMILY_SLUG", "").strip()
if not model:
    raise SystemExit("Set MODEL_OR_FAMILY_SLUG from search or route finder output before checking reliability.")
params = {}
model_version = os.environ.get("FORGE_MODEL_VERSION", "").strip()
if model_version:
    params["model_version"] = model_version
gpu_type = os.environ.get("FORGE_GPU_TYPE", "").strip()
if gpu_type:
    params["gpu_type"] = gpu_type
region = os.environ.get("FORGE_REGION", "").strip()
if region:
    params["region"] = region
path = "/v1/models/" + quote(model, safe="") + "/reliability"
if params:
    path += "?" + urlencode(params)
print(path)')"
curl -sS --fail-with-body "$(forge_api_url "$reliability_path")" \
  --max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
  -H "Authorization: Bearer ${FORGE_API_KEY}" | \
python3 -c 'import json, shlex, sys

payload = json.load(sys.stdin)
print(
    f"{payload.get('\''slug'\'')} reliability={payload.get('\''reliability_status'\'')} "
    f"supported={payload.get('\''supported_rows'\'', 0)}/{payload.get('\''total_rows'\'', 0)}"
)
filters = payload.get("filters") or {}
if filters:
    print("filters: " + ", ".join(f"{key}={value}" for key, value in filters.items()))

def describe_target(target):
    details = []
    request_ms = target.get("request_ms_p50") or target.get("request_ms")
    if request_ms is not None:
        details.append(f"p50={request_ms}ms")
    warm_cost = target.get("estimated_warm_request_cost_usd")
    if warm_cost is not None:
        details.append(f"warm_cost_usd={warm_cost}")
    elif target.get("cost_per_gpu_hour_usd") is not None:
        details.append(f"gpu_hour_usd={target['\''cost_per_gpu_hour_usd'\'']}")
    success_rate = target.get("observed_success_rate")
    if isinstance(success_rate, (int, float)):
        details.append(f"success={success_rate:.0%}")
    return ", ".join(details) or target.get("status") or "supported"

exports = {}
for label, key in (
    ("fastest supported", "fastest_supported_target"),
    ("lowest-cost supported", "lowest_cost_supported_target"),
):
    target = payload.get(key) or {}
    gpu_type = target.get("gpu_type")
    if not gpu_type:
        continue
    identity = (str(gpu_type), str(target.get("region") or ""))
    exports.setdefault(identity, {"labels": [], "target": target})["labels"].append(label)

if not exports:
    print("No supported GPU/region target returned.", file=sys.stderr)
    print(json.dumps({
        "status_counts": payload.get("status_counts", {}),
        "failure_reason_counts": payload.get("failure_reason_counts", {}),
    }, indent=2))
    raise SystemExit(1)

for (gpu_type, region), entry in exports.items():
    assignments = [f"FORGE_GPU_TYPE={shlex.quote(gpu_type)}"]
    if region:
        assignments.append(f"FORGE_REGION={shlex.quote(region)}")
    labels = " + ".join(entry["labels"])
    details = describe_target(entry["target"])
    print(f"export {'\'' '\''.join(assignments)}  # {labels}: {details}")'

Runtime status URL

https://YOUR_FORGE_HOST/v1/model-families/qwen-qwen3-30b-a3b-instruct-2507/status?version=fp8-vllm-0-21-0-cuda13

Runtime warmup command

set -euo pipefail
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
export FORGE_KEEP_WARM=${FORGE_KEEP_WARM:-false}
case "${FORGE_API_KEY:-}" in
  ""|replace-with-your-forge-api-key)
    echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
    exit 1
    ;;
esac
forge_api_url() {
  endpoint="$1"
  base="${FORGE_API_BASE%/}"
  case "$base:$endpoint" in
    */v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
    *) printf '%s%s\n' "$base" "$endpoint" ;;
  esac
}
runtime_start_path="$(python3 -c 'import os
from urllib.parse import quote

model = os.environ.get("MODEL_OR_FAMILY_SLUG", "").strip()
if not model:
    raise SystemExit("Set MODEL_OR_FAMILY_SLUG from the model picker output")
print("/v1/model-families/" + quote(model, safe="") + "/start")')"
python3 -c 'import json, os

def env_value(name):
    value = os.environ.get(name, "").strip()
    return value or None

payload = {}
version = env_value("FORGE_MODEL_VERSION")
if version:
    payload["version"] = version
gpu_type = env_value("FORGE_GPU_TYPE")
if gpu_type:
    payload["gpu_type"] = gpu_type
region = env_value("FORGE_REGION")
if region:
    payload["region"] = region
keep_warm = env_value("FORGE_KEEP_WARM")
payload["run_until_stopped"] = (keep_warm or "").lower() in {"1", "true", "yes", "on"}
print(json.dumps(payload))' | \
curl -sS --fail-with-body "$(forge_api_url "$runtime_start_path")" \
  --max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
  -X POST \
  -H "Authorization: Bearer ${FORGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @- | \
python3 -c 'import json, sys

payload = json.load(sys.stdin)
slug = payload.get("slug") or "runtime"
gpu_type = payload.get("gpu_type") or "scheduler-selected GPU"
region = payload.get("region") or "scheduler-selected region"
startup_ms = payload.get("startup_ms")
state = "cold-started" if payload.get("was_cold_start") else "already warm"
suffix = f"; startup_ms={startup_ms}" if startup_ms is not None else ""
print(f"{slug} {state} on {gpu_type} in {region}{suffix}; keep_warm={payload.get('\''keep_warm'\'')}")'

OpenAI base URL

https://YOUR_FORGE_HOST/v1

GPU performance

Pick a verified target for repeatable runs. Failed or pending details appear on the status hover.

Runs on · FP8 vLLM 0.21.0 CUDA 13

30.5 B params · weights FP8 · floor 48 GB
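
As a rough sanity check on these figures, weight memory can be estimated from the parameter count and the bytes per parameter (1 for FP8, 2 for BF16). The sketch below covers weights only and uses decimal gigabytes. It ignores KV cache, activations, and quantization scales, which is presumably why the observed working set and the configured floor are higher than the weights-only number.

```python
# Bytes stored per parameter for each weights dtype.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) for a given dtype."""
    return params_billion * BYTES_PER_PARAM[dtype]


print(weight_gb(30.5, "fp8"))   # 30.5
print(weight_gb(30.5, "bf16"))  # 61.0
```

Under these assumptions the FP8 weights alone fit below the 48 GB floor, while BF16 weights of the same model would not.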

Target readiness

4 verified targets

4/6 verified · 1 awaiting probe · 1 unavailable

Fastest verified

H200 in eu-north2

Lowest warm model time among verified targets: 877 ms p50 warm model time across 10 samples.

Model time: 877 ms (p50 warm model time)
p95 919 ms · p99 929 ms · 10 samples
Cold start: 4m 10s

Most affordable

RTX6000 in us-central1

Lowest estimated GPU price among verified targets: $1.80/GPU-hr; 1.2s p50 warm model time across 10 samples.

Model time: 1.2 s (p50 warm model time)
p95 1.3s · p99 1.3s · 10 samples
Cold start: 11m 21s

GPU	Region	Status	Peak VRAM	Cold start	Model time (p50 warm)	p95	p99	Samples	Relative	Tokens/s	Est. $/GPU-hr
B200	us-central1	incompatible	28.4 GB	—	—	—	—	—	—	—	$7.15
B300	uk-south1	works	241.4 GB	2m 43s	1.5 s	1.6 s	1.6 s	10	60% (-40%)	168	$7.85
H100	—	not probed	—	—	—	—	—	—	—	—	—
H200 (fastest)	eu-north2	works	126.2 GB	4m 10s	877 ms	919 ms	929 ms	10	100%	296	$4.50
L40S	eu-north1	works	40.3 GB	3m 0s	1.7 s	1.8 s	1.8 s	10	53% (-47%)	146	$1.82
RTX6000	us-central1	works	85.9 GB	11m 21s	1.2 s	1.3 s	1.3 s	10	70% (-30%)	206	$1.80

How we measure

Model time is the p50 warm model-reported execution time when available, falling back to the latest probe time; p95/p99 and sample count appear when there is enough probe history. Cold start excludes the first (uncached) run. VRAM is the peak GPU memory seen during the probe. Relative compares each row's model time to the highlighted baseline (the fastest row by default; hover any row to re-root). The fastest chip marks only verified, supported GPU-region rows. Est. $/GPU-hr is the estimated on-demand GPU price (Nebius pay-as-you-go), shown for performance/price comparison. Configured minimum GPU memory: 48 GB.
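
To make the performance/price comparison concrete, the table's hourly prices can be combined with model time and cold-start time. This is a back-of-the-envelope sketch, assuming cost scales linearly with the seconds one GPU is occupied. That billing model is an assumption, and `gpu_time_cost_usd` is an illustrative helper; the run-estimate endpoint is the better source for real numbers.

```python
def gpu_time_cost_usd(price_per_gpu_hour: float, seconds: float) -> float:
    """Cost of occupying one GPU for `seconds` at an hourly price."""
    return price_per_gpu_hour / 3600.0 * seconds


# (est. $/GPU-hr, p50 warm model time in s, cold start in s), copied from the table.
TARGETS = {
    "H200 eu-north2": (4.50, 0.877, 4 * 60 + 10),
    "RTX6000 us-central1": (1.80, 1.2, 11 * 60 + 21),
    "L40S eu-north1": (1.82, 1.7, 3 * 60 + 0),
    "B300 uk-south1": (7.85, 1.5, 2 * 60 + 43),
}

for name, (price, warm_s, cold_s) in TARGETS.items():
    warm = gpu_time_cost_usd(price, warm_s)
    # A first cold request pays for startup plus the request itself.
    first_cold = gpu_time_cost_usd(price, cold_s + warm_s)
    print(f"{name}: warm request ~${warm:.5f}, first cold request ~${first_cold:.3f}")
```

Under this assumption a cold start dominates the cost of the first request on every target, which is one reason to prewarm before latency-sensitive traffic.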

Try it out

cold · General

Leave GPU on “Any available GPU” to use a warm or verified backend automatically.

Inputs

Prompt · string

Temperature · number · 0.2

Top P · number · 0.8

Max Tokens · number · 512

API examples

Use the API

API docs

Snippet target: qwen-qwen3-30b-a3b-instruct-2507 version fp8-vllm-0-21-0-cuda13 using scheduler-selected GPU/region.

Client auth: Set FORGE_API_KEY to a real Forge API key before running copied curl, fetch, or SDK snippets. Browser SSO only authenticates this web session.

import os

from openai import OpenAI

api_base = os.environ.get("FORGE_API_BASE", "https://YOUR_FORGE_HOST").rstrip("/")
openai_base = os.environ.get("FORGE_OPENAI_BASE_URL", "").strip().rstrip("/")
if not openai_base:
    openai_base = api_base if api_base.endswith("/v1") else f"{api_base}/v1"
request_timeout_seconds = float(os.environ.get("FORGE_REQUEST_TIMEOUT_SECONDS", "600"))
api_key = os.environ.get("FORGE_API_KEY")
if not api_key or api_key == "replace-with-your-forge-api-key":
    raise SystemExit("Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.")


client = OpenAI(
    api_key=api_key,
    base_url=openai_base,
    timeout=request_timeout_seconds,
)

response = client.chat.completions.create(
    model="qwen-qwen3-30b-a3b-instruct-2507",
    top_p=0.8,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": "You are reviewing a model deployment plan. Summarize the main risk, list two validation steps, and give one concise recommendation:\n\nA new 30B-A3B FP8 MoE chat model will reuse an existing vLLM CUDA 13 image, hydrate public safetensors from Hugging Face into the shared cache, and start with a 32K context cap before broader long-context probes."
        }
    ],
    max_tokens=512,
    temperature=0.2,
    extra_body={
        "model_version": "fp8-vllm-0-21-0-cuda13"
    },
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

Setup & .env

Install for OpenAI SDK

Copy setup before the request when moving this snippet into a fresh shell. The default 600 second timeout is intentional for GPU cold starts and can be overridden with FORGE_REQUEST_TIMEOUT_SECONDS.

python3 -m pip install --upgrade openai
export FORGE_API_BASE='https://YOUR_FORGE_HOST'
export FORGE_API_KEY="${FORGE_API_KEY:-replace-with-your-forge-api-key}"
export FORGE_REQUEST_TIMEOUT_SECONDS="${FORGE_REQUEST_TIMEOUT_SECONDS:-600}"
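
The 600 second default can be compared against the cold-start times listed under GPU performance. The sketch below parses those durations and flags targets whose measured cold start exceeds the timeout. `duration_seconds` is a hypothetical helper, and it assumes the probe's cold-start time is a fair proxy for what a first request would wait.

```python
import re


def duration_seconds(text: str) -> int:
    """Convert a duration such as '4m 10s' or '45s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


TIMEOUT_SECONDS = 600  # default FORGE_REQUEST_TIMEOUT_SECONDS from this page

# Cold-start values copied from the GPU performance table.
COLD_STARTS = {"B300": "2m 43s", "H200": "4m 10s", "L40S": "3m 0s", "RTX6000": "11m 21s"}

too_slow = {
    gpu: duration_seconds(value)
    for gpu, value in COLD_STARTS.items()
    if duration_seconds(value) > TIMEOUT_SECONDS
}
print(too_slow)
```

For any target flagged here, it may be worth raising FORGE_REQUEST_TIMEOUT_SECONDS or prewarming before sending the first request.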

Project .env

Copy these values into a local .env file when moving the selected target into an app or SDK client.

# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13
FORGE_API_BASE="https://YOUR_FORGE_HOST"
FORGE_API_ROUTE="/v1/chat/completions"
FORGE_OPENAI_BASE_URL="https://YOUR_FORGE_HOST/v1"
FORGE_API_KEY="replace-with-your-forge-api-key"
FORGE_REQUEST_TIMEOUT_SECONDS="600"
MODEL_OR_FAMILY_SLUG="qwen-qwen3-30b-a3b-instruct-2507"
FORGE_MODEL_VERSION="fp8-vllm-0-21-0-cuda13"

Deploy to Nebius Serverless

Run a dedicated, autoscaling endpoint in your own Nebius account. The endpoint runs under your account and billing — Forge just pre-fills the configuration for you.

Deploy in your Nebius account ↗

Opens the Nebius Console with the image pre-filled for Qwen3 30B A3B Instruct 2507 FP8 (Forge version FP8 vLLM 0.21.0 CUDA 13).

Need a throughput- and cost-optimized build tuned for specific Nebius GPUs? Nebius Token Factory is coming soon — contact your Nebius account team for early access.