Qwen3 30B A3B Instruct 2507 FP8 is a public Apache-2.0 non-thinking MoE chat model for general instruction following, coding, multilingual knowledge, long-context understanding, and tool-use style prompts.
Observed working set on a supported GPU.
After the last request a backend stays warm on its GPU for about 15 minutes, then frees the GPU. The next request triggers a fresh cold start.
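As a rough client-side illustration of the warm window described above, the sketch below guesses whether the next request is likely to hit a cold start. It assumes the window is exactly 15 minutes (the text only says "about"), and `WARM_WINDOW_SECONDS` and `likely_cold_start` are hypothetical names, not part of any Forge API. It only tracks requests sent by this client, so the backend may still be warm from other traffic.

```python
import time
from typing import Optional

# Assumption: "about 15 minutes" taken literally; the real window may differ.
WARM_WINDOW_SECONDS = 15 * 60


def likely_cold_start(last_request_at: Optional[float],
                      now: Optional[float] = None) -> bool:
    """Guess whether the next request will trigger a cold start.

    Both arguments are Unix timestamps in seconds. This is a local
    heuristic only; it cannot see requests sent by other clients.
    """
    if last_request_at is None:
        # Nothing recorded by this client yet, so assume the worst case.
        return True
    if now is None:
        now = time.time()
    return (now - last_request_at) >= WARM_WINDOW_SECONDS
```

A client could use this to decide when to show a "warming up" message or to allow the longer 600 second timeout used by the snippets on this page.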
Pass the model slug in the model field when using the OpenAI-compatible SDK or the curl snippets from the API docs.
POST /v1/chat/completions
Model: qwen-qwen3-30b-a3b-instruct-2507
model_version: fp8-vllm-0-21-0-cuda13
gpu_type: H200
region: eu-north2
Terminal-ready smoke test for this selected target.
set -euo pipefail
# Forge API smoke test
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
case "${FORGE_API_KEY:-}" in
""|replace-with-your-forge-api-key)
echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
exit 1
;;
esac
forge_api_url() {
endpoint="$1"
base="${FORGE_API_BASE%/}"
case "$base:$endpoint" in
*/v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
*) printf '%s%s\n' "$base" "$endpoint" ;;
esac
}
python3 - <<'PY' |
import json
import os
# Build the request body; the optional targeting fields are added only when set.
payload = {
    "model": os.environ["MODEL_OR_FAMILY_SLUG"],
    "messages": [
        {"role": "user", "content": "Write a one sentence status update."},
    ],
}
model_version = os.environ.get("FORGE_MODEL_VERSION")
if model_version:
    payload["model_version"] = model_version
gpu_type = os.environ.get("FORGE_GPU_TYPE", "H200")
if gpu_type:
    payload["gpu_type"] = gpu_type
region = os.environ.get("FORGE_REGION", "eu-north2")
if region:
    payload["region"] = region
print(json.dumps(payload))
PY
curl -sS --fail-with-body "$(forge_api_url '/v1/chat/completions')" \
--max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
-X POST \
-H "Authorization: Bearer ${FORGE_API_KEY}" \
-H "Content-Type: application/json" \
-d @- | \
python3 -c 'import json, sys
data = json.load(sys.stdin)
message = (((data.get("choices") or [{}])[0].get("message") or {}).get("content"))
if message:
    print(message)
else:
    print(json.dumps(data, indent=2))'
OpenAI SDK route · Best for chat, completions, embeddings, and clients that already support an OpenAI-compatible base URL.
Copied snippets include gpu_type and region, so the first request targets this verified GPU and region. Remove those fields to let Forge choose another compatible target.
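To make the pinned versus scheduler-selected distinction concrete, here is a minimal sketch of building the request body with and without the targeting fields. The field names (`model`, `model_version`, `gpu_type`, `region`) come from the smoke test on this page; `build_chat_payload` is a hypothetical helper written for illustration, not a Forge SDK function.

```python
import json
from typing import Optional


def build_chat_payload(model: str,
                       prompt: str,
                       model_version: Optional[str] = None,
                       gpu_type: Optional[str] = None,
                       region: Optional[str] = None) -> dict:
    """Build a chat completions body, adding targeting fields only when set."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    optional = (
        ("model_version", model_version),
        ("gpu_type", gpu_type),
        ("region", region),
    )
    for key, value in optional:
        if value:
            payload[key] = value
    return payload


# Pinned to the verified GPU and region shown on this page.
pinned = build_chat_payload(
    "qwen-qwen3-30b-a3b-instruct-2507",
    "Write a one sentence status update.",
    model_version="fp8-vllm-0-21-0-cuda13",
    gpu_type="H200",
    region="eu-north2",
)

# No gpu_type or region: the choice of target is left to Forge.
unpinned = build_chat_payload(
    "qwen-qwen3-30b-a3b-instruct-2507",
    "Write a one sentence status update.",
    model_version="fp8-vllm-0-21-0-cuda13",
)

print(json.dumps(pinned))
print(json.dumps(unpinned))
```

Keeping the optional fields out of the body entirely, rather than sending empty strings, mirrors what the copied shell snippet does.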
8 free GPUs · Live capacity for H200 in eu-north2.
https://YOUR_FORGE_HOST/v1/chat/completions
Client auth: Set FORGE_API_KEY to a real Forge API key before running copied curl, fetch, or SDK snippets. Browser SSO only authenticates this web session.
Authorization: Bearer $FORGE_API_KEY
export FORGE_API_BASE='https://YOUR_FORGE_HOST'
export FORGE_API_KEY="${FORGE_API_KEY:-replace-with-your-forge-api-key}"
export FORGE_REQUEST_TIMEOUT_SECONDS="${FORGE_REQUEST_TIMEOUT_SECONDS:-600}"
export FORGE_API_ROUTE='/v1/chat/completions'
export FORGE_OPENAI_BASE_URL='https://YOUR_FORGE_HOST/v1'
export MODEL_OR_FAMILY_SLUG='qwen-qwen3-30b-a3b-instruct-2507'
export FORGE_MODEL_VERSION='fp8-vllm-0-21-0-cuda13'
export FORGE_GPU_TYPE='H200'
export FORGE_REGION='eu-north2'
Copy these values into a local .env file when moving the selected target into an app or SDK client.
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE="https://YOUR_FORGE_HOST"
FORGE_API_ROUTE="/v1/chat/completions"
FORGE_OPENAI_BASE_URL="https://YOUR_FORGE_HOST/v1"
FORGE_API_KEY="replace-with-your-forge-api-key"
FORGE_REQUEST_TIMEOUT_SECONDS="600"
MODEL_OR_FAMILY_SLUG="qwen-qwen3-30b-a3b-instruct-2507"
FORGE_MODEL_VERSION="fp8-vllm-0-21-0-cuda13"
FORGE_GPU_TYPE="H200"
FORGE_REGION="eu-north2"
Add these rules to .gitignore before replacing the placeholder API key so local Forge secrets stay out of commits while .env.example can remain tracked.
# Forge local API secrets
.env
.env.*
!.env.example
https://YOUR_FORGE_HOST/v1/models/qwen-qwen3-30b-a3b-instruct-2507/run-estimate?model_version=fp8-vllm-0-21-0-cuda13&gpu_type=H200&region=eu-north2
set -euo pipefail
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
case "${FORGE_API_KEY:-}" in
""|replace-with-your-forge-api-key)
echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
exit 1
;;
esac
forge_api_url() {
endpoint="$1"
base="${FORGE_API_BASE%/}"
case "$base:$endpoint" in
*/v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
*) printf '%s%s\n' "$base" "$endpoint" ;;
esac
}
reliability_path="$(python3 -c 'import os
from urllib.parse import quote, urlencode
model = os.environ.get("MODEL_OR_FAMILY_SLUG", "").strip()
if not model:
    raise SystemExit("Set MODEL_OR_FAMILY_SLUG from search or route finder output before checking reliability.")
# Only non-empty filters are added to the query string.
params = {}
model_version = os.environ.get("FORGE_MODEL_VERSION", "").strip()
if model_version:
    params["model_version"] = model_version
gpu_type = os.environ.get("FORGE_GPU_TYPE", "").strip()
if gpu_type:
    params["gpu_type"] = gpu_type
region = os.environ.get("FORGE_REGION", "").strip()
if region:
    params["region"] = region
path = "/v1/models/" + quote(model, safe="") + "/reliability"
if params:
    path += "?" + urlencode(params)
print(path)')"
curl -sS --fail-with-body "$(forge_api_url "$reliability_path")" \
--max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
-H "Authorization: Bearer ${FORGE_API_KEY}" | \
python3 -c 'import json, shlex, sys
payload = json.load(sys.stdin)
print(
    f"{payload.get('\''slug'\'')} reliability={payload.get('\''reliability_status'\'')} "
    f"supported={payload.get('\''supported_rows'\'', 0)}/{payload.get('\''total_rows'\'', 0)}"
)
filters = payload.get("filters") or {}
if filters:
    print("filters: " + ", ".join(f"{key}={value}" for key, value in filters.items()))
def describe_target(target):
    details = []
    request_ms = target.get("request_ms_p50") or target.get("request_ms")
    if request_ms is not None:
        details.append(f"p50={request_ms}ms")
    warm_cost = target.get("estimated_warm_request_cost_usd")
    if warm_cost is not None:
        details.append(f"warm_cost_usd={warm_cost}")
    elif target.get("cost_per_gpu_hour_usd") is not None:
        details.append(f"gpu_hour_usd={target['\''cost_per_gpu_hour_usd'\'']}")
    success_rate = target.get("observed_success_rate")
    if isinstance(success_rate, (int, float)):
        details.append(f"success={success_rate:.0%}")
    return ", ".join(details) or target.get("status") or "supported"
# Group the fastest and lowest-cost targets by (gpu_type, region) so a shared target prints once.
exports = {}
for label, key in (
    ("fastest supported", "fastest_supported_target"),
    ("lowest-cost supported", "lowest_cost_supported_target"),
):
    target = payload.get(key) or {}
    gpu_type = target.get("gpu_type")
    if not gpu_type:
        continue
    identity = (str(gpu_type), str(target.get("region") or ""))
    exports.setdefault(identity, {"labels": [], "target": target})["labels"].append(label)
if not exports:
    print("No supported GPU/region target returned.", file=sys.stderr)
    print(json.dumps({
        "status_counts": payload.get("status_counts", {}),
        "failure_reason_counts": payload.get("failure_reason_counts", {}),
    }, indent=2))
    raise SystemExit(1)
for (gpu_type, region), entry in exports.items():
    assignments = [f"FORGE_GPU_TYPE={shlex.quote(gpu_type)}"]
    if region:
        assignments.append(f"FORGE_REGION={shlex.quote(region)}")
    labels = " + ".join(entry["labels"])
    details = describe_target(entry["target"])
    print(f"export {'\'' '\''.join(assignments)} # {labels}: {details}")'
https://YOUR_FORGE_HOST/v1/model-families/qwen-qwen3-30b-a3b-instruct-2507/status?version=fp8-vllm-0-21-0-cuda13
set -euo pipefail
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13 gpu=H200 region=eu-north2
FORGE_API_BASE=${FORGE_API_BASE:-'https://YOUR_FORGE_HOST'}
export MODEL_OR_FAMILY_SLUG=${MODEL_OR_FAMILY_SLUG:-'qwen-qwen3-30b-a3b-instruct-2507'}
export FORGE_MODEL_VERSION=${FORGE_MODEL_VERSION:-'fp8-vllm-0-21-0-cuda13'}
export FORGE_GPU_TYPE=${FORGE_GPU_TYPE:-'H200'}
export FORGE_REGION=${FORGE_REGION:-'eu-north2'}
export FORGE_KEEP_WARM=${FORGE_KEEP_WARM:-false}
case "${FORGE_API_KEY:-}" in
""|replace-with-your-forge-api-key)
echo 'Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.' >&2
exit 1
;;
esac
forge_api_url() {
endpoint="$1"
base="${FORGE_API_BASE%/}"
case "$base:$endpoint" in
*/v1:/v1|*/v1:/v1/*|*/v1:/v1\?*) printf '%s%s\n' "$base" "${endpoint#/v1}" ;;
*) printf '%s%s\n' "$base" "$endpoint" ;;
esac
}
runtime_start_path="$(python3 -c 'import os
from urllib.parse import quote
model = os.environ.get("MODEL_OR_FAMILY_SLUG", "").strip()
if not model:
    raise SystemExit("Set MODEL_OR_FAMILY_SLUG from the model picker output")
print("/v1/model-families/" + quote(model, safe="") + "/start")')"
python3 -c 'import json, os
def env_value(name):
    value = os.environ.get(name, "").strip()
    return value or None
payload = {}
version = env_value("FORGE_MODEL_VERSION")
if version:
    payload["version"] = version
gpu_type = env_value("FORGE_GPU_TYPE")
if gpu_type:
    payload["gpu_type"] = gpu_type
region = env_value("FORGE_REGION")
if region:
    payload["region"] = region
# FORGE_KEEP_WARM accepts 1/true/yes/on (case-insensitive); anything else means false.
keep_warm = env_value("FORGE_KEEP_WARM")
payload["run_until_stopped"] = (keep_warm or "").lower() in {"1", "true", "yes", "on"}
print(json.dumps(payload))' | \
curl -sS --fail-with-body "$(forge_api_url "$runtime_start_path")" \
--max-time "${FORGE_REQUEST_TIMEOUT_SECONDS:-600}" \
-X POST \
-H "Authorization: Bearer ${FORGE_API_KEY}" \
-H "Content-Type: application/json" \
-d @- | \
python3 -c 'import json, sys
payload = json.load(sys.stdin)
slug = payload.get("slug") or "runtime"
gpu_type = payload.get("gpu_type") or "scheduler-selected GPU"
region = payload.get("region") or "scheduler-selected region"
startup_ms = payload.get("startup_ms")
state = "cold-started" if payload.get("was_cold_start") else "already warm"
suffix = f"; startup_ms={startup_ms}" if startup_ms is not None else ""
print(f"{slug} {state} on {gpu_type} in {region}{suffix}; keep_warm={payload.get('\''keep_warm'\'')}")'
https://YOUR_FORGE_HOST/v1
Pick a verified target for repeatable runs. Failed or pending details appear on the status hover.
4 verified targets
Lowest warm model time among verified targets: 877 ms p50 warm model time across 10 samples.
Lowest estimated GPU price among verified targets: $1.80/GPU-hr; 1.2s p50 warm model time across 10 samples.
| GPU | Region | Status | VRAM | Cold start | Model time | p95 / p99 (samples) | Relative | Tokens/s | Est. $/GPU-hr |
|---|---|---|---|---|---|---|---|---|---|
| B200 | us-central1 | incompatible | 28.4 GB | — | — | — | — | — | $7.15 |
| B300 | uk-south1 | works | 241.4 GB | 2m 43s | 1.5s | 1.6s / 1.6s (10) | 60% (-40%) | 168 | $7.85 |
| H100 | — | not probed | — | — | — | — | — | — | — |
| H200 (fastest) | eu-north2 | works | 126.2 GB | 4m 10s | 877 ms | 919 ms / 929 ms (10) | 100% | 296 | $4.50 |
| L40S | eu-north1 | works | 40.3 GB | 3m 0s | 1.7s | 1.8s / 1.8s (10) | 53% (-47%) | 146 | $1.82 |
| RTX6000 | us-central1 | works | 85.9 GB | 11m 21s | 1.2s | 1.3s / 1.3s (10) | 70% (-30%) | 206 | $1.80 |
Model time uses the p50 warm model-reported execution time when available, then falls back to the latest probe time; p95/p99 and sample count appear when there is enough probe history. Cold start excludes the first (uncached) run. VRAM is the peak GPU memory seen during the probe. Relative compares each row's model time to the highlighted baseline (fastest row by default; hover any row to re-root). The fastest marker is applied only to verified, supported GPU-region rows. Est. $/GPU-hr is the estimated on-demand GPU price (Nebius pay-as-you-go), shown for performance/price comparison. Configured minimum GPU memory: 48 GB.
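The page does not state how Relative is computed, but the published figures look consistent with dividing the baseline's model time by each row's model time. The sketch below works through that arithmetic and adds a rough per-request cost. Both formulas are assumptions: the cost estimate supposes a warm request occupies one GPU for exactly the p50 model time, ignoring batching and cold starts. `relative_speed` and `cost_per_warm_request_usd` are hypothetical helper names. Because the table shows rounded times, the computed percentages can differ from the displayed ones by a few points.

```python
# (GPU, p50 warm model time in seconds, estimated $/GPU-hr), copied from the table.
ROWS = [
    ("B300", 1.5, 7.85),
    ("H200", 0.877, 4.50),
    ("L40S", 1.7, 1.82),
    ("RTX6000", 1.2, 1.80),
]


def relative_speed(baseline_s: float, row_s: float) -> float:
    """Assumed formula: baseline model time divided by this row's model time."""
    return baseline_s / row_s


def cost_per_warm_request_usd(price_per_gpu_hour: float, model_time_s: float) -> float:
    """Assumed formula: hourly price prorated over the model time of one request."""
    return price_per_gpu_hour / 3600.0 * model_time_s


# The fastest row is the default baseline.
baseline = min(model_time for _, model_time, _ in ROWS)

for gpu, model_time, price in ROWS:
    rel = relative_speed(baseline, model_time)
    cost = cost_per_warm_request_usd(price, model_time)
    print(f"{gpu}: relative={rel:.0%} est_cost_per_warm_request=${cost:.6f}")
```

When the reliability endpoint returns `estimated_warm_request_cost_usd` for a target, that value should be preferred over this back-of-the-envelope estimate.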
/v1/chat/completions
Model: qwen-qwen3-30b-a3b-instruct-2507
Version: fp8-vllm-0-21-0-cuda13
GPU: automatic
Region: automatic
Snippet target: qwen-qwen3-30b-a3b-instruct-2507 version fp8-vllm-0-21-0-cuda13 using scheduler-selected GPU/region.
Client auth: Set FORGE_API_KEY to a real Forge API key before running copied curl, fetch, or SDK snippets. Browser SSO only authenticates this web session.
import os
from openai import OpenAI
# Derive the OpenAI-compatible base URL without doubling a trailing /v1.
api_base = os.environ.get("FORGE_API_BASE", "https://YOUR_FORGE_HOST").rstrip("/")
openai_base = os.environ.get("FORGE_OPENAI_BASE_URL", "").strip().rstrip("/")
if not openai_base:
    openai_base = api_base if api_base.endswith("/v1") else f"{api_base}/v1"
request_timeout_seconds = float(os.environ.get("FORGE_REQUEST_TIMEOUT_SECONDS", "600"))
api_key = os.environ.get("FORGE_API_KEY")
if not api_key or api_key == "replace-with-your-forge-api-key":
    raise SystemExit("Set FORGE_API_KEY to a real Forge API key before running this snippet; browser SSO sessions are not sent to copied curl or SDK clients.")
client = OpenAI(
    api_key=api_key,
    base_url=openai_base,
    timeout=request_timeout_seconds,
)
response = client.chat.completions.create(
    model="qwen-qwen3-30b-a3b-instruct-2507",
    top_p=0.8,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": "You are reviewing a model deployment plan. Summarize the main risk, list two validation steps, and give one concise recommendation:\n\nA new 30B-A3B FP8 MoE chat model will reuse an existing vLLM CUDA 13 image, hydrate public safetensors from Hugging Face into the shared cache, and start with a 32K context cap before broader long-context probes."
        }
    ],
    max_tokens=512,
    temperature=0.2,
    # Forge-specific fields are passed through extra_body.
    extra_body={
        "model_version": "fp8-vllm-0-21-0-cuda13"
    },
)
# Print streamed tokens as they arrive.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
Copy setup before the request when moving this snippet into a fresh shell. The default 600-second timeout is intentional for GPU cold starts and can be overridden with FORGE_REQUEST_TIMEOUT_SECONDS.
python3 -m pip install --upgrade openai
export FORGE_API_BASE='https://YOUR_FORGE_HOST'
export FORGE_API_KEY="${FORGE_API_KEY:-replace-with-your-forge-api-key}"
export FORGE_REQUEST_TIMEOUT_SECONDS="${FORGE_REQUEST_TIMEOUT_SECONDS:-600}"
Copy these values into a local .env file when moving the selected target into an app or SDK client.
# Forge selected target: route=/v1/chat/completions model=qwen-qwen3-30b-a3b-instruct-2507 version=fp8-vllm-0-21-0-cuda13
FORGE_API_BASE="https://YOUR_FORGE_HOST"
FORGE_API_ROUTE="/v1/chat/completions"
FORGE_OPENAI_BASE_URL="https://YOUR_FORGE_HOST/v1"
FORGE_API_KEY="replace-with-your-forge-api-key"
FORGE_REQUEST_TIMEOUT_SECONDS="600"
MODEL_OR_FAMILY_SLUG="qwen-qwen3-30b-a3b-instruct-2507"
FORGE_MODEL_VERSION="fp8-vllm-0-21-0-cuda13"
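For an app that reads the `.env` values above without extra dependencies, a minimal sketch might look like the following. `parse_env_text` and `openai_base_url` are hypothetical helpers written for illustration; the parser only handles the simple `KEY="value"` lines shown on this page, not the full dotenv syntax. The base URL logic follows the same rule as the SDK snippet: use FORGE_OPENAI_BASE_URL when set, otherwise append /v1 to FORGE_API_BASE unless it already ends with it.

```python
def parse_env_text(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def openai_base_url(values: dict) -> str:
    """Pick the OpenAI-compatible base URL without doubling /v1."""
    explicit = values.get("FORGE_OPENAI_BASE_URL", "").strip().rstrip("/")
    if explicit:
        return explicit
    base = values.get("FORGE_API_BASE", "").strip().rstrip("/")
    return base if base.endswith("/v1") else f"{base}/v1"


SAMPLE = """# Forge selected target
FORGE_API_BASE="https://YOUR_FORGE_HOST"
MODEL_OR_FAMILY_SLUG="qwen-qwen3-30b-a3b-instruct-2507"
FORGE_MODEL_VERSION="fp8-vllm-0-21-0-cuda13"
"""

config = parse_env_text(SAMPLE)
print(openai_base_url(config))
print(config["MODEL_OR_FAMILY_SLUG"])
```

In a real project, reading the file with `open(".env").read()` and passing the text to `parse_env_text` would be the natural next step; a dedicated dotenv library is a reasonable alternative if the file grows more complex.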
Run a dedicated, autoscaling endpoint in your own Nebius account. The endpoint runs under your account and billing — Forge just pre-fills the configuration for you.
Opens the Nebius Console with the image pre-filled for Qwen3 30B A3B Instruct 2507 FP8 (Forge version FP8 vLLM 0.21.0 CUDA 13).
Prefer to create the endpoint from the CLI, or self-manage the container image? Use the commands below.
The image is hosted on cr.eu-north1.nebius.cloud; you may need registry credentials in the Console form. The CLI below includes placeholders.
The links use Forge’s eu-north1 Nebius Container Registry mirror. If your project can’t pull that private mirror, add pull credentials or a registry secret.
# Runs in YOUR Nebius account (you own + pay for the endpoint).
# platform/preset must exist in your project — list them with:
# nebius compute platform list
export ENDPOINT_NAME="qwen-qwen3-30b-a3b-instruct-2507-fp8-vllm-cuda13-private"
export AUTH_TOKEN=$(openssl rand -hex 32)
export SUBNET_ID=$(nebius vpc subnet list --format jsonpath='{.items[0].metadata.id}')
export REGISTRY_USERNAME="YOUR_REGISTRY_USERNAME"
export REGISTRY_PASSWORD="YOUR_REGISTRY_PASSWORD"
# Note: the --image below points at Forge's regional Nebius CR mirror.
# Serverless AI can pull Container Registry images without credentials
# only when the image is public or in the same project. For a private
# mirror in another project, provide pull credentials or a MysteryBox
# registry secret with REGISTRY_USERNAME and REGISTRY_PASSWORD.
nebius ai endpoint create \
--name "$ENDPOINT_NAME" \
--image "cr.eu-north1.nebius.cloud/e00h91c5sa606xfwpj/models/vllm-vllm-openai:v0.21.0@sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9" \
--registry-username "$REGISTRY_USERNAME" \
--registry-password "$REGISTRY_PASSWORD" \
--container-port 8000 \
--auth token \
--token "$AUTH_TOKEN" \
--subnet-id "$SUBNET_ID"
export ENDPOINT_ID=$(nebius ai endpoint get-by-name --name "$ENDPOINT_NAME" --format jsonpath='{.metadata.id}')
nebius ai endpoint get "$ENDPOINT_ID"
Need a throughput- and cost-optimized build tuned for specific Nebius GPUs? Nebius Token Factory is coming soon; contact your Nebius account team for early access.