TrainingRuns in your Nebius account

Fine-tune on Nebius

See fine-tune performance per GPU for catalog models, measured on Forge’s own GPUs — then run the training in your own Nebius account, on your own data via a Nebius Jobs hand-off. You own the resulting weights and you pay for the GPUs; Forge doesn’t run training or touch your data.

Your accountYour dataYou own + pay

Models

Models benchmarked or being benchmarked for fine-tuning.

Onboarded

Have at least one measured per-GPU benchmark.

Benchmarking

Benchmarks in progress; numbers land soon.

Where it runs

Your account

Training runs via Nebius Jobs in your project.

Trainable models

Browse models benchmarked for fine-tuning. Open one to see per-GPU performance and start a job in your own account.

⌕

Workload

47 models

bigcode-starcoder2-7b · GeneralBenchmarking

BigCode StarCoder2 7B

BigCode StarCoder2 7B packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2. The playground uses the completion endpoint because this code model tokenizer does not provide a chat template.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

deepseek-ai-deepseek-r1-0528-qwen3-8b · GeneralBenchmarking

DeepSeek R1 0528 Qwen3 8B

DeepSeek R1 0528 Qwen3 8B is a public, non-gated MIT reasoning-oriented text-generation model distilled from DeepSeek-R1-0528 into a Qwen3 8B base model. Hugging Face metadata refreshed on 2026-05-28T14Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=mit, revision 6e8885a6ff5c1dc5201574c8fd700323f23c25fa, Qwen3ForCausalLM architecture, 131,072 max positions, and 8,190,735,360 BF16 safetensors parameters. The model card documents vLLM serving for deepseek-ai/DeepSeek-R1-0528-Qwen3-8B and reports the 8B distillation as state-of-the-art among open-source models on AIME 2024 at release time. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded first probes, enables the vLLM deepseek_r1 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-28T11Z hidden matrix probe validated B200, B300, H200, and L40S with 10 measured warm requests per cell; RTX6000 remains disabled because no current RTX6000 inventory cell was available. The 2026-05-28T14Z publication pass switched the playground to non-streaming JSON output so vLLM responses with message.reasoning and message.content=null remain visible until the web streaming parser grows first-class reasoning-delta rendering.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

deepseek-ai-deepseek-r1-distill-qwen-14b · GeneralBenchmarking

DeepSeek R1 Distill Qwen 14B

DeepSeek R1 Distill Qwen 14B is a public reasoning-oriented text-generation model derived from Qwen2.5-14B and fine-tuned with DeepSeek-R1 samples. Hugging Face metadata refreshed on 2026-05-25T16Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=mit, revision 1df8507178afcc1bef68cd8c393f61a886323761, Qwen2ForCausalLM architecture, 131072 max positions, and 14,770,033,664 BF16 safetensors parameters. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens, enables the vLLM deepseek_r1 reasoning parser, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. B200/us-central1, B300/uk-south1, H200/us-central1, H200/eu-north1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 have supported probe evidence with at least 10 measured requests per cell. The profile is published as active/testing and kept default-ineligible while customer-facing reasoning quality, cost, and cold-start behavior are reviewed.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

mistralai-devstral-small-2507 · GeneralBenchmarking

Devstral Small 1.1 2507

Devstral Small 1.1 2507 is a public Apache-2.0 text-only coding and software-engineering agent model from Mistral AI and All Hands AI. Hugging Face API metadata checked on 2026-06-02 reports private=false, gated=false, disabled=false, library_name=mistral-common, license:apache-2.0, revision bd165ab26cebbcc2eea2c4ecbfc07f3ac42b3c39, lastModified 2025-08-18T08:14:29Z, and 23,572,403,200 BF16 safetensors parameters. The upstream card describes Devstral as an agentic coding model with Mistral function-calling support, 24-language coverage, a 128K context window, and recommended vLLM serving via tokenizer_mode/config_format/load_format mistral plus tool-call-parser mistral and tensor-parallel-size 2. This active testing non-default Forge profile uses the mirrored official vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, replacing the earlier vLLM 0.21.0 draft image so the path is above the CVE-2026-48746 / GHSA-94f4-hr76-p5j6 affected range before publication. It caps served context to 32768 tokens, clears Hugging Face token environment variables because the artifact is public and ungated, and stores Hugging Face, Transformers, vLLM, and XDG caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-06-02T22Z worker live-upserted the prior TP2 draft as hidden onboarding and verified DB readback, but B200/us-central1 and H200/eu-north2 TP2 probes were blocked by current two-GPU placement capacity. The 2026-06-02T23Z worker mirrored and dry-run validated the vLLM 0.22.0 manifest update. The 2026-06-03T00Z worker rechecked compute inventory and found current B300/H200/RTX6000 cells are one-GPU node pools, so this row uses an explicit tensor-parallel-size 1 fallback. The 2026-06-03T01Z H200/eu-north2 TP1 probe completed startup, 3 warmups, and 10 measured coding-chat requests on the v0.22.0 digest. The 2026-06-03T02Z remaining-cell probe added supported B300/uk-south1 and RTX6000/us-central1 rows on the same digest. The 2026-06-03T03Z B200/us-central1 reprobe succeeded on a different B200 node and persisted a supported row; the same cycle classified L40S/eu-north1 as one-GPU BF16 CUDA OOM during model load. The 2026-06-03T04Z no-persist B200 tool-call smoke forced the Mistral parser to call a read_file function and the response preview contained a structured OpenAI tool_calls entry with arguments {"path":"src/retry.py"}.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

black-forest-labs-flux-1-dev · GeneralOnboarded

FLUX.1 Dev

Black Forest Labs FLUX.1 Dev image generation packaged as NVIDIA Visual GenAI NIM. Supports text-to-image and guided image variants; this Forge entry starts with the single-GPU base text-to-image path.

flux-lora

Best throughput: 1.54 img/son B200
GPUs benchmarked: 6

black-forest-labs-flux-1-kontext-dev · GeneralOnboarded

FLUX.1 Kontext Dev

FLUX.1 Kontext Dev NIM for prompt-driven in-context image editing on a single GPU.

flux-kontext-lora

Best throughput: 1.2 img/son B200
GPUs benchmarked: 6

black-forest-labs-flux-1-schnell · GeneralOnboarded

FLUX.1 Schnell

Distilled FLUX.1 Schnell image generation NIM for fast single-GPU text-to-image experiments.

flux-lora

Best throughput: 1.51 img/son B200
GPUs benchmarked: 6

black-forest-labs-flux-2-dev · GeneralOnboarded

FLUX.2 Dev

FLUX.2 Dev is a gated, non-commercial Black Forest Labs image generation and editing model served through a Forge-owned Diffusers Flux2Pipeline BF16 wrapper. Hugging Face API metadata checked on 2026-06-03T09:05:09Z reports private=false, gated=auto, disabled=false, pipeline_tag=image-to-image, library_name=diffusers, revision 26afe3a78bb242c0a8bb181dcc8937bb16e5c66c, lastModified 2026-02-17T15:56:06.000Z, downloads=320657, likes=1720, license=other, and license_name=flux-non-commercial-license. The model has large safetensors artifacts including flux2-dev.safetensors at 64,446,596,128 bytes plus sharded transformer and text-encoder weights, so the selected wrapper writes Hugging Face, Diffusers, Torch, and XDG caches under /mnt/data. The container build based on pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime and Diffusers commit 4c77dcdbac6c75c8a1fdfaa1657c70f8930c8f3e succeeded with image ID sha256:7608fd3102154b7d8086337266a0c9c3e1f5d00d6b31bae5f6cb06bf829ebee7, size 8,296,349,976 bytes, and mirrored digest sha256:2a033f066904c01d1377b00aa28773fdea0e131da03f51b01ab0bde448acb918 in all active Forge regional registries. The 2026-06-03 GPU matrix supports B200, B300, H200, H100, and RTX6000, with L40S marked OOM. B200 and B300 are the preferred runtime targets for user-facing generation latency.

flux-lora

Best throughput: 0.64 img/son H200
GPUs benchmarked: 6

black-forest-labs-flux-2-klein-4b · GeneralOnboarded

FLUX.2 Klein 4B

Compact FLUX.2 Klein 4B Visual GenAI NIM for efficient text-to-image and image-editing workloads. The NIM is optimized around a hard 1-4 step generation window; Forge exposes the valid prompt, aspect ratio, seed, and step controls for this runtime.

flux-lora

Best throughput: 3.86 img/son B300
GPUs benchmarked: 6

ibm-granite-granite-3-3-8b-instruct · GeneralBenchmarking

Granite 3.3 8B Instruct

IBM Granite 3.3 8B Instruct is a public Apache-2.0 text-generation model for enterprise-style instruction following, coding, reasoning, structured summaries, and long-context document or meeting summarization. Hugging Face API metadata refreshed on 2026-05-27T17Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 51dd4bc2ade4059a6bd87649d68aa11e4fb2529b, and 8,170,864,640 BF16 safetensors parameters. The model config reports GraniteForCausalLM, model_type=granite, BF16 dtype, 40 layers, 32 attention heads, 8 KV heads, hidden size 4096, intermediate size 12800, vocab size 49159, and 131072 max positions. The model card documents 128K context, improved reasoning and instruction-following capabilities, coding and long-context use cases, permissively licensed and synthetic training data sources, and Apache 2.0 licensing. NVIDIA NIM model tables list ibm-granite/granite-3.3-8b-instruct version 1.8.4, but direct OCI inspection of nvcr.io/nim/ibm-granite/granite-3.3-8b-instruct:1.8.4 returned authentication required during this cycle, so this Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9. The profile caps served context to 32768 tokens for first validation, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

ibm-granite-granite-4-1-8b · GeneralBenchmarking

Granite 4.1 8B

IBM Granite 4.1 8B is a public Apache-2.0 dense decoder-only general instruction model released on 2026-04-29 for enterprise assistant, RAG, coding, multilingual dialog, structured JSON, and tool-use workflows. Hugging Face API metadata checked on 2026-06-02T18Z reports private=false, gated=false, disabled=false, library_name=transformers, commit 1504002f650e656a0a3789d99574df12e3e94ed0, 112122 monthly downloads, 184 likes, safetensors BF16 parameters=8791592960, and license:apache-2.0. The model config reports GraniteForCausalLM, model_type=granite, BF16 dtype, 131072 max positions, 40 layers, 32 attention heads, 8 KV heads, hidden size 4096, intermediate size 12800, vocab size 100352, and RoPE theta 10000000. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, because IBM documents Granite 4.1 tool calling, vLLM documents GraniteForCausalLM support, and the selected image is already mirrored to all active Forge regions. The served context is capped at 32768 tokens for bounded one-GPU probes even though the artifact advertises 131072 positions. Tool calling is enabled with the vLLM granite4 parser because IBM and vLLM document Granite 4 tool-call support. The 2026-06-01T08Z hidden live upsert succeeded, retained Forge probes support B200/us-central1 and B300/uk-south1 with 3 warmups plus 10 measured chat requests per cell, and a no-persist B200 tool-parser smoke returned an OpenAI tool_calls response. The 2026-06-01T16Z retained H200/eu-north2 probe also passed with 3 warmups plus 10 measured chat requests, startup 436250 ms, p50 794 ms, p95 874 ms, p99 880 ms, 238.04 tokens/s, 120.07 GB VRAM, and vLLM logs showing FLASH_ATTN with FlashAttention 3. The 2026-06-02T18Z retained L40S/eu-north1 probe passed with 3 warmups plus 10 measured chat requests, startup 105888 ms, p50 2935 ms, 64.74 tokens/s, and 37.86 GB VRAM. The manifest advances to active/testing/non-default for B200, B300, H200, and L40S while RTX6000 stays disabled because current public inventory did not advertise an RTX6000 cell.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

nvidia-healthcare-text2sql · Healthcare / Life ScienceBenchmarking

Llama 3.1 Nemotron Nano 8B Healthcare Text2SQL

NVIDIA's Llama 3.1 Nemotron Nano 8B Healthcare Text2SQL NIM translates natural-language healthcare analytics questions plus DDL into SQL. This Forge candidate is useful for nonclinical self-service analytics and research data exploration over de-identified clinical schemas. It uses digest-pinned regional mirrors and must remain hidden and default-ineligible until GPU probes pass and healthcare product safety review approves nonclinical positioning. It must not be used for medical advice, clinical decision-making, diagnosis, treatment, triage, or patient-specific record interpretation.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

meta-llama-3.1-70b-instruct · GeneralBenchmarking

Meta Llama 3.1 70B Instruct

Flagship general-purpose chat model for instruction following and structured generation.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

meta-llama-3-1-8b-instruct · GeneralBenchmarking

Meta Llama 3.1 8B Instruct

General-purpose Llama 3.1 8B chat model for instruction following, summarization, and lightweight agent workflows.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

meta-llama-3-2-1b-instruct · GeneralBenchmarking

Meta Llama 3.2 1B Instruct

Small, low-latency Llama 3.2 chat model packaged as an NVIDIA NIM for fast instruction-following examples.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

meta-llama-3-2-3b-instruct · GeneralBenchmarking

Meta Llama 3.2 3B Instruct

Compact Llama 3.2 instruction model with a stronger quality/latency tradeoff than the 1B variant.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

microsoft-phi-3-mini-4k-instruct · GeneralBenchmarking

Microsoft Phi-3 Mini 4K Instruct

Microsoft Phi-3 Mini 4K Instruct packaged as an NVIDIA NIM and mirrored into Forge regional registries.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

microsoft-phi-4-mini-instruct · GeneralBenchmarking

Microsoft Phi-4 Mini Instruct

Microsoft Phi-4 Mini Instruct is a public MIT-licensed 3.8B-parameter small language model for instruction following, multilingual chat, code-oriented prompts, and tool/function-calling formats. Microsoft and Hugging Face metadata identify the artifact as text-generation, safetensors, and MIT licensed, while NVIDIA NIM documents the self-hosted nvcr.io/nim/microsoft/phi-4-mini-instruct:latest container and OpenAI-compatible /v1/chat/completions usage. This Forge onboarding entry reuses the already mirrored NVIDIA NIM image digest sha256:e5ea112a599102ddd0dba60aea603a5e631ed16d2b7e4c467fb4c33c3e30ff7d in all four Forge regional registries, routes the NIM cache to the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights, and has live 10-run warm probes supporting B200, H200, L40S, and RTX6000. B300/uk-south1 remains disabled after 2026-05-23T11Z triage showed the pinned NIM 1.12.0 image is not SM103-ready: default startup fails on PyTorch/TensorRT-LLM arch detection for 10.3+PTX, and a TORCH_CUDA_ARCH_LIST fallback progresses to model load but aborts in Triton codegen for sm_103.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

microsoft-phi-4-mini-reasoning · GeneralBenchmarking

Microsoft Phi-4 Mini Reasoning

Microsoft Phi-4 Mini Reasoning is a public MIT-licensed 3.8B-parameter dense text-generation model for multi-step mathematical reasoning, symbolic problem solving, formal proof-style prompts, and compact reasoning use cases where latency and GPU footprint matter. Hugging Face API metadata refreshed on 2026-05-27 reports private=false, gated=false, disabled=false, library_name=transformers, pipeline_tag=text-generation, license=mit, revision 0e3b1e2d02ee478a3743abe3f629e9c0cb722e0a, Phi3ForCausalLM architecture, 131,072 native max positions, and 3,836,021,760 BF16 safetensors parameters. The Hugging Face model page exposes generated vLLM instructions using `vllm serve "microsoft/Phi-4-mini-reasoning"` and OpenAI-compatible `/v1/chat/completions`. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by Forge's shared /mnt/data cache. The 2026-05-27T01Z live matrix supports B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured chat requests per cell. Keep the model active/testing, non-default, and 32K-capped until product review validates whether this math-specialized Phi profile should be exposed in broader general routing or receive a separate 128K long-context profile.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

openbmb-minicpm4-8b · GeneralBenchmarking

MiniCPM4 8B

MiniCPM4 8B is a public Apache-2.0 OpenBMB text-generation model for English/Chinese chat, instruction following, long-context summarization, and efficient edge-oriented general LLM workloads. Hugging Face API metadata checked on 2026-05-29T07Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=apache-2.0, revision bb2ae14cf59d4ca769c4e42ece54cc3b82a58ef7, lastModified 2025-10-24T08:32:25Z, and 8,185,253,888 BF16 safetensors parameters. The artifact config reports MiniCPMForCausalLM, model_type=minicpm, BF16 dtype, 32 layers, 32 attention heads, 2 KV heads, hidden size 4096, 73,448 vocab entries, 32,768 max positions, and an auto_map remote-code contract. The upstream model card documents OpenAI-compatible vLLM serving for openbmb/MiniCPM4-8B, notes that vLLM chat should send add_special_tokens=true, and states that MiniCPM4 natively supports 32,768-token context with optional longer-context RoPE modifications. This Forge profile intentionally uses the native 32K context only, enables --trust-remote-code, reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, clears unnecessary Hugging Face token env for this public ungated artifact, stores Hugging Face hub and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache, and moves Transformers dynamic modules to a writable /tmp path after the 08Z B200 diagnostic found PermissionError under /opt/nim/.cache/huggingface/modules. The 10Z L40S reprobe confirmed that cache mitigation and completed a five-cell hidden support matrix; keep hidden onboarding and default-ineligible until publication review.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

openbmb-minicpm5-1b · GeneralBenchmarking

MiniCPM5 1B

MiniCPM5 1B is a public Apache-2.0 compact causal language model from OpenBMB for local assistants, coding agents, tool-use style prompts, hybrid reasoning, English/Chinese chat, and long-context summarization. Hugging Face metadata checked on 2026-05-29T18Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 4e9de7a0778dc1c362e983e6858f0e77542cbdca, and Apache-2.0 card metadata. The model card reports 1,080,632,832 parameters, 24 layers, 16 query heads, 2 KV heads, 131,072-token native context, standard LlamaForCausalLM architecture, a vLLM quickstart requiring vLLM >= 0.21, and separate Think/No Think chat-template modes. This Forge profile reuses the already mirrored vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps first serving to 32,768 tokens for bounded GPU probes, leaves tool-calling disabled because upstream recommends SGLang's MiniCPM5 parser for native tool-call conversion, clears unnecessary Hugging Face token environment variables, and stores Hugging Face/vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-28T17Z L40S/eu-north1 smoke passed with 3 warmups and 10 measured chat requests; the 2026-05-28T18Z remaining schedulable-cell probe validated B300/uk-south1 and H200/eu-north2; the 2026-05-28T22Z missing-cell probe validated B200/us-central1; and the 2026-05-29T18Z L40S emptyDir fallback probe validated the exposed Thinking mode control with 3 warmups and 10 measured requests. Keep active/testing and non-default until RTX6000 inventory returns and separate long-context/tool-use gates pass.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

mistralai-ministral-3-3b-instruct-2512 · GeneralBenchmarking

Ministral 3 3B Instruct 2512

Ministral 3 3B Instruct 2512 is a public Apache-2.0 compact instruction model from Mistral AI for lightweight chat, multilingual generation, JSON-style extraction, and agentic prompts. Hugging Face metadata refreshed on 2026-06-05T02Z reports private=false, gated=false, disabled=false, library_name=vllm, license=apache-2.0, revision cfcb068fa7c44114cf77a462357c6cdcd2c304b4, safetensors total parameters 3,849,090,048, and tags for vllm, mistral3, mistral-common, fp8, and safetensors. The model card describes an FP8 instruct checkpoint with a 3.4B language model plus 0.4B vision encoder, multilingual support, native function calling and JSON outputting, and a 256k context window. This active Forge testing profile intentionally exposes only text chat, caps served context to 32,768 tokens for the validated product surface, leaves tool and image inputs disabled until separate validation, clears unnecessary Hugging Face token variables for this public artifact, and uses the already mirrored vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416 with versioned Hugging Face and vLLM caches under the reviewed /opt/nim/.cache mount. The 2026-06-04T09Z live row includes explicit non-root runtime UID/GID 65532 so the stock vLLM image can satisfy Forge runAsNonRoot policy, and B300/uk-south1 passed a persisted hostPath probe with 3 warmups and 10 measured requests. The 2026-06-04T12Z remaining standard hostPath matrix persisted B200/us-central1, H200/us-central1, H200/eu-north1, H200/eu-north2, and L40S/eu-north1 as supported for the text-chat profile with 3 warmups and 10 measured requests per cell. RTX6000 was requested but not present in discovered inventory. The 2026-06-05T02Z publication pass promotes this text-chat surface to active/testing/non-default while keeping default routing, RTX6000, image input, and tool calling blocked pending separate validation.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

mistralai-mistral-7b-instruct-v0-3 · GeneralBenchmarking

Mistral 7B Instruct v0.3

Mistral 7B Instruct v0.3 packaged as an NVIDIA NIM and mirrored into Forge regional registries.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

nvidia-llama-3-1-nemoguard-8b-content-safety · GeneralBenchmarking

NVIDIA Llama 3.1 NemoGuard 8B Content Safety

NVIDIA Llama 3.1 NemoGuard 8B Content Safety packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

nvidia-llama-3-1-nemoguard-8b-topic-control · GeneralBenchmarking

NVIDIA Llama 3.1 NemoGuard 8B Topic Control

NVIDIA Llama 3.1 NemoGuard 8B Topic Control packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

nvidia-llama-3-1-nemotron-nano-8b-v1 · GeneralBenchmarking

NVIDIA Llama 3.1 Nemotron Nano 8B v1

NVIDIA Llama 3.1 Nemotron Nano 8B v1 packaged as an NVIDIA NIM and mirrored into Forge regional registries.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

nvidia-nemotron-nano-9b-v2 · GeneralBenchmarking

NVIDIA Nemotron Nano 9B v2

NVIDIA Nemotron Nano 9B v2 packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

allenai-olmo-3-7b-instruct · GeneralBenchmarking

OLMo 3 7B Instruct

AllenAI OLMo 3 7B Instruct is a public, ungated Apache-2.0 English instruction-following model in the fully open OLMo 3 family, aimed at chat, tool-use style prompts, coding, math, general reasoning, and long-context workflows. Live Hugging Face metadata checked on 2026-05-30T14Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 096bb5469fe34348bc88d851a69edb3bf6f40df4, license:apache-2.0, and 7,298,011,136 BF16 safetensors parameters. The artifact config reports Olmo3ForCausalLM, model_type=olmo3, 65,536 native max positions with YaRN rope scaling from 8,192, 32 layers, 32 attention heads, 32 KV heads, hidden size 4096, intermediate size 11008, vocab size 100278, and no auto_map. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps first serving to 32,768 tokens for bounded probes, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face/vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-30T15Z live matrix validated B200, B300, H200, L40S, and RTX6000 with 3 warmups and 10 measured warm chat-completion requests per cell, so this source manifest is advanced to active/testing/non-default while leaving default routing unchanged.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

allenai-olmo-2-0425-1b-instruct · GeneralOnboarded

OLMo-2 1B Instruct

AllenAI OLMo-2 1B Instruct is a public Apache-2.0 compact instruction-following text-generation model derived from the OLMo-2 0425 1B line and post-trained with supervised fine-tuning, DPO, and RLVR data for chat, math, GSM8K-style reasoning, and IFEval-style instruction following. Hugging Face API metadata refreshed on 2026-05-27T07Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 48d788eca847d4d7548f375ad03d3c9312f6139e, and 1,484,916,736 BF16 safetensors parameters. The artifact config reports the standard Olmo2ForCausalLM architecture, model_type=olmo2, BF16 dtype, 16 layers, 16 attention heads, hidden size 2048, intermediate size 8192, 100,352 vocab entries, and 4,096 max positions with no auto_map. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, serves the native 4,096-token context, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-27T07Z live matrix supports B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured chat requests per cell. Keep active/testing and non-default until product review decides whether this open-science small model should be exposed in broader general routing.

lora

Best throughput: 8,679 tok/son H100
GPUs benchmarked: 2

allenai-olmo-2-1124-7b-instruct · GeneralBenchmarking

OLMo-2 7B Instruct

AllenAI OLMo-2 1124 7B Instruct is a public Apache-2.0 English instruction-following and chat model in the OLMo-2 open-science family. Hugging Face API metadata refreshed on 2026-05-27T10Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 470b1fba1ae01581f270116362ee4aa1b97f4c84, and 7,298,617,344 BF16 safetensors parameters. The model config reports Olmo2ForCausalLM, model_type=olmo2, BF16 dtype, 32 layers, 32 attention heads, 32 KV heads, hidden size 4096, vocab size 100352, and 4096 max positions. The model card documents Apache 2.0 licensing, OLMo-specific SFT, DPO, and RLVR post-training, OpenAI-compatible vLLM serving instructions, and a note that some fine-tuning data includes outputs from third-party models subject to additional terms. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, serves the native 4096-token context, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. It remains onboarding and default-ineligible until live GPU probes produce support and latency evidence.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

openai-gpt-oss-20b · GeneralBenchmarking

OpenAI gpt-oss 20B

OpenAI gpt-oss 20B is a public Apache-2.0 open-weight text reasoning model for instruction following, coding, tool-use style prompts, structured outputs, and agentic workflows. Hugging Face reports the artifact as public and ungated with 21,511,953,984 safetensors parameters, GptOssForCausalLM architecture, native MXFP4 MoE quantization, and 131,072-token max positions. This Forge profile uses the already mirrored vLLM 0.21.0 CUDA 13 OpenAI-compatible image, caps served context to 32,768 tokens for bounded probes, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-24T08Z B200 canary reached vLLM 0.21.0 engine startup but failed CUDA initialization while the manifest forced VLLM_ENABLE_CUDA_COMPATIBILITY=1; this manifest now leaves the CUDA 13 image default compatibility mode in place. The 2026-05-24T10Z full matrix probe succeeded on B200, B300, H200, and L40S with 3 warmups and 10 measured chat requests per supported cell. A focused 2026-05-24T20Z RTX6000 reprobe resolved the prior ReadError, reached HTTP 200, and completed 10 measured warm chat requests on the same digest. The checked-in manifest is active, testing, and non-default so operators can select the five-GPU profile without replacing any default route.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-2-5-7b-instruct · GeneralBenchmarking

Qwen 2.5 7B Instruct

Qwen 2.5 7B instruction model packaged as an NVIDIA NIM, useful for multilingual and coding-adjacent chat workloads.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen2-5-7b-instruct · GeneralBenchmarking

Qwen2.5 7B Instruct

Qwen2.5 7B Instruct is a public, ungated Apache-2.0 general instruction-following chat model for multilingual customer support, summarization, reasoning, and long-context text workflows. Hugging Face metadata rechecked on 2026-06-02 reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision a09a35458c702b33eeacc393d103063234e8bc28, and 7,615,616,512 BF16 safetensors parameters. The artifact config reports Qwen2ForCausalLM, model_type=qwen2, BF16 weights, 32,768 configured max positions, 28 layers, 28 attention heads, 4 KV heads, hidden size 3584, intermediate size 18944, vocab size 152064, rope_theta 1000000.0, and no auto_map. The model card lists Apache-2.0 licensing, states the current config is set for 32,768-token context, and recommends vLLM for deployment. This active/testing non-default Forge profile reuses the already mirrored Forge vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, clears unnecessary Hugging Face token env for this public artifact, stores Hugging Face and vLLM caches under versioned /mnt/data paths, and is backed by a 2026-06-02 matrix with B200, B300, H100, H200, L40S, and RTX6000 support. Keep default_eligible=false until product routing deliberately chooses a new general-chat default.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen2-5-coder-7b-instruct · GeneralBenchmarking

Qwen2.5-Coder 7B Instruct

Qwen2.5-Coder 7B Instruct is a public, ungated Apache-2.0 code-specialized instruction model for code generation, code repair, code reasoning, agentic developer workflows, and long-context code assistance. Hugging Face metadata checked on 2026-06-01 reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision c03e6d358207e414f1eca0bb1891e29f1db0e242, and 7,615,616,512 BF16 safetensors parameters. The config reports Qwen2ForCausalLM, model_type=qwen2, 32,768 max positions, 28 layers, 28 attention heads, 4 KV heads, hidden size 3584, intermediate size 18944, vocab size 152064, BF16 weights, and no auto_map or quantization_config. The model card describes Qwen2.5-Coder as a code-specific model family for developers, code agents, generation, reasoning, and fixing, notes the current config is set for 32,768-token context, and recommends vLLM for deployment. This active testing, non-default profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, reflects 2026-06-01T02Z five-cell GPU support evidence, clears unnecessary Hugging Face token env, and stores Hugging Face and vLLM caches under versioned /mnt/data paths. Keep default_eligible=false until product routing explicitly chooses this compact code-specialized model as a default code route.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-0-6b · GeneralBenchmarking

Qwen3 0.6B

Qwen3 0.6B is a public Apache-2.0 dense text-generation model for lightweight instruction following, short reasoning, coding prompts, tool-use style prompts, multilingual chat, and low-cost Forge smoke workloads. Hugging Face API metadata refreshed on 2026-05-28T08Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision c1899de289a04d12100db370d81485cdf75e47ca, Qwen3ForCausalLM architecture, 40,960 max positions, and 751,632,384 BF16 safetensors parameters. The upstream model card documents 32,768-token context and vLLM serving for Qwen/Qwen3-0.6B. This Forge profile uses the already mirrored official vLLM 0.22.0 CUDA 13 OpenAI-compatible image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416 to move the active non-default route above the GHSA-94f4-hr76-p5j6 / CVE-2026-48746 affected vLLM OpenAI API range. The profile caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-06-03T05Z supervisor cycle live-upserted the vLLM 0.22.0 manifest and persisted a B300/uk-south1 probe with 3 warmups, 10 measured chat requests, HTTP 200, p50 286 ms, and 653.85 tokens/s. The 2026-06-03T12Z supervisor cycle reran L40S/eu-north1 on the same vLLM 0.22.0 digest after a prior pod_deleted attempt; the retained reprobe passed with startup 572,969 ms, p50 462 ms over 10 measured chat requests, 454.55 tokens/s, 26.96 GB VRAM, and HTTP 200. B200 and H200 still need current vLLM 0.22.0 reprobes before treating the full historical matrix as refreshed. A targeted RTX6000 probe could not schedule because the visible RTX6000 nodes were tainted node.cluster.x-k8s.io/uninitialized:NoSchedule, so RTX6000 is not claimed in gpu_compatibility.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-1-7b · GeneralBenchmarking

Qwen3 1.7B

Qwen3 1.7B is a public Apache-2.0 dense text-generation model for lightweight reasoning, instruction following, coding, tool-use prompts, multilingual chat, and long-context smoke workloads. Hugging Face API metadata refreshed on 2026-05-26T09Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 70d244cc86ccca08cf5af4e1e306ecf908b1ad5e, Qwen3ForCausalLM architecture, 40,960 max positions, and 2,031,739,904 BF16 safetensors parameters. The upstream model card documents native 32,768-token context and vLLM serving for Qwen/Qwen3-1.7B. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-26T07Z live onboarding matrix completed 3 warmups and 10 measured chat requests on all five active GPU cells with client-wall median p50s: B200/us-central1 832 ms, B300/uk-south1 658 ms, H200/eu-north2 778 ms, L40S/eu-north1 1958 ms, and RTX6000/us-central1 1308 ms. B300/SM103 started successfully on the same CUDA 13 digest and selected FlashInfer attention. The 2026-05-26T09Z worker promotes the model to active/testing while keeping default_eligible=false so customers can target the direct slug without changing broader general-model routing policy.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-14b · GeneralBenchmarking

Qwen3 14B

Qwen3 14B is a public Apache-2.0 dense text-generation model for hybrid reasoning, instruction following, coding, agent-style prompts, multilingual chat, and long-context summarization. Hugging Face API metadata refreshed on 2026-05-27T06Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 40c069824f4251a91eefaf281ebe4c544efd3e18, Qwen3ForCausalLM architecture, 40,960 max positions, and 14,768,307,200 BF16 safetensors parameters. The upstream model card documents native 32,768-token context with 131,072-token YaRN extension guidance plus vLLM serving for Qwen/Qwen3-14B. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-27T06Z matrix completed 3 warmups and 10 measured chat requests on all five active GPU cells with client-wall median p50s: B200/us-central1 6400 ms, B300/uk-south1 6420 ms, H200/eu-north2 8445 ms, L40S/eu-north1 39942 ms, and RTX6000/us-central1 20898 ms. This worker promotes the CUDA 13 variant to active/testing while keeping default_eligible=false so customers can target the direct slug without changing default general Qwen routing policy.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-14b · GeneralBenchmarking

Qwen3 14B

Qwen3 14B is a public Apache-2.0 dense text-generation model for hybrid reasoning, instruction following, multilingual chat, agent-style prompts, and long-context summarization. Hugging Face API metadata reports private=false, gated=false, disabled=false, 14,768,307,200 BF16 safetensors parameters, architecture Qwen3ForCausalLM, and text-generation usage. The upstream model card reports 14.8B total parameters, 40 layers, grouped-query attention with 40 Q heads and 8 KV heads, native 32,768-token context, and 131,072-token YaRN extension guidance. This hidden Forge profile reuses the already mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db, enables the vLLM qwen3 reasoning parser documented for vLLM 0.10.2, caps served context to 32,768 tokens for bounded probes, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-27T05Z hidden matrix supported H200, L40S, and RTX6000, but B200 and B300 failed during pod startup, so the profile remains hidden/onboarding and default-ineligible pending Blackwell runtime investigation.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-30b-a3b-instruct-2507 · GeneralBenchmarking

Qwen3 30B A3B Instruct 2507 BF16

Qwen3 30B A3B Instruct 2507 BF16 is a public Apache-2.0 MoE chat model for general instruction following, coding, multilingual knowledge, long-context understanding, and tool-use style prompts. Hugging Face API metadata refreshed on 2026-05-26T16Z reports private=false, gated=false, disabled=false, commit 0d7cf23991f47feeb3a57ecb4c9cee8ea4a17bfe, qwen3_moe architecture, Qwen3MoeForCausalLM, 30,532,122,624 BF16 safetensors parameters, and text-generation usage. The upstream config reports 48 layers, 32 query heads, 4 KV heads, 128 experts with 8 active experts, and a native 262,144-token context. This Forge profile is the BF16/non-FP8 fallback for the sibling FP8 profile whose B200 startup fails in Torch/vLLM/CUTLASS cutlass_scaled_mm dispatch. It reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded first probes, reduces profile-run pressure with max-num-batched-tokens 4096 and max-num-seqs 2, clears unnecessary Hugging Face token env vars for this public ungated artifact, and disables FlashInfer BF16 MoE while forcing Triton MoE because the default B200 BF16 path loaded the checkpoint but failed when FlashInfer tried to write cubin symlinks under the read-only image filesystem. The 2026-05-26T16Z matrix completed 3 warmups and 10 measured requests on B200/us-central1, B300/uk-south1, H200/eu-north2, H200/us-central1, and RTX6000/us-central1 with HTTP 200 OpenAI-compatible chat output. L40S/eu-north1 failed before readiness with torch.OutOfMemoryError while allocating Qwen3 MoE weights on a 44.39 GiB GPU. Keep the model hidden onboarding/non-default until product review decides whether the BF16 fallback should be exposed alongside the faster FP8 sibling.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-30b-a3b-instruct-2507 · GeneralBenchmarking

Qwen3 30B A3B Instruct 2507 FP8

Qwen3 30B A3B Instruct 2507 FP8 is a public Apache-2.0 non-thinking MoE chat model for general instruction following, coding, multilingual knowledge, long-context understanding, and tool-use style prompts. Hugging Face API metadata refreshed on 2026-05-26 reports private=false, gated=false, disabled=false, commit 5a5a776300a41aaa681dd7ff0106608ef2bc90db, qwen3_moe architecture, Qwen3MoeForCausalLM, 30,533,947,392 safetensors parameters, F8_E4M3 plus BF16 tensors, and text-generation usage. The model card documents 30.5B total parameters, 3.3B activated parameters, 128 experts with 8 active experts, a native 262,144-token context, Apache-2.0 licensing, fine-grained FP8 quantization with 128x128 weight blocks, and OpenAI-compatible vLLM/SGLang serving. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded first probes, clears unnecessary Hugging Face token env for this public ungated artifact, records H200, L40S, and RTX6000 as supported from the 2026-05-26T11Z partial matrix, and adds B300 support from the 2026-05-26T12Z DeepGEMM fallback probe. B200 remains disabled because the 2026-05-26T13Z retained diagnostic reproduced the failure on a fresh B200 node and captured the startup traceback: vLLM reaches Qwen3MoeForCausalLM profile_run, then Torch Inductor executes torch.ops._C.cutlass_scaled_mm.default and aborts with RuntimeError: dispatch_scaled_mm at /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_helper.hpp:17.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-4b-instruct-2507 · GeneralBenchmarking

Qwen3 4B Instruct 2507

Qwen3 4B Instruct 2507 is a public Apache-2.0 text-generation model for instruction following, coding, multilingual knowledge, tool-use style prompts, and long-context understanding. The upstream card reports 4.0B parameters, 36 layers, grouped-query attention, and a native 262,144-token context window; this initial Forge onboarding variant caps vLLM at 32,768 tokens to keep first probes bounded across current GPU cells. It uses the already mirrored official vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db after the Qwen3-Coder thread showed vLLM 0.21.0 CUDA 13 failing on the current B200 driver stack with cudaGetDeviceCount error 803. Hidden probes now support B200/us-central1 with p50 881 ms client wall time, H200/eu-north2 with p50 900 ms client wall time, L40S/eu-north1 with p50 2348 ms client wall time, and RTX6000/us-central1 with p50 1780 ms client wall time. B300/uk-south1 is still disabled because the same vLLM image fails on SM103 during Torch/Triton compile with ptxas rejecting gpu-name sm_103a. The runtime passes both the serve target and explicit --model argument, selected Flash Attention on RTX6000 during the 2026-05-23T14Z probe, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-4b-instruct-2507 · GeneralBenchmarking

Qwen3 4B Instruct 2507

Qwen3 4B Instruct 2507 is a public Apache-2.0 text-generation model for instruction following, coding, multilingual knowledge, tool-use style prompts, and long-context understanding. Hugging Face API metadata refreshed on 2026-05-24T04Z reports private=false, gated=false, disabled=false, commit cdbee75f17c01a7cc42f958dc650907174af0554, Qwen3ForCausalLM architecture, 4,022,468,096 BF16 safetensors parameters, and text-generation usage. This Forge candidate isolates the vLLM 0.21.0 CUDA 13 runtime from the existing vLLM 0.10.2 CUDA 12.8 profile. It uses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, passes only the positional model id because the image entrypoint is already `vllm serve`, and does not override the image default VLLM_ENABLE_CUDA_COMPATIBILITY=0. Live Forge probes support all five GPU cells for this CUDA 13 variant: B300/uk-south1 from the 2026-05-23T15Z probe, plus B200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 from the 2026-05-23T16Z remaining-cell probe. The 2026-05-24T04Z publication review keeps the checked-in manifest active, testing, non-default, and non-latest for operator-selected CUDA 13 usage while preserving the separate CUDA 12.8 profile and a 32,768-token served-context cap.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-8b · GeneralBenchmarking

Qwen3 8B

Qwen3 8B is a public Apache-2.0 dense text-generation model for hybrid reasoning, instruction following, coding, agent-style prompts, multilingual chat, and long-context summarization. Hugging Face API metadata refreshed on 2026-05-26T05Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision b968826d9c46dd6066d109eabc6255188de91218, Qwen3ForCausalLM architecture, 40,960 max positions, and 8,190,735,360 BF16 safetensors parameters. The upstream model card documents native 32,768-token context with 131,072-token YaRN extension guidance plus vLLM serving for Qwen/Qwen3-8B. This Forge profile uses the already mirrored official vLLM 0.22.0 CUDA 13 OpenAI-compatible image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-06-03T05Z supervisor cycle moved this active testing route off vLLM 0.21.0 because GHSA-94f4-hr76-p5j6 / CVE-2026-48746 is patched in vLLM 0.22.0, verified all four regional mirrors, live-upserted the patched manifest, and ran a focused B300/uk-south1 no-persist smoke before keeping the route active. The historical 2026-05-26T05Z and 2026-05-26T10Z vLLM 0.21.0 onboarding matrices completed 3 warmups and 10 measured chat requests on all five active GPU cells with client-wall median p50s: B200/us-central1 1879 ms, B300/uk-south1 1809 ms, H200/eu-north2 2201 ms, L40S/eu-north1 9224 ms, and RTX6000/us-central1 5706 ms. The route remains active/testing with default_eligible=false so customers can target the direct slug without changing broader general-model default routing policy.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

qwen-qwen3-coder-30b-a3b-instruct-fp8 · GeneralBenchmarking

Qwen3-Coder 30B A3B Instruct FP8

Qwen3-Coder 30B A3B Instruct FP8 is a public Apache-2.0 coding-focused mixture-of-experts model for agentic coding, browser-use style tasks, repository-scale prompts, tool-call style workflows, and general code generation. Hugging Face metadata refreshed on 2026-05-31 reports the model as public and ungated with 30,533,947,392 safetensors parameters, Qwen3MoeForCausalLM architecture, and fine-grained FP8 quantization; the model card reports 30.5B total parameters, 3.3B activated parameters, 128 experts, 8 active experts, and a native 262,144-token context window. This Forge onboarding profile uses the mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db after the vLLM 0.21.0 CUDA 13 image failed on a B200 node with CUDA driver error 803. The runtime caps served context to 32,768 tokens to match the upstream OOM mitigation guidance, passes the model both as the vLLM serve target and explicit engine --model argument, enables the vLLM qwen3_coder tool-call parser observed in the vLLM 0.10.2 API server, sets max-num-seqs to 16 to avoid the Blackwell warmup shape failure seen with 4, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Current probes validate this corrected CUDA 12.8 path with 3 warmups and 10 measured chat requests on B200/us-central1 p50 3916 ms, H200/eu-north2 p50 4292 ms, L40S/eu-north1 p50 7616 ms, and RTX6000/us-central1 p50 7232 ms. B300/uk-south1 remains disabled because the 2026-05-31 13Z SM103 probe pulled the regional image successfully but the model container terminated before readiness with `pod_start_failure:model:terminated:Error`. Keep the row hidden and non-default until B300 is fixed or explicitly classified unsupported and product review decides whether to expose this CUDA 12.8 FP8 Coder profile.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

bytedance-seed-oss-36b-instruct · GeneralBenchmarking

Seed-OSS 36B Instruct

Seed-OSS 36B Instruct is a public Apache-2.0 dense causal language model from ByteDance Seed for long-context reasoning, coding, agentic workflows, tool-use style prompts, and international English/Chinese usage. Hugging Face API metadata refreshed on 2026-06-05T05Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 497f1dca95ebdec98e41d517b9f060ee753c902f, lastModified 2025-08-26T02:33:00Z, 40,027 downloads, 499 likes, and 36,151,104,512 BF16 safetensors parameters. The upstream config reports SeedOssForCausalLM, model_type seed_oss, 64 layers, hidden size 5120, 80 attention heads, 8 KV heads, 524,288 max positions, BF16 weights, and no remote-code auto_map. The model card states Apache-2.0 licensing, 36B parameters, GQA, a 512K context, flexible thinking budget control, reasoning, agentic intelligence, and vLLM usage. vLLM latest documentation lists SeedOssForCausalLM with ByteDance-Seed/Seed-OSS-36B-Instruct as supported, and the Seed-OSS recipe recommends vLLM serving with --enable-auto-tool-choice and --tool-call-parser seed_oss while also warning that recent support may require main-branch bits. This hidden Forge manifest uses the already mirrored official vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, caps served context to 32768 for initial probes, sets explicit non-root UID/GID 65532 for the stock vLLM image, and clears unnecessary Hugging Face tokens because the artifact is public and ungated. Persisted probes now support B200/us-central1, B300/uk-south1, and H200/us-central1 with 3 warmups and 10 measured requests each. The B300/SM103 runtime selected SeedOssForCausalLM, the seed_oss tool parser, FLASHINFER attention, HND KV cache layout, TRTLLM prefill attention, CUDA graph capture, and a 679k-token GPU KV cache before serving HTTP 200 responses. The H200 runtime selected FLASH_ATTN, loaded 67.48 GiB of weights, allocated 53.56 GiB KV cache, exposed 219,360 GPU KV tokens, and completed HTTP 200 chat responses. L40S and RTX6000 remain false until their own probes validate startup, memory headroom, and kernel dispatch.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

huggingfacetb-smollm3-3b · GeneralBenchmarking

SmolLM3 3B

SmolLM3 3B is a public Apache-2.0 small general language model from Hugging Face for instruction following, hybrid reasoning, multilingual chat, tool-use style prompts, and long-context summarization. Hugging Face API metadata reports private=false, gated=false, disabled=false, 3,075,098,624 BF16 safetensors parameters, and text-generation usage. The model card describes SmolLM3 as an instruct model with dual-mode reasoning, a 64k trained context, optional 128k YaRN extrapolation, recommended temperature 0.6 and top_p 0.95, and vLLM deployment with the Hermes tool-call parser. vLLM 0.10.2 supported-model docs list SmolLM3ForCausalLM and HuggingFaceTB/SmolLM3-3B through the Transformers backend. This Forge profile reuses the already mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db, caps served context to 32,768 tokens for bounded first probes, disables thinking by default for predictable short responses, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Focused probes on 2026-05-22T22Z through 2026-05-23T02Z support B200/us-central1, L40S/eu-north1, H200/eu-north2, and RTX6000/us-central1 with 10 measured warm chat requests per supported cell; B300/uk-south1 remains disabled because the same vLLM CUDA 12.8 stack fails during Torch/Triton compile on SM103 with ptxas rejecting gpu-name sm_103a.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0

huggingfacetb-smollm3-3b · GeneralBenchmarking

SmolLM3 3B

SmolLM3 3B is a public Apache-2.0 small general language model from Hugging Face for instruction following, hybrid reasoning, multilingual chat, tool-use style prompts, and long-context summarization. Hugging Face API metadata refreshed on 2026-05-24T07Z reports private=false, gated=false, disabled=false, commit a07cc9a04f16550a088caea529712d1d335b0ac1, 3,075,098,624 BF16 safetensors parameters, SmolLM3ForCausalLM architecture, and text-generation usage. This CUDA 13 variant is split from the supported vLLM 0.10.2 CUDA 12.8 Forge profile to expose the newer vLLM 0.21.0 CUDA 13 runtime without overwriting that profile's matrix. It reuses the already mirrored vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, disables thinking by default for predictable short responses, uses the image's built-in `vllm serve` entrypoint with only the positional model id, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The CUDA 13 runtime has live supported evidence on all five Forge GPU cells: B300/uk-south1 from the 2026-05-23T04Z probe, plus B200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 from the 2026-05-23T12Z probe. The 2026-05-24T07Z publication review promotes this checked-in manifest to active, testing, non-default, and non-latest for operator-selected CUDA 13 usage while preserving the separate CUDA 12.8 profile and a 32,768-token served-context cap.

lora

Best throughput: PendingBenchmark in progress
GPUs benchmarked: 0