Public Preview Model CatalogPublic previewLive capacity on Status

Nebius Forge

A public catalog and playground for curated AI models on Nebius on-demand GPUs. Routing is automatic, region-aware, and GPU-pinned when you need it.

Public preview

Forge is open for early testing and real workloads, but it is still a beta surface. Model coverage, routes, benchmarks, and APIs may change while the platform hardens.

Start with API-ready models Read API docs

Regional degradation detected

Forge reports 2 regions with degraded or down status. Requests pinned to affected regions may fail; use automatic routing or check Status for details.

degraded

Find a model

Search by model, workload, or GPU. Open a model to run it or inspect its verified targets.

⌕

APIGPUSort

159 model families · Newest first

Healthcare / Life Science

coldNative inference

LAMMPS/Kokkos Molecular Dynamics

LAMMPS/Kokkos Molecular Dynamics is a live-tested Forge runtime for bounded nonclinical LAMMPS simulations with Kokkos/CUDA acceleration. The wrapper runs a default Lennard-Jones argon smoke input or a user-supplied LAMMPS input/data bundle, writes artifacts under /mnt/data, and parses LAMMPS throughput as ns/day when available. It is not force-field-equivalent to GROMACS, OpenMM, AMBER, NAMD, or CHARMM results unless the scientific setup is independently matched and validated.

Fastest verified: RTX6000 in us-central1
Performance: 705 ms

Healthcare / Life Science

coldOpenAI SDK

Carbon 3B

Carbon-3B is a public Apache-2.0 generative DNA foundation model from Hugging Face Bio. The Hugging Face API checked on 2026-06-05 reports private=false, gated=false, disabled=false, library_name=transformers, revision fe755cb5c7498acbf630080609ef61ecc4e36c17, and 3,451,444,224 BF16 safetensors parameters. Its model card describes a decoder-only autoregressive genomic model for DNA and RNA, native 32,768-token context, and vLLM compatibility. The stock model architecture is LlamaForCausalLM; the tokenizer is a custom HybridDNATokenizer that switches into 6-mer DNA mode with the <dna> tag and therefore requires tokenizer trust_remote_code. This Forge profile serves the checkpoint through the already mirrored digest-pinned vLLM 0.22.0 CUDA 13 OpenAI-compatible image, pins the model, tokenizer, and remote code revisions to the reviewed commit, clears unnecessary Hugging Face token variables for this public artifact, uses versioned caches under /opt/nim/.cache, and limits the published surface to nonclinical genomics research. H200/eu-north2 passed the pinned event benchmark on 2026-06-05 with startup_ms=376755, p50=297 ms across 10 measured requests after one warmup, tokens_per_second=252.53, and vram_used_gb=115.26.

Fastest verified: H200 in eu-north2
Performance: 297 ms

Healthcare / Life Science

coldNative inference

MolMIM NIM

NVIDIA MolMIM NIM is a life-science small-molecule generation service over SMILES input. NVIDIA documentation pins the downloadable container to nvcr.io/nim/nvidia/molmim:1.0.0 and documents the /generate endpoint. Forge sets NIM_CACHE_PATH=/opt/nim/.cache so MolMIM uses the shared writable NIM cache mount instead of the image's read-only /home/nvs filesystem. The 1.0.0 image is mirrored into all active Forge regional registries by digest as of 2026-06-05. This is a nonclinical molecular-design research workflow and must not be positioned as clinical, therapeutic, safety, efficacy, or patient-specific guidance.

Fastest verified: H200 in eu-north2
Performance: 5.5s

Healthcare / Life Science

coldNative inference

d4data Biomedical NER All

d4data/biomedical-ner-all is an Apache-2.0 Hugging Face Transformers token-classification model for broad biomedical named entity recognition over case-report and biomedical text. Hugging Face API and model-card metadata rechecked on 2026-06-05T03:03Z report private=false, gated=false, disabled=false, library_name=transformers, pipeline_tag=token-classification, revision 015a4050c9ac99722e61c547aa9b4282bcbedc7f, cardData.license=apache-2.0, top-level license=null, 66,427,476 F32 safetensors parameters, model.safetensors SHA256 d744b846a71ce6ccdb49d7bfe5097eadc41e766ffd28481e1636ed796e820165, tokenizer.json, and DistilBertForTokenClassification config with 84 BIO labels. Existing model-specific Docker/GHCR image checks did not produce a safe Forge-compatible serving image: the d4data/GHCR refs are denied and the public Bytez image is mutable latest, requires an external Bytez API key, lacks Forge cache/checkpoint/product-safety contracts, and exposes generated text-generation docs unrelated to this token-classification workload. This hidden wrapper keeps executable serving code in the repo, verifies the pinned safetensors checkpoint before load, exposes /token_classification, /v1/token_classification, and /run, returns label/start/end/text/score spans plus model_time_ms, clears inherited Hugging Face token env vars, disables HF telemetry, runs as UID/GID 10001:10001, and routes Hugging Face, Transformers, Torch, XDG, and HOME caches under /opt/nim/.cache, which the Forge scheduler mounts from shared /mnt/data/forge-weights. The wrapper image was pushed and mirrored on 2026-06-05T04Z at sha256:a604e749b245bad5de4958b9499ca3591a905e115c41c87f162936327442c9fa; the first full matrix exposed a non-root permission bug with direct /mnt/data cache paths, now patched in source and pending a rebuilt cachefix image. It remains hidden/onboarding/default-ineligible and must not be used for medical advice, clinical decision-making, diagnosis, treatment, triage, medication recommendations, patient care, or patient-specific interpretation.

Fastest verified: H100 in eu-north1
Performance: 3 ms

General

coldOpenAI SDK

Ministral 3 3B Instruct 2512

Ministral 3 3B Instruct 2512 is a public Apache-2.0 compact instruction model from Mistral AI for lightweight chat, multilingual generation, JSON-style extraction, and agentic prompts. Hugging Face metadata refreshed on 2026-06-05T02Z reports private=false, gated=false, disabled=false, library_name=vllm, license=apache-2.0, revision cfcb068fa7c44114cf77a462357c6cdcd2c304b4, safetensors total parameters 3,849,090,048, and tags for vllm, mistral3, mistral-common, fp8, and safetensors. The model card describes an FP8 instruct checkpoint with a 3.4B language model plus 0.4B vision encoder, multilingual support, native function calling and JSON outputting, and a 256k context window. This active Forge testing profile intentionally exposes only text chat, caps served context to 32,768 tokens for the validated product surface, leaves tool and image inputs disabled until separate validation, clears unnecessary Hugging Face token variables for this public artifact, and uses the already mirrored vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416 with versioned Hugging Face and vLLM caches under the reviewed /opt/nim/.cache mount. The 2026-06-04T09Z live row includes explicit non-root runtime UID/GID 65532 so the stock vLLM image can satisfy Forge runAsNonRoot policy, and B300/uk-south1 passed a persisted hostPath probe with 3 warmups and 10 measured requests. The 2026-06-04T12Z remaining standard hostPath matrix persisted B200/us-central1, H200/us-central1, H200/eu-north1, H200/eu-north2, and L40S/eu-north1 as supported for the text-chat profile with 3 warmups and 10 measured requests per cell. RTX6000 was requested but not present in discovered inventory. The 2026-06-05T02Z publication pass promotes this text-chat surface to active/testing/non-default while keeping default routing, RTX6000, image input, and tool calling blocked pending separate validation.

Fastest verified: B300 in uk-south1
Performance: 326 ms

Healthcare / Life Science

coldNative inference

OpenMed ZeroShot NER Pathology Tiny 60M

OpenMed ZeroShot NER Pathology Tiny 60M is a public Apache-2.0 GLiNER token-classification model for disease and pathology-oriented biomedical entity extraction. The Hugging Face API rechecked on 2026-06-04T02Z reports private=false, gated=false, disabled=false, library_name=gliner, pipeline_tag=token-classification, revision 751c87f2dfa77800e1bead7f9fb40f5734078e47, tokenizer.json present, pytorch_model.bin SHA256 d84285fd6758d40f8d67809b91870acb9caaeee322b5daa4da132aac3fdb65eb, and Apache-2.0 license metadata. The model card describes high-precision disease NER tuned for research literature, NCBI Disease corpus grounding, and zero-shot flexibility through caller-specified entity labels. Forge did not find an official serving container suitable for the scheduler, so this row uses a Forge-owned GLiNER/FastAPI wrapper source added under services/models/hcls/openmed_zeroshot_ner_pathology_tiny_60m. The wrapper mirrors the existing OpenMed disease NER safety pattern: clear inherited Hugging Face token env vars for the public artifact, route caches to /opt/nim/.cache backed by Forge shared /mnt/data/forge-weights, require research_use_acknowledgement=true, expose /token_classification, /v1/token_classification, and /run, pin the HF revision, verify pytorch_model.bin SHA256 before load, force GLiNER's inference word splitter to whitespace by default to avoid non-required multilingual tokenizer extras, run as UID/GID 10001:10001, and return label/start/end/text/score spans plus model_time_ms. The 2026-06-04T01Z CVE-fix image was scanned with zero fixed HIGH/CRITICAL Trivy findings, pushed, and mirrored to every active Forge region at digest sha256:5f0047a8476633e6187c8386e3c48238e296e869593df45c2303c3f9b65ba1f7. The 2026-06-04T04Z standard matrix supports B200, B300, H200, and L40S with 3 warmups plus 10 measured token-classification requests per live cell using runtime-reported model_time_ms. RTX6000 remains false/unclassified because no RTX6000 inventory cell appeared in the probe run. Keep hidden, onboarding, and default-ineligible until healthcare product safety review approves the nonclinical pathology text-mining surface.

Fastest verified: H100 in eu-north1
Performance: 7 ms

General

coldNative inference

FLUX.2 Dev

FLUX.2 Dev is a gated, non-commercial Black Forest Labs image generation and editing model served through a Forge-owned Diffusers Flux2Pipeline BF16 wrapper. Hugging Face API metadata checked on 2026-06-03T09:05:09Z reports private=false, gated=auto, disabled=false, pipeline_tag=image-to-image, library_name=diffusers, revision 26afe3a78bb242c0a8bb181dcc8937bb16e5c66c, lastModified 2026-02-17T15:56:06.000Z, downloads=320657, likes=1720, license=other, and license_name=flux-non-commercial-license. The model has large safetensors artifacts including flux2-dev.safetensors at 64,446,596,128 bytes plus sharded transformer and text-encoder weights, so the selected wrapper writes Hugging Face, Diffusers, Torch, and XDG caches under /mnt/data. The container build based on pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime and Diffusers commit 4c77dcdbac6c75c8a1fdfaa1657c70f8930c8f3e succeeded with image ID sha256:7608fd3102154b7d8086337266a0c9c3e1f5d00d6b31bae5f6cb06bf829ebee7, size 8,296,349,976 bytes, and mirrored digest sha256:2a033f066904c01d1377b00aa28773fdea0e131da03f51b01ab0bde448acb918 in all active Forge regional registries. The 2026-06-03 GPU matrix supports B200, B300, H200, H100, and RTX6000, with L40S marked OOM. B200 and B300 are the preferred runtime targets for user-facing generation latency.

Fastest verified: B300 in uk-south1
Performance: 15.7s

General

coldOpenAI SDK

Devstral Small 1.1 2507

Devstral Small 1.1 2507 is a public Apache-2.0 text-only coding and software-engineering agent model from Mistral AI and All Hands AI. Hugging Face API metadata checked on 2026-06-02 reports private=false, gated=false, disabled=false, library_name=mistral-common, license:apache-2.0, revision bd165ab26cebbcc2eea2c4ecbfc07f3ac42b3c39, lastModified 2025-08-18T08:14:29Z, and 23,572,403,200 BF16 safetensors parameters. The upstream card describes Devstral as an agentic coding model with Mistral function-calling support, 24-language coverage, a 128K context window, and recommended vLLM serving via tokenizer_mode/config_format/load_format mistral plus tool-call-parser mistral and tensor-parallel-size 2. This active testing non-default Forge profile uses the mirrored official vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, replacing the earlier vLLM 0.21.0 draft image so the path is above the CVE-2026-48746 / GHSA-94f4-hr76-p5j6 affected range before publication. It caps served context to 32768 tokens, clears Hugging Face token environment variables because the artifact is public and ungated, and stores Hugging Face, Transformers, vLLM, and XDG caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-06-02T22Z worker live-upserted the prior TP2 draft as hidden onboarding and verified DB readback, but B200/us-central1 and H200/eu-north2 TP2 probes were blocked by current two-GPU placement capacity. The 2026-06-02T23Z worker mirrored and dry-run validated the vLLM 0.22.0 manifest update. The 2026-06-03T00Z worker rechecked compute inventory and found current B300/H200/RTX6000 cells are one-GPU node pools, so this row uses an explicit tensor-parallel-size 1 fallback. The 2026-06-03T01Z H200/eu-north2 TP1 probe completed startup, 3 warmups, and 10 measured coding-chat requests on the v0.22.0 digest. The 2026-06-03T02Z remaining-cell probe added supported B300/uk-south1 and RTX6000/us-central1 rows on the same digest. The 2026-06-03T03Z B200/us-central1 reprobe succeeded on a different B200 node and persisted a supported row; the same cycle classified L40S/eu-north1 as one-GPU BF16 CUDA OOM during model load. The 2026-06-03T04Z no-persist B200 tool-call smoke forced the Mistral parser to call a read_file function and the response preview contained a structured OpenAI tool_calls entry with arguments {"path":"src/retry.py"}.

Fastest verified: B300 in uk-south1
Performance: 616 ms

Healthcare / Life Science

coldNative inference

GROMACS Molecular Dynamics

GROMACS Molecular Dynamics is an active Forge runtime for bounded nonclinical MD simulations. Official GROMACS and NVIDIA NGC images are CLI/HPC oriented, so Forge serves a FastAPI wrapper around the NGC 2023.2 CUDA image. The wrapper can run a default argon Lennard-Jones smoke system, accept custom .gro/.top/.mdp text, or run a supplied base64 .tpr, and writes generated simulation artifacts under /mnt/data/model-outputs/gromacs-md-ngc-wrapper.

Fastest verified: L40S in eu-north1
Performance: 530 ms

Healthcare / Life Science

coldNative inference

OpenMM Molecular Dynamics

OpenMM Molecular Dynamics is an active Forge runtime for bounded nonclinical MD simulations. Upstream OpenMM is a Python/C++ toolkit for GPU-accelerated molecular simulation and this wrapper installs OpenMM 8.5.1 from conda-forge into a CUDA 12 runtime. The HTTP runtime executes a small argon Lennard-Jones example with configurable particle count, step count, platform, precision, nonbonded method, cutoff distance, integrator, and optional final-state readback. The 2026-06-04 build runs as a non-root Forge runtime user, reports simulated_ns and ns_per_day, reuses bounded OpenMM Context caches to remove repeated CUDA context-construction overhead for hot workloads, and exposes cutoff plus Verlet integrator modes for profiler-driven throughput experiments.

Fastest verified: L40S in eu-north1
Performance: 2 ms

General

coldOpenAI SDK

Qwen2.5 7B Instruct

Qwen Team 2025

Qwen2.5 7B Instruct is a public, ungated Apache-2.0 general instruction-following chat model for multilingual customer support, summarization, reasoning, and long-context text workflows. Hugging Face metadata rechecked on 2026-06-02 reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision a09a35458c702b33eeacc393d103063234e8bc28, and 7,615,616,512 BF16 safetensors parameters. The artifact config reports Qwen2ForCausalLM, model_type=qwen2, BF16 weights, 32,768 configured max positions, 28 layers, 28 attention heads, 4 KV heads, hidden size 3584, intermediate size 18944, vocab size 152064, rope_theta 1000000.0, and no auto_map. The model card lists Apache-2.0 licensing, states the current config is set for 32,768-token context, and recommends vLLM for deployment. This active/testing non-default Forge profile reuses the already mirrored Forge vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, clears unnecessary Hugging Face token env for this public artifact, stores Hugging Face and vLLM caches under versioned /mnt/data paths, and is backed by a 2026-06-02 matrix with B200, B300, H100, H200, L40S, and RTX6000 support. Keep default_eligible=false until product routing deliberately chooses a new general-chat default.

Fastest verified: B300 in uk-south1
Performance: 633 ms

General

coldOpenAI SDK

BGE Base EN v1.5

BAAI bge-base-en-v1.5 is a public MIT-licensed English embedding model for semantic search, RAG retrieval, clustering, and medium-cost passage ranking. Hugging Face API metadata checked on 2026-06-01T21Z reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=feature-extraction, license=mit, revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a, tags text-embeddings-inference and endpoints_compatible, and safetensors parameters I64=512 plus F32=109,482,240. The upstream config uses BertModel with 12 layers, 768 hidden size, 12 attention heads, 512 positions, and float32 weights; sentence-transformers metadata sets max_seq_length=512, lowercase tokenization, CLS pooling, and 768-dimensional embeddings. This active testing non-default profile reuses Forge's already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token variables because the artifact is public and ungated, and follows existing Forge TEI manifests by routing Hugging Face caches through /opt/nim/.cache paths backed by shared model cache storage. The 2026-06-01T21Z hidden B200/us-central1 probe persisted a supported row with 3 warmups and 10 measured requests: startup 201072 ms, client-wall request p50 160 ms, 168.75 tokens/s, and 0.96 GB VRAM. The 2026-06-01T22Z hidden H200/us-central1 probe persisted a supported row with 3 warmups and 10 measured requests: startup 291019 ms, client-wall request p50 158 ms, 170.89 tokens/s, and 0.86 GB VRAM. The 2026-06-01T23Z hidden L40S/eu-north1 probe persisted a supported row with 3 warmups and 10 measured requests: startup 371803 ms, client-wall request p50 96 ms, 281.25 tokens/s, and 0.69 GB VRAM. The 2026-06-01T23Z hidden B300/uk-south1 probe persisted an other_error row with pod_start_failure:model:terminated:Error, matching the pinned TEI CUDA 1.9.3 SM103 risk observed on sibling TEI profiles. The first 2026-06-02T20Z RTX6000/us-central1 probe hit a kubelet ContainerCreating EOF before model logs, but the 2026-06-02T20Z-next no-persist retry on a different RTX6000 node reached TEI readiness and completed 3 warmups plus 10 measured embedding requests: startup 314651 ms, client-wall request p50 167 ms, 161.68 tokens/s, and 0.84 GB VRAM. A follow-up persistent RTX6000/us-central1 probe then wrote a supported row with startup 192949 ms, p50 166 ms, 162.65 tokens/s, and 0.84 GB VRAM. The live manifest upsert in this cycle publishes RTX6000 compatibility while keeping B300 disabled.

Fastest verified: L40S in eu-north1
Performance: 96 ms

General

coldOpenAI SDK

Seed-OSS 36B Instruct

Seed-OSS 36B Instruct is a public Apache-2.0 dense causal language model from ByteDance Seed for long-context reasoning, coding, agentic workflows, tool-use style prompts, and international English/Chinese usage. Hugging Face API metadata refreshed on 2026-06-05T05Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 497f1dca95ebdec98e41d517b9f060ee753c902f, lastModified 2025-08-26T02:33:00Z, 40,027 downloads, 499 likes, and 36,151,104,512 BF16 safetensors parameters. The upstream config reports SeedOssForCausalLM, model_type seed_oss, 64 layers, hidden size 5120, 80 attention heads, 8 KV heads, 524,288 max positions, BF16 weights, and no remote-code auto_map. The model card states Apache-2.0 licensing, 36B parameters, GQA, a 512K context, flexible thinking budget control, reasoning, agentic intelligence, and vLLM usage. vLLM latest documentation lists SeedOssForCausalLM with ByteDance-Seed/Seed-OSS-36B-Instruct as supported, and the Seed-OSS recipe recommends vLLM serving with --enable-auto-tool-choice and --tool-call-parser seed_oss while also warning that recent support may require main-branch bits. This hidden Forge manifest uses the already mirrored official vLLM 0.22.0 CUDA 13 image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, caps served context to 32768 for initial probes, sets explicit non-root UID/GID 65532 for the stock vLLM image, and clears unnecessary Hugging Face tokens because the artifact is public and ungated. Persisted probes now support B200/us-central1, B300/uk-south1, and H200/us-central1 with 3 warmups and 10 measured requests each. The B300/SM103 runtime selected SeedOssForCausalLM, the seed_oss tool parser, FLASHINFER attention, HND KV cache layout, TRTLLM prefill attention, CUDA graph capture, and a 679k-token GPU KV cache before serving HTTP 200 responses. The H200 runtime selected FLASH_ATTN, loaded 67.48 GiB of weights, allocated 53.56 GiB KV cache, exposed 219,360 GPU KV tokens, and completed HTTP 200 chat responses. L40S and RTX6000 remain false until their own probes validate startup, memory headroom, and kernel dispatch.

Fastest verified: B300 in uk-south1
Performance: 11.0s

Healthcare / Life Science

coldOpenAI SDK

MedEmbed Large

MedEmbed Large is Abhinand Balachandran's Apache-2.0 sentence-transformers embedding model fine-tuned from BAAI/bge-large-en-v1.5 for medical and clinical information retrieval. Hugging Face metadata checked on 2026-06-01T08Z and rechecked on 2026-06-01T16Z, 2026-06-01T20Z, and 2026-06-01T21Z reports private=false, gated=false, disabled=false, library_name=sentence-transformers, revision 963121bfb9c625475f65b08fb54990ce9c4e7a1a, tags including medical-embedding, clinical-embedding, information-retrieval, safetensors, and license:apache-2.0, and safetensors total 335,141,888 F32 parameters. Raw config reports BertModel, hidden_size 1024, 24 layers, 16 attention heads, and max_position_embeddings 512; 1_Pooling/config.json enables CLS pooling and modules.json includes Normalize. The model is a higher-capacity sibling to the existing Forge MedEmbed Small and Base rows for nonclinical HCLS retrieval and RAG evaluation. Container search found no model-specific NIM or vendor serving image; Hugging Face TEI CUDA 1.9.3 is available and mirrored, but neighboring Forge TEI CUDA 1.9.3 embedding rows record B300/SM103 startup rejection. This manifest stages the already mirrored official vLLM 0.21.0 CUDA 13 image in pooling mode with CLS pooling plus activation, stores Hugging Face and vLLM caches under /mnt/data, and clears inherited Hugging Face token env vars because the artifact is public and ungated. The 2026-06-01T16Z hidden live upsert succeeded; the first retained B300/uk-south1 probe persisted other_error after a slow first image pull, the 2026-06-01T17Z retained B300 retry succeeded with 3 warmups, 10 measured requests, 78 ms median client-wall model time, 910.26 tokens/s, and 1.61 GB VRAM, and the 2026-06-02T00Z persisted remaining-cell probe supported B200/us-central1, H200/us-central1, H200/eu-north1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups plus 10 measured requests per cell. The 2026-06-01T22Z parity gate compared direct Transformers CLS plus L2 normalization against the vLLM CUDA 13 CLS activation endpoint over eight synthetic non-PHI HCLS retrieval/control texts and passed with min raw cosine 0.9999988465, max unit delta 0.0002835765, and mean vLLM norm 0.9999999966. Keep hidden/onboarding/default-ineligible until healthcare product-safety approval and catalog positioning complete.

Fastest verified: B300 in uk-south1
Performance: 78 ms

General

coldOpenAI SDK

Granite 4.1 8B

IBM Granite 4.1 8B is a public Apache-2.0 dense decoder-only general instruction model released on 2026-04-29 for enterprise assistant, RAG, coding, multilingual dialog, structured JSON, and tool-use workflows. Hugging Face API metadata checked on 2026-06-02T18Z reports private=false, gated=false, disabled=false, library_name=transformers, commit 1504002f650e656a0a3789d99574df12e3e94ed0, 112122 monthly downloads, 184 likes, safetensors BF16 parameters=8791592960, and license:apache-2.0. The model config reports GraniteForCausalLM, model_type=granite, BF16 dtype, 131072 max positions, 40 layers, 32 attention heads, 8 KV heads, hidden size 4096, intermediate size 12800, vocab size 100352, and RoPE theta 10000000. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, because IBM documents Granite 4.1 tool calling, vLLM documents GraniteForCausalLM support, and the selected image is already mirrored to all active Forge regions. The served context is capped at 32768 tokens for bounded one-GPU probes even though the artifact advertises 131072 positions. Tool calling is enabled with the vLLM granite4 parser because IBM and vLLM document Granite 4 tool-call support. The 2026-06-01T08Z hidden live upsert succeeded, retained Forge probes support B200/us-central1 and B300/uk-south1 with 3 warmups plus 10 measured chat requests per cell, and a no-persist B200 tool-parser smoke returned an OpenAI tool_calls response. The 2026-06-01T16Z retained H200/eu-north2 probe also passed with 3 warmups plus 10 measured chat requests, startup 436250 ms, p50 794 ms, p95 874 ms, p99 880 ms, 238.04 tokens/s, 120.07 GB VRAM, and vLLM logs showing FLASH_ATTN with FlashAttention 3. The 2026-06-02T18Z retained L40S/eu-north1 probe passed with 3 warmups plus 10 measured chat requests, startup 105888 ms, p50 2935 ms, 64.74 tokens/s, and 37.86 GB VRAM. The manifest advances to active/testing/non-default for B200, B300, H200, and L40S while RTX6000 stays disabled because current public inventory did not advertise an RTX6000 cell.

Fastest verified: B300 in uk-south1
Performance: 607 ms

General

coldOpenAI SDK

Granite Embedding Small English R2

IBM Granite Embedding Small English R2 is a public Apache-2.0, non-HCLS, non-physical-AI English dense embedding model for enterprise semantic search, RAG retrieval, long-document retrieval, code retrieval, table retrieval, and multi-turn conversational retrieval. Hugging Face metadata refreshed on 2026-06-04T16Z reports private=false, gated=false, disabled=false, pipeline_tag=feature-extraction, library_name=sentence-transformers, tags text-embeddings-inference and endpoints_compatible, revision 2ab6fa8ea2d674564defd37171ae19079b864b33, and 47,662,464 BF16 safetensors parameters. The model card reports a 47M ModernBERT-based bi-encoder with 384-dimensional output vectors and 8192-token context length, trained with permissive enterprise-friendly datasets plus IBM data. This active non-default Forge profile reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token variables for this public artifact, and keeps Hugging Face caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-06-01T03Z focused B200/us-central1 probe passed with 3 warmups, 10 measured /v1/embeddings requests, 196923 ms startup, 157 ms p50 client-wall request time, 229.3 tokens/s, and 0.84 GB VRAM. The 2026-06-01T05Z bounded matrix probe added H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 support with 3 warmups and 10 measured requests per cell; p50 client-wall request times were 98 ms on H200, 98 ms on L40S, and 160 ms on RTX6000. B300 remains disabled because this exact TEI image digest exits on SM103 with `cuda compute cap 103 is not supported`.

Fastest verified: L40S in eu-north1
Performance: 98 ms

General

coldOpenAI SDK

Qwen2.5-Coder 7B Instruct

Qwen Team 2025

Qwen2.5-Coder 7B Instruct is a public, ungated Apache-2.0 code-specialized instruction model for code generation, code repair, code reasoning, agentic developer workflows, and long-context code assistance. Hugging Face metadata checked on 2026-06-01 reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision c03e6d358207e414f1eca0bb1891e29f1db0e242, and 7,615,616,512 BF16 safetensors parameters. The config reports Qwen2ForCausalLM, model_type=qwen2, 32,768 max positions, 28 layers, 28 attention heads, 4 KV heads, hidden size 3584, intermediate size 18944, vocab size 152064, BF16 weights, and no auto_map or quantization_config. The model card describes Qwen2.5-Coder as a code-specific model family for developers, code agents, generation, reasoning, and fixing, notes the current config is set for 32,768-token context, and recommends vLLM for deployment. This active testing, non-default profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, reflects 2026-06-01T02Z five-cell GPU support evidence, clears unnecessary Hugging Face token env, and stores Hugging Face and vLLM caches under versioned /mnt/data paths. Keep default_eligible=false until product routing explicitly chooses this compact code-specialized model as a default code route.

Fastest verified: B300 in uk-south1
Performance: 1.1s

General

coldOpenAI SDK

BGE Small EN v1.5

BAAI bge-small-en-v1.5 is a public MIT-licensed English embedding model for semantic search, short RAG retrieval, clustering, and low-cost passage ranking. Hugging Face API metadata refreshed on 2026-05-31T16Z reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=feature-extraction, license=mit, revision 5c38ec7c405ec4b44b94cc5a9bb96e735b38267a, tags text-embeddings-inference and endpoints_compatible, and safetensors parameters I64=512 plus F32=33,360,000. The config uses BertModel with 12 layers, 384 hidden size, 12 attention heads, 512 positions, and no remote-code auto_map. The sentence-transformers metadata sets max_seq_length=512, lowercase tokenization, CLS pooling, and 384-dimensional embeddings. This active testing, non-default profile reuses Forge's already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token variables because the artifact is public and ungated, and follows existing Forge TEI manifests by routing Hugging Face caches through /opt/nim/.cache paths backed by shared model cache storage. The 2026-05-31T18Z persisted matrix validated B200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured requests per cell. B300/uk-south1 remains blocked on this pinned TEI image because a retained diagnostic pod exited with `cuda compute cap 103 is not supported`; use an SM103-capable TEI or CUDA 13 pooling fallback before claiming B300.

Fastest verified: B300 in uk-south1
Performance: 72 ms

Healthcare / Life Science

coldNative inference

OpenMed DiseaseDetect BioMed 335M

OpenMed-NER-DiseaseDetect-BioMed-335M is an Apache-2.0 Hugging Face Transformers token-classification model for disease entity recognition in biomedical text. The Hugging Face model page and API rechecked on 2026-06-04T05Z report a public, ungated, enabled, safetensors-backed model with library_name=transformers, pipeline_tag=token-classification, revision a62e2a2da5ff7d6b79024ba906f481581466ea61, Apache-2.0 metadata, and model.safetensors SHA256 5c32d7bd84387f349490cf84e2b3ee2ddc2e965fc9926dcc047da625587d1a26. The model card documents Transformers token-classification usage with aggregation_strategy and reports BC5CDR disease NER metrics of F1 0.9005, precision 0.8887, recall 0.9126, and accuracy 0.9838. This Forge-owned wrapper is the safe fallback for the previously inspected Bytez container because it keeps executable serving code in this repo, verifies the pinned checkpoint before load, exposes /token_classification, /v1/token_classification, and /run, returns label/start/end/text/score disease spans plus model_time_ms, clears inherited Hugging Face token env vars, disables HF telemetry, runs as UID/GID 10001:10001, and stores model/runtime caches under /opt/nim/.cache backed by Forge shared /mnt/data/forge-weights. The 2026-06-04T05Z r2 image replaces the prior CVE-blocked mirrored runtime with Transformers 5.3.0 and Hugging Face Hub 1.3.7, updates fixed OS/Python package findings, passes Trivy HIGH/CRITICAL fixed-vulnerability scanning, and is mirrored to all active Forge regions at digest sha256:7feb438879810b4670d9a7bff7fc7f022a3651ae9a6835731fdb953bf1fc1a8e. The 2026-06-04T08Z hidden live upsert and standard matrix probe persisted support for B200/us-central1, B300/uk-south1, H200/us-central1, H200/eu-north1, H200/eu-north2, and L40S/eu-north1 on this exact r2 digest with 3 warmups plus 10 measured requests per cell using model_time_ms. RTX6000 remains false because no RTX6000 inventory cell appeared in the run. It remains hidden/onboarding/default-ineligible and must not be used for medical advice, clinical decision-making, diagnosis, treatment, triage, medication recommendations, patient care, or patient-specific interpretation.

Fastest verified: B200 in us-central1
Performance: 7 ms

Physical AI

coldNative inference

SberRoboticsCenter GreenVLA 2B Base

GreenVLA 2B Base is an Apache-2.0 vision-language-action robot policy checkpoint from SberRoboticsCenter for Bridge/WidowX-style manipulation. This hidden Forge candidate uses a Forge-owned FastAPI wrapper that rejects real actuation and returns only a no-actuation action trace. The current fast-vendored digest is mirrored to all active regions, keeps the Qwen3-VL rotary position and image split-size CPU-grid patches for the prior B300/SM103 NVRTC blockers, vendors physical-intelligence/fast with startup hash verification, and is benchmarked on B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 10 measured warm no-actuation requests per cell. The 2026-06-02 B300 retained-pod capture records a full finite 1x10x7 no-actuation action trace and static safety metrics. Publication remains blocked pending live manifest upsert plus representative simulator replay, HIL dry run, or external robot-policy safety signoff.

Fastest verified: B300 in uk-south1
Performance: 136 ms

Healthcare / Life Science

coldOpenAI SDK

PubMedBERT Base Embeddings Matryoshka

NeuML PubMedBERT Base Embeddings Matryoshka is a public Apache-2.0 sentence-transformers model for nonclinical biomedical literature semantic search, clustering, RAG retrieval, and dimensionality-tradeoff experiments. Hugging Face metadata checked on 2026-05-31 reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=sentence-similarity, text-embeddings-inference/endpoints_compatible tags, revision 723775ee67d9a2d15e07cc1ca189445d1034a2fa, and 109,482,240 F32 safetensors parameters. The pinned config has BertModel architecture, hidden size 768, 12 layers, 12 attention heads, and 512 max positions; 1_Pooling/config.json reports mean pooling. This hidden Forge onboarding row reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all active regions, stores Hugging Face cache paths under shared /mnt/data, and clears inherited Hugging Face token env vars because the artifact is public and ungated. Hidden live upsert and the 2026-05-31 full matrix validated B200, H200, L40S, and RTX6000 with 10 measured warm embedding requests per supported cell. The 2026-05-31 cycle-2 parity pass compared a kept L40S TEI pod against direct CPU Transformers attention-mask mean pooling over 10 synthetic non-PHI texts; all outputs were 768-dimensional and min cosine against the unit-normalized reference was 0.9999922453. This TEI profile returns full 768-dimensional embeddings only, so selectable Matryoshka dimensions are not advertised. B300 remains disabled for this TEI CUDA 1.9.3 profile because the B300 pod log reported cuda compute cap 103 is not supported. Keep the model hidden until B300 fallback or explicit B300 exclusion and HCLS product safety review are complete.

Fastest verified: L40S in eu-north1
Performance: 90 ms

General

coldOpenAI SDK

Fara-7B

Microsoft Fara-7B is a public, ungated MIT-licensed 7B vision-language computer-use agent model. The model takes a user goal, browser screenshot, and action history text, then emits reasoning plus a tool-call style next action for web automation tasks while stopping at critical points that require user permission or sensitive information. Hugging Face metadata refreshed on 2026-05-31 reports private=false, gated=false, disabled=false, pipeline_tag=image-text-to-text, library_name=transformers, revision 2b5558315618e2cce8c617aaa0078d0a8d81d2d9, and license:mit. The artifact config reports Qwen2_5_VLForConditionalGeneration, model_type=qwen2_5_vl, bfloat16, 128,000 max positions, no auto_map, and no custom remote-code requirement. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, leaves the image default VLLM_ENABLE_CUDA_COMPATIBILITY=0 path in place, caps first serving to 32,768 tokens and one image per prompt, clears Hugging Face token env for this public artifact, allowlists raw.githubusercontent.com for remote image fetches, disables media URL redirects, and stores Hugging Face/vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-31 cycle2 probes supported B200, B300, H200, L40S, and RTX6000 with 3 warmups plus 10 measured requests per cell; the row is active/testing and not default-eligible.

Fastest verified: B300 in uk-south1
Performance: 390 ms

General

coldOpenAI SDK

BGE Large EN v1.5

BAAI bge-large-en-v1.5 served by Hugging Face Text Embeddings Inference CUDA 1.9.3 for English semantic retrieval, RAG indexing, passage search, and reranker preselection. Hugging Face API metadata refreshed on 2026-05-30 reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=feature-extraction, license=mit, revision d4aa6901d3a41ba39fb536a557fa166f842b0e09, tags text-embeddings-inference and endpoints_compatible, and 335,142,400 parameters. The upstream model card describes BGE as BAAI general embedding, lists bge-large-en-v1.5 as an English v1.5 retrieval embedding model with a more reasonable similarity distribution, and recommends the query prefix used in the default example for short-query-to-passage retrieval. This onboarding manifest reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a and leaves all GPU compatibility cells disabled until live Forge probes complete. B300 is explicitly kept disabled because the same TEI image has existing SM103 failures on other ModernBERT/BGE candidates.

Fastest verified: L40S in eu-north1
Performance: 88 ms

General

coldOpenAI SDK

OLMo 3 7B Instruct

AllenAI OLMo 3 7B Instruct is a public, ungated Apache-2.0 English instruction-following model in the fully open OLMo 3 family, aimed at chat, tool-use style prompts, coding, math, general reasoning, and long-context workflows. Live Hugging Face metadata checked on 2026-05-30T14Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 096bb5469fe34348bc88d851a69edb3bf6f40df4, license:apache-2.0, and 7,298,011,136 BF16 safetensors parameters. The artifact config reports Olmo3ForCausalLM, model_type=olmo3, 65,536 native max positions with YaRN rope scaling from 8,192, 32 layers, 32 attention heads, 32 KV heads, hidden size 4096, intermediate size 11008, vocab size 100278, and no auto_map. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps first serving to 32,768 tokens for bounded probes, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face/vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-30T15Z live matrix validated B200, B300, H200, L40S, and RTX6000 with 3 warmups and 10 measured warm chat-completion requests per cell, so this source manifest is advanced to active/testing/non-default while leaving default routing unchanged.

Fastest verified: B300 in uk-south1
Performance: 814 ms

Healthcare / Life Science

coldNative inference

MedCPT Cross Encoder

MedCPT Cross Encoder is NCBI's public-domain biomedical reranker for scoring a query jointly with candidate PubMed-style article text. The model card documents ranking articles for a query by tokenizing query/article pairs with max_length=512 and using a BertForSequenceClassification logit where higher scores indicate higher relevance. Hugging Face API metadata checked on 2026-05-30 reports private=false, gated=false, disabled=false, license_name=public-domain, library_name=transformers, pipeline_tag=text-classification, text-embeddings-inference and endpoints_compatible tags, revision 71caf65d4927987813984f54c284405a13fcca49, tokenizer.json present, and a single pytorch_model.bin LFS artifact at SHA256 61d5ccd48869e03500544525fc231641d7daa9ba267b202c82724750038dc1e0. This Forge-owned wrapper is the fallback for the previous TEI preflight blocker: it verifies the pinned checkpoint SHA256 before loading, refuses torch.load unless weights_only=True exists, instantiates BertForSequenceClassification from config, loads the verified state dict, exposes /rerank and /v1/rerank, returns sorted higher-is-more-relevant scores with model_time_ms, clears inherited Hugging Face token env vars, and stores Hugging Face, Transformers, Torch, XDG, and HOME caches under shared /mnt/data. The 2026-05-30T11Z image build produced config id sha256:9b3ea7fd74805f7e1ed38feecfe6cd36d647b5608a17a179a1ef0b868a4f0a8b, registry digest sha256:fbbde71387304186005f4dfdd85aaec4cb26ab235fb3674c933ac27b491a76a0, local size 7749696315 bytes, and 4267717479 compressed bytes. A CPU smoke loaded the pinned checkpoint with weights_only=True and ranked the diabetes-treatment review above central diabetes insipidus and hypertension. The hidden 2026-05-30T11Z Forge matrix supports B200, B300, H200, and L40S with 3 warmups plus 10 measured rerank requests per cell using model_time_ms; an explicit RTX6000/us-central1 probe found no matching inventory cell, so RTX6000 remains pending rather than failed. Keep onboarding/default-ineligible until HCLS license and product safety review complete.

Fastest verified: B200 in us-central1
Performance: 4 ms

Physical AI

coldNative inference

Robotics Diffusion Transformer RDT2-VQ

RDT2-VQ is an 8.29B-parameter autoregressive vision-language-action policy adapted from Qwen2.5-VL-7B-Instruct. It consumes a short imperative instruction plus right and left 384x384 wrist-camera observations and predicts a 24-step, 20D relative bimanual policy-space action chunk via a residual VQ action tokenizer. This is a hidden onboarding manifest because no official pullable serving image was verified. A Forge-owned FastAPI wrapper now builds locally, hydrates the model, RVQ tokenizer, and UMI normalizer into /mnt/data behind a per-artifact lock, and is pushed to all active Forge regional registries by digest. The SDPA runtime has 10 measured query_time_ms samples on B200, B300, H200, L40S, and RTX6000. A retained B300 full-response safety preflight showed the default smoke request returns finite actions but with max_abs_action_value 5155.09033203125 and max_abs_step_delta 1052.19091796875. The action-unit/normalizer review confirmed gripper units and upstream normalizer use, but the wrapper returns pre-actuation policy-space actions before convert_policy_to_tcp_space and get_real_umi_action, so publication remains blocked pending representative UMI replay, simulator replay, HIL dry run, or external robot-policy safety signoff.

Fastest verified: B200 in us-central1
Performance: 510 ms

Healthcare / Life Science

coldOpenAI SDK

MedEmbed Base

MedEmbed Base is Abhinand Balachandran's Apache-2.0 sentence-transformers embedding model fine-tuned from BAAI/bge-base-en-v1.5 for medical and clinical information retrieval. Hugging Face metadata checked on 2026-06-02T08Z reports private=false, gated=false, disabled=false, library_name=sentence-transformers, revision 7a90c50263f620dff743eb9794b89a42bfc5d765, and license:apache-2.0. Raw config reports BertModel, hidden_size 768, 12 layers, 12 attention heads, and max_position_embeddings 512; 1_Pooling/config.json enables CLS pooling and modules.json includes Normalize. The model is useful as a higher-capacity sibling to the existing Forge MedEmbed Small row for nonclinical HCLS retrieval and RAG evaluation. Container search found no model-specific NIM or vendor serving image; Hugging Face TEI CUDA 1.9.3 is available and mirrored, but the same digest is already blocked for B300/SM103 in neighboring TEI embedding rows. This manifest therefore stages the already mirrored official vLLM 0.21.0 CUDA 13 image in pooling mode with CLS pooling plus activation, stores Hugging Face and vLLM caches under /mnt/data, and clears inherited Hugging Face token env vars because the artifact is public and ungated. A 2026-06-02T08Z B300/uk-south1 probe persisted support with 10 measured requests, but the timing source is client wall time. Keep hidden/onboarding/default-ineligible until non-Blackwell matrix expansion, embedding parity check, and healthcare product safety review complete.

Fastest verified: B300 in uk-south1
Performance: 73 ms

General

coldOpenAI SDK

All-MiniLM-L6-v2

sentence-transformers/all-MiniLM-L6-v2 is a public Apache-2.0 sentence-transformers embedding model for semantic search, sentence similarity, clustering, and compact RAG retrieval over short English text. Hugging Face metadata checked on 2026-05-29T22Z reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=sentence-similarity, tags text-embeddings-inference and endpoints_compatible, revision c9745ed1d9f207416be6d2e6f8de32d1f16199bf, and 22,713,728 safetensors parameters. The model card reports 384-dimensional dense vectors and intended use for clustering or semantic search; sentence_bert_config.json sets max_seq_length=256, and 1_Pooling/config.json selects mean pooling. This hidden Forge onboarding profile reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token variables for this public artifact, and keeps Hugging Face caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-29T22Z hidden hostPath-cache matrix supports B200/us-central1, H200/eu-north2, and L40S/eu-north1 with 3 warmups and 10 measured embedding requests per cell. B300/uk-south1 failed before readiness on the same digest with `cuda compute cap 103 is not supported`, so B300 remains disabled until a newer TEI runtime or a parity-checked CUDA 13 pooling fallback is added. RTX6000 was not present in public inventory during this cycle.

Fastest verified: L40S in eu-north1
Performance: 89 ms

Healthcare / Life Science

coldNative inference

Biohub ESMC 600M Protein Embeddings

Biohub ESMC 600M is a 2026 protein language model for protein representation learning, protein engineering, variant-effect research, and masked-language modeling. Biohub's model card says ESMC learns protein-biology representations from billions of protein sequences and publishes 300M, 600M, and 6B variants; the 600M Hugging Face artifact is public, ungated, safetensors-backed, and reports 575,036,992 parameters at revision 465f75840fee10acc8c0db104ae244d8abb9288e. This hidden Forge entry now uses a dedicated Python 3.12 CUDA 12.8 FastAPI wrapper under services/models/hcls/biohub_esmc_600m that installs Biohub esm at c94ed8d, loads AutoModelForMaskedLM, pools non-special-token hidden states, hydrates weights into /mnt/data, and enforces research_use_acknowledgement=true. The 2026-05-30T06Z worker rechecked the Hugging Face artifact, confirmed the GitHub LICENSE.md is MIT, confirmed the linked THIRD_PARTY_NOTICE.md still returns 404, and verified all four regional Forge registry refs for the first wrapper image resolve to sha256:d6a79ef67c441f3da82671455154dfb5b12e79d18b47fa152b7949bb6062845e. The 2026-05-30T04Z B200/us-central1 hidden probe loaded the 600M weights and completed 3 warmups plus 10 measured protein-embedding requests with p50 14 ms from response-local model_time_ms, startup 95.965 s, and 2.0 GB VRAM. The 2026-05-30T06Z hidden remaining-matrix probe completed the same 3 warmups plus 10 measured requests on B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with p50 response-local model_time_ms of 15 ms, 16 ms, 24 ms, and 19 ms respectively. The 2026-05-30T08Z worker pushed and mirrored the slim multi-stage runtime image to all active Forge registries at sha256:902308e74ddac3b163159bf49696841a93c3429ef940b0eb64bf917bb13d82da after the repo mirror helper hit an in-cluster docker-config permission error; direct crane copy fallback preserved the region-local repository path. The slim remote image has 18 compressed layers totaling 7,009,385,444 bytes, a 39.89% compressed-layer reduction versus the first wrapper image, while preserving Torch, Transformers, esm, /mnt/data cache envs, and the acknowledgement gate. The 2026-05-30T09Z worker live-upserted the hidden onboarding manifest and completed a full slim-image matrix reprobe on all five live GPU cells with 3 warmups and 10 measured requests per cell; model-time p50s were B200 13 ms, B300 13 ms, H200 16 ms, L40S 24 ms, and RTX6000 19 ms, with no OOMs or failures, and support persisted for the onboarding row. The model remains default-ineligible until the third-party notice blocker and HCLS product safety review are resolved.

Fastest verified: B300 in uk-south1
Performance: 13 ms

General

coldOpenAI SDK

Mixedbread Embed Large v1

Mixedbread Embed Large v1 served by Hugging Face Text Embeddings Inference CUDA 1.9 for English dense retrieval, RAG, semantic search, clustering, and classification embedding workloads. Hugging Face API metadata checked on 2026-05-29 reports private=false, gated=false, disabled=false, Apache-2.0 license metadata, sentence-transformers library usage, feature-extraction pipeline usage, TEI and endpoints compatibility tags, revision b33106f585b9ce46904ad7443a3b52b7a63e231c, and 335,141,888 F16 safetensors parameters. The config reports BertModel, 1024 hidden size, 24 layers, 16 attention heads, 512 max positions, and float16 weights; the Sentence Transformers pooling config reports 1024-dimensional CLS pooling with prompt inclusion. The model card says retrieval queries should use the prompt `Represent this sentence for searching relevant passages:` and documents optional Matryoshka truncation. This Forge profile reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all Forge regions, serves the standard OpenAI-compatible dense embedding endpoint, clears unnecessary Hugging Face token env for the public artifact, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. B300 starts disabled for this pinned TEI CUDA 1.9.3 runtime because prior entrypoint audits for the same digest show no SM103 dispatch branch.

Fastest verified: L40S in eu-north1
Performance: 98 ms

General

coldOpenAI SDK

GTE ModernBERT Base

Alibaba-NLP gte-modernbert-base is a public Apache-2.0 English text embedding model for semantic search, RAG retrieval, long-document retrieval, and code retrieval. Hugging Face metadata refreshed on 2026-05-30T08Z reports private=false, gated=false, disabled=false, pipeline_tag=sentence-similarity, library_name=transformers, revision e7f32e3c00f91d699e8c43b53106206bcc72bb22, 457738 downloads, 196 likes, and 149014272 float16 safetensors parameters. The upstream card reports a 149M ModernBERT-based text embedding model with 8192-token input length, 768-dimensional output embeddings, CLS pooling with normalization in examples, competitive MTEB/BEIR/LoCo/COIR scores, and a documented Text Embeddings Inference OpenAI-compatible /v1/embeddings deployment path. This active testing Forge profile reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token variables for this public artifact, and keeps Hugging Face caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-30 standard hostPath matrix supports B200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured /v1/embeddings requests each. B300/SM103 remains runtime-incompatible with this exact TEI CUDA 1.9.3 digest: a retained diagnostic pod exited with `cuda compute cap 103 is not supported`.

Fastest verified: L40S in eu-north1
Performance: 98 ms

Healthcare / Life Science

coldOpenAI SDK

MedCPT Article Encoder

MedCPT Article Encoder is NCBI's public-domain biomedical dense retrieval encoder for PubMed-style article titles and abstracts. The Hugging Face model card describes MedCPT as a query/article encoder pair trained on 255M PubMed query-article pairs and documents the Article Encoder input as a list containing title and abstract text, with the last-layer CLS hidden state used as the article representation. The 2026-05-31T08Z artifact recheck reported private=false, gated=false, disabled=false, license_name=public-domain, library_name=transformers, pipeline_tag=feature-extraction, text-embeddings-inference and endpoints_compatible tags, revision d05a736da4bb84ee4057b7f7999485be6ed85465, tokenizer.json present, and safetensors total 109,482,240 F32 parameters. This Forge-owned wrapper is the scoped fallback for the existing MedCPT query-side routing blocker: it accepts explicit title and abstract fields, calls the tokenizer with title as text and abstract as text_pair, computes CLS embeddings with AutoModel, returns unit-normalized vectors by default to align with the current MedCPT Query Encoder TEI row, exposes model_time_ms for probes, rejects inference unless research_use_acknowledgement=true, clears inherited Hugging Face token env vars, and stores Hugging Face, Transformers, Torch, XDG, and HOME caches under shared /mnt/data. The 2026-05-31T08Z acknowledgement-gated image was built, pushed, mirrored to all four active Forge regions at digest sha256:25194349a8d43c6b6946348941fbf76a55ce07103c7d2457f0a0dc917bcd211d, passed in-image acknowledgement smoke, live-upserted hidden, and persisted support on B200, B300, H200, and L40S with 3 warmups plus 10 measured model_time_ms requests per schedulable cell. RTX6000/us-central1 exact-digest probing is not claimed because scheduling failed with insufficient GPU capacity during this cycle. Keep onboarding/default-ineligible until RTX6000 is reprobed or explicitly excluded, HCLS product safety review, and query-to-article product routing approval are complete.

Fastest verified: B200 in us-central1
Performance: 3 ms

Physical AI

coldNative inference

Hugging Face LeRobot X-VLA Base

X-VLA Base is an Apache-2.0 LeRobot vision-language-action policy checkpoint with 0.9B parameters, soft-prompted flow matching, three visual observations, an 8D robot state, and 20D ee6d robot actions. This hidden Forge candidate now has a digest-backed Forge HTTP wrapper image in all active regions, an import smoke, static dispatch audit, hidden DB upsert, completed five-cell Forge GPU probe coverage with 10 warm samples per cell, and a B300 no-actuation static action-trace preflight. It remains blocked from publication until robot-policy simulator/HIL safety validation exists.

Fastest verified: B200 in us-central1
Performance: 107 ms

General

coldOpenAI SDK

MiniCPM4 8B

MiniCPM4 8B is a public Apache-2.0 OpenBMB text-generation model for English/Chinese chat, instruction following, long-context summarization, and efficient edge-oriented general LLM workloads. Hugging Face API metadata checked on 2026-05-29T07Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=apache-2.0, revision bb2ae14cf59d4ca769c4e42ece54cc3b82a58ef7, lastModified 2025-10-24T08:32:25Z, and 8,185,253,888 BF16 safetensors parameters. The artifact config reports MiniCPMForCausalLM, model_type=minicpm, BF16 dtype, 32 layers, 32 attention heads, 2 KV heads, hidden size 4096, 73,448 vocab entries, 32,768 max positions, and an auto_map remote-code contract. The upstream model card documents OpenAI-compatible vLLM serving for openbmb/MiniCPM4-8B, notes that vLLM chat should send add_special_tokens=true, and states that MiniCPM4 natively supports 32,768-token context with optional longer-context RoPE modifications. This Forge profile intentionally uses the native 32K context only, enables --trust-remote-code, reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, clears unnecessary Hugging Face token env for this public ungated artifact, stores Hugging Face hub and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache, and moves Transformers dynamic modules to a writable /tmp path after the 08Z B200 diagnostic found PermissionError under /opt/nim/.cache/huggingface/modules. The 10Z L40S reprobe confirmed that cache mitigation and completed a five-cell hidden support matrix; keep hidden onboarding and default-ineligible until publication review.

Fastest verified: B300 in uk-south1
Performance: 1.4s

Healthcare / Life Science

coldOpenAI SDK

MedCPT Query Encoder

MedCPT Query Encoder is NCBI's public-domain biomedical dense retrieval encoder for short texts such as search queries, questions, and sentences. The Hugging Face model card describes MedCPT as a two-encoder system trained on 255M PubMed query-article pairs for semantic search, with the query encoder's last-layer CLS embedding used as a 768-dimensional representation in the same space as the MedCPT Article Encoder. Hugging Face API metadata checked on 2026-05-29 reports private=false, gated=false, disabled=false, license_name=public-domain, library_name=transformers, pipeline_tag=feature-extraction, text-embeddings-inference and endpoints_compatible tags, revision d83a36cc6b8e3a5c5e9d9d6ba156808c1643dcbc, tokenizer.json present, and safetensors total 109,482,240 F32 parameters. This hidden Forge candidate reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all active regions, explicitly sets POOLING=cls to match the model card, stores TEI/Hugging Face cache paths under shared /mnt/data, and clears inherited Hugging Face token env vars because the artifact is public and ungated. Hidden Forge probes on 2026-05-29 validated B200, H200, L40S, and RTX6000 with 3 warmups plus 10 measured requests each; B300 remains disabled because the same pinned TEI image exits immediately on SM103 with `cuda compute cap 103 is not supported`. A 2026-05-29T10Z L40S parity pass compared the kept TEI pod against direct Transformers CPU CLS pooling over 10 synthetic non-PHI retrieval/control texts: min cosine was 0.9999862475 and max unit-vector delta was 0.0032963447, while TEI returned unit-normalized vectors with mean norm 1.0000000008 versus direct raw CLS mean norm 11.9323803395. Keep the model hidden and default-ineligible until the Article Encoder/corpus path uses the same normalization policy and HCLS product safety review is complete.

Fastest verified: L40S in eu-north1
Performance: 94 ms

General

coldOpenAI SDK

Qwen3-VL 4B Instruct

Qwen Team 2025

Qwen3-VL 4B Instruct is a public Apache-2.0 vision-language chat model for OCR, visual question answering, document and screenshot understanding, visual coding assistance, and lightweight multimodal reasoning. Hugging Face API metadata checked on 2026-05-29 reports private=false, gated=false, disabled=false, pipeline_tag=image-text-to-text, library_name=transformers, revision ebb281ec70b05090aa6165b016eac8ec08e71b17, lastModified 2025-10-15T16:15:55Z, and 4,437,815,808 BF16 safetensors parameters. The upstream model card documents vLLM OpenAI-compatible /v1/chat/completions serving for Qwen/Qwen3-VL-4B-Instruct, image_url requests, native 256K long context, and video understanding; this first Forge profile intentionally exposes one image and no video, caps context to 32,768 tokens, and uses a public Qwen demo image from a single allowed media domain. The container decision reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, clears unnecessary Hugging Face token env for this public ungated artifact, disables media redirects, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-31T08Z reprobe aligned VLLM_ENABLE_CUDA_COMPATIBILITY with the image default and supported B200, B300, H200, L40S, and RTX6000 with 3 warmups plus 10 measured image-chat requests per cell; the row is active/testing and not default-eligible.

Fastest verified: B200 in us-central1
Performance: 596 ms

General

coldNative inference

Qwen3 Reranker 0.6B

Qwen Team 2025

Qwen3 Reranker 0.6B is an Apache-2.0 multilingual text reranking model for retrieval, RAG, code retrieval, and cross-lingual search. Hugging Face API metadata checked on 2026-05-29 reports private=false, gated=false, disabled=false, text-ranking usage, revision e61197ed45024b0ed8a2d74b80b4d909f1255473, and 595,776,512 BF16 safetensors parameters. The upstream model card states 100+ language support, 0.6B parameters, 32K context, instruction-aware reranking, Sentence Transformers CrossEncoder use, Transformers use, and vLLM usage requiring vLLM >=0.8.5. This Forge profile initially caps served request context to 8,192 tokens to match the upstream sample max_length and keep first probes bounded. The container decision reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, with Hugging Face and vLLM caches under /opt/nim/.cache backed by Forge's shared /mnt/data cache. vLLM 0.21 scoring docs expose /v1/rerank and document Qwen3 reranker serving through scoring/rerank APIs; this candidate applies the same Qwen3ForSequenceClassification no/yes classifier override pattern already validated by the Forge Qwen3 Reranker 4B profile. The 2026-05-29T13Z full matrix probe supports B200, B300, H200, L40S, and RTX6000 with 3 warmups and 10 measured rerank requests per cell. The 2026-05-29T15Z review promotes it to active/testing/non-default so it is directly addressable by slug while preserving default routing for established reranker choices.

Fastest verified: B300 in uk-south1
Performance: 70 ms

General

coldOpenAI SDK

MiniCPM5 1B

MiniCPM5 1B is a public Apache-2.0 compact causal language model from OpenBMB for local assistants, coding agents, tool-use style prompts, hybrid reasoning, English/Chinese chat, and long-context summarization. Hugging Face metadata checked on 2026-05-29T18Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 4e9de7a0778dc1c362e983e6858f0e77542cbdca, and Apache-2.0 card metadata. The model card reports 1,080,632,832 parameters, 24 layers, 16 query heads, 2 KV heads, 131,072-token native context, standard LlamaForCausalLM architecture, a vLLM quickstart requiring vLLM >= 0.21, and separate Think/No Think chat-template modes. This Forge profile reuses the already mirrored vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps first serving to 32,768 tokens for bounded GPU probes, leaves tool-calling disabled because upstream recommends SGLang's MiniCPM5 parser for native tool-call conversion, clears unnecessary Hugging Face token environment variables, and stores Hugging Face/vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-28T17Z L40S/eu-north1 smoke passed with 3 warmups and 10 measured chat requests; the 2026-05-28T18Z remaining schedulable-cell probe validated B300/uk-south1 and H200/eu-north2; the 2026-05-28T22Z missing-cell probe validated B200/us-central1; and the 2026-05-29T18Z L40S emptyDir fallback probe validated the exposed Thinking mode control with 3 warmups and 10 measured requests. Keep active/testing and non-default until RTX6000 inventory returns and separate long-context/tool-use gates pass.

Fastest verified: B300 in uk-south1
Performance: 229 ms

Healthcare / Life Science

coldOpenAI SDK

MedEmbed Small Biomedical Matryoshka v2

MedEmbed Small Biomedical Matryoshka v2 is a public Apache-2.0 sentence-transformers embedding model for biomedical semantic similarity and retrieval. Hugging Face metadata checked on 2026-05-28 reports private=false, gated=false, disabled=false, library_name=sentence-transformers, pipeline_tag=sentence-similarity, endpoints_compatible, text-embeddings-inference, revision 0e64af8703721ac187772ae93658360a80d72496, and 33,360,000 F32 safetensors parameters. The model card describes a 384-dimensional dense vector space, 512-token maximum sequence length, cosine similarity, CLS pooling, and Matryoshka training for lower-dimensional retrieval tradeoffs. This hidden Forge candidate reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all active Forge regions, stores TEI/Hugging Face cache paths under shared /mnt/data, and clears inherited Hugging Face token env vars because the artifact is public and ungated. The 2026-05-28T17Z L40S/eu-north1 probe and 2026-05-28T20Z B200/us-central1 plus H200/eu-north2 probes passed with 3 warmups and 10 measured embedding requests per cell. The 2026-05-28T21Z B300/uk-south1 TEI classification failed before readiness, and the kept diagnostic pod logged `cuda compute cap 103 is not supported`, matching adjacent TEI CUDA 1.9.3 embedding rows. Keep hidden/default-ineligible until RTX6000 inventory, a separate B300 fallback decision, embedding parity against existing MedEmbed rows, and healthcare product safety review are complete.

Fastest verified: L40S in eu-north1
Performance: 93 ms

General

coldOpenAI SDK

DeepSeek R1 0528 Qwen3 8B

Qwen Team 2025

DeepSeek R1 0528 Qwen3 8B is a public, non-gated MIT reasoning-oriented text-generation model distilled from DeepSeek-R1-0528 into a Qwen3 8B base model. Hugging Face metadata refreshed on 2026-05-28T14Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=mit, revision 6e8885a6ff5c1dc5201574c8fd700323f23c25fa, Qwen3ForCausalLM architecture, 131,072 max positions, and 8,190,735,360 BF16 safetensors parameters. The model card documents vLLM serving for deepseek-ai/DeepSeek-R1-0528-Qwen3-8B and reports the 8B distillation as state-of-the-art among open-source models on AIME 2024 at release time. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded first probes, enables the vLLM deepseek_r1 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-28T11Z hidden matrix probe validated B200, B300, H200, and L40S with 10 measured warm requests per cell; RTX6000 remains disabled because no current RTX6000 inventory cell was available. The 2026-05-28T14Z publication pass switched the playground to non-streaming JSON output so vLLM responses with message.reasoning and message.content=null remain visible until the web streaming parser grows first-class reasoning-delta rendering.

Fastest verified: B300 in uk-south1
Performance: 4.1s

Healthcare / Life Science

coldOpenAI SDK

EMBO SODA-VEC Dot/Std/Cov

EMBO SODA-VEC Dot/Std/Cov is a MIT-licensed sentence-transformers ModernBERT biomedical and life-science literature embedding model trained on PubMed Central title-abstract pairs with a VICReg-style dot, standard deviation, and covariance objective. Hugging Face metadata rechecked on 2026-05-28T11Z reports the artifact public, ungated, enabled, sentence-transformers based, safetensors-backed, tagged text-embeddings-inference and endpoints_compatible, revision ec602c16dd5e973d36128ba03bd430927c21754a, and built on an Apache-2.0 ModernBERT base. The final model files include model.safetensors, tokenizer.json, modules.json, sentence_bert_config.json, and mean-pooling configuration. This hidden Forge onboarding manifest reuses the existing regionally mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, routes Hugging Face cache into a versioned /mnt/data path, and clears unnecessary Hugging Face token variables for this public artifact. Hidden live upsert has completed and live probe rows now support B200/us-central1, H200/eu-north2, and L40S/eu-north1 with 10 measured embedding requests each. B300 remains deliberately false after a repeated UK probe failed at startup with `cuda compute cap 103 is not supported`; the companion vLLM CUDA 13 pooling manifest is the validated B300 fallback and passed normalized TEI/vLLM parity on synthetic non-PHI HCLS retrieval text. RTX6000 is still unclaimed because current public inventory did not expose an RTX6000 cell during the 13Z cycle. Keep hidden/onboarding/default-ineligible until HCLS product safety approval is complete.

Fastest verified: L40S in eu-north1
Performance: 97 ms

General

coldOpenAI SDK

Qwen3 0.6B

Qwen Team 2025

Qwen3 0.6B is a public Apache-2.0 dense text-generation model for lightweight instruction following, short reasoning, coding prompts, tool-use style prompts, multilingual chat, and low-cost Forge smoke workloads. Hugging Face API metadata refreshed on 2026-05-28T08Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision c1899de289a04d12100db370d81485cdf75e47ca, Qwen3ForCausalLM architecture, 40,960 max positions, and 751,632,384 BF16 safetensors parameters. The upstream model card documents 32,768-token context and vLLM serving for Qwen/Qwen3-0.6B. This Forge profile uses the already mirrored official vLLM 0.22.0 CUDA 13 OpenAI-compatible image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416 to move the active non-default route above the GHSA-94f4-hr76-p5j6 / CVE-2026-48746 affected vLLM OpenAI API range. The profile caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-06-03T05Z supervisor cycle live-upserted the vLLM 0.22.0 manifest and persisted a B300/uk-south1 probe with 3 warmups, 10 measured chat requests, HTTP 200, p50 286 ms, and 653.85 tokens/s. The 2026-06-03T12Z supervisor cycle reran L40S/eu-north1 on the same vLLM 0.22.0 digest after a prior pod_deleted attempt; the retained reprobe passed with startup 572,969 ms, p50 462 ms over 10 measured chat requests, 454.55 tokens/s, 26.96 GB VRAM, and HTTP 200. B200 and H200 still need current vLLM 0.22.0 reprobes before treating the full historical matrix as refreshed. A targeted RTX6000 probe could not schedule because the visible RTX6000 nodes were tainted node.cluster.x-k8s.io/uninitialized:NoSchedule, so RTX6000 is not claimed in gpu_compatibility.

Fastest verified: B300 in uk-south1
Performance: 284 ms

Healthcare / Life Science

coldOpenAI SDK

SapBERT PubMedBERT Entity Embeddings

SapBERT PubMedBERT is an Apache-2.0 biomedical entity representation model trained with UMLS 2020AA entity names using PubMedBERT as the base model. The upstream repository is public, ungated, enabled, safetensors-backed, and tagged for biomedical feature extraction and entity linking, but it publishes vocab.txt plus tokenizer_config.json without tokenizer.json. The earlier TEI CUDA 1.9.3 candidate failed before readiness while trying to download tokenizer.json. This hidden Forge-owned wrapper is the tokenizer fallback path: it loads the same pinned revision with AutoTokenizer(use_fast=false), computes last-layer CLS embeddings through AutoModel, serves /v1/embeddings with model_time_ms and token usage, clears inherited Hugging Face token env vars, and stores Hugging Face, Transformers, Torch, XDG, and HOME caches under /mnt/data. The wrapper image is built, mirrored, live-upserted, supported across the tracked Forge GPU matrix, and passes direct-Transformers CLS parity on non-PHI biomedical entity strings. Keep onboarding/default-ineligible until HCLS product safety review approves the research-only UX.

Fastest verified: B200 in us-central1
Performance: 3 ms

General

coldOpenAI SDK

Snowflake Arctic Embed M v2.0

Snowflake Arctic Embed M v2.0 served by Hugging Face Text Embeddings Inference CUDA 1.9 for lower-latency multilingual semantic search, enterprise RAG, clustering, and retrieval workloads. Hugging Face API metadata checked on 2026-05-28 reports private=false, gated=false, disabled=false, Apache-2.0 license metadata, sentence-transformers library usage, sentence-similarity pipeline, endpoints compatibility, 74 listed languages, and 305,368,320 F32 safetensors parameters. The model config reports the GTE architecture, 768 hidden size, 12 layers, 12 attention heads, and 8192 max positions; the Sentence Transformers pooling config reports 768-dimensional CLS pooling with normalization. This Forge profile reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all Forge regions, serves the standard OpenAI-compatible dense embedding endpoint, clears unnecessary Hugging Face token env for the public artifact, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. It remains onboarding-only, non-default, and unpublished until live Forge probes validate at least one GPU cell with 3 warmups and 10 measured embedding requests.

Fastest verified: L40S in eu-north1
Performance: 90 ms

Physical AI

coldNative inference

RAIL Berkeley Octo Small

Octo Small 1.5 is a 27M-parameter transformer-based generalist robot policy trained on a mix of Open X-Embodiment robot datasets. It consumes language-conditioned primary and wrist camera observations with a history window up to two timesteps and predicts 7D robot actions four steps into the future through a diffusion policy. This Forge row is hidden because the patched digest has retained 3-warmup/10-run support on RTX6000, B200, L40S, H200, and B300, while the newer action-clamp digest has so far cleared B300/uk-south1 with 3 warmups, 10 measured requests, and strict Bridge action-bounds checks. Keep hidden until the action-clamp digest completes the remaining four-cell reprobe and robot-policy safety clearance through simulator replay, HIL dry run, or external signoff.

Fastest verified: H200 in eu-north2
Performance: 33 ms

Healthcare / Life Science

coldOpenAI SDK

MedEmbed Small

MedEmbed Small is an Apache-2.0 sentence-transformers embedding model fine-tuned from BAAI/bge-small-en-v1.5 for medical and clinical information retrieval. The Hugging Face API reports the artifact as public, ungated, enabled, safetensors-backed, and tagged for medical-embedding, clinical-embedding, information-retrieval, MedicalQARetrieval, NFCorpus, PublicHealthQA, TRECCOVID, and license:apache-2.0. This hidden Forge onboarding manifest reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all active regions, points Hugging Face cache paths at Forge shared /mnt/data storage, and clears inherited Hugging Face token env vars because the model is public and ungated. The 2026-05-27T18Z full matrix upserted five supported cells with 3 warmups and 10 measured requests each on B200, H200, L40S, and RTX6000; B300/SM103 failed at pod start and remains disabled pending diagnostics or a separate vLLM CUDA 13 pooling fallback. Keep the row onboarding/default-ineligible until B300 is classified and healthcare product safety review approves publication.

Fastest verified: H200 in eu-north1
Performance: 87 ms

General

coldOpenAI SDK

Granite 3.3 8B Instruct

IBM Granite 3.3 8B Instruct is a public Apache-2.0 text-generation model for enterprise-style instruction following, coding, reasoning, structured summaries, and long-context document or meeting summarization. Hugging Face API metadata refreshed on 2026-05-27T17Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 51dd4bc2ade4059a6bd87649d68aa11e4fb2529b, and 8,170,864,640 BF16 safetensors parameters. The model config reports GraniteForCausalLM, model_type=granite, BF16 dtype, 40 layers, 32 attention heads, 8 KV heads, hidden size 4096, intermediate size 12800, vocab size 49159, and 131072 max positions. The model card documents 128K context, improved reasoning and instruction-following capabilities, coding and long-context use cases, permissively licensed and synthetic training data sources, and Apache 2.0 licensing. NVIDIA NIM model tables list ibm-granite/granite-3.3-8b-instruct version 1.8.4, but direct OCI inspection of nvcr.io/nim/ibm-granite/granite-3.3-8b-instruct:1.8.4 returned authentication required during this cycle, so this Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9. The profile caps served context to 32768 tokens for first validation, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache.

Fastest verified: B300 in uk-south1
Performance: 1.7s

General

coldOpenAI SDK

Nomic Embed Text v2 MoE

Nomic Embed Text v2 MoE is a public Apache-2.0 multilingual text embedding model for retrieval, RAG indexing, semantic search, and clustering. Hugging Face API metadata refreshed on 2026-05-27T11Z reports private=false, gated=false, disabled=false, license:apache-2.0, sentence-transformers usage, safetensors weights, a 1.90 GB model.safetensors artifact, and text-embeddings-inference/endpoints_compatible tags. The model card describes 475M total parameters, 305M active parameters during inference, 8 experts with top-2 routing, 768-dimensional Matryoshka embeddings that can be truncated to 256 dimensions, 512-token maximum input length, around 100 supported languages, and required search_query/search_document prefixes. This hidden Forge onboarding profile reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears unnecessary Hugging Face token environment variables for the public artifact, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. All GPU compatibility cells are disabled until a live model-specific smoke probe validates TEI loading for this custom-code NomicBERT MoE artifact; B300 is expected to need a newer SM103-capable TEI runtime because the pinned TEI CUDA 1.9.3 entrypoint rejects SM103 in sibling Forge TEI profiles.

Fastest verified: L40S in eu-north1
Performance: 106 ms

General

coldOpenAI SDK

OLMo-2 7B Instruct

AllenAI OLMo-2 1124 7B Instruct is a public Apache-2.0 English instruction-following and chat model in the OLMo-2 open-science family. Hugging Face API metadata refreshed on 2026-05-27T10Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 470b1fba1ae01581f270116362ee4aa1b97f4c84, and 7,298,617,344 BF16 safetensors parameters. The model config reports Olmo2ForCausalLM, model_type=olmo2, BF16 dtype, 32 layers, 32 attention heads, 32 KV heads, hidden size 4096, vocab size 100352, and 4096 max positions. The model card documents Apache 2.0 licensing, OLMo-specific SFT, DPO, and RLVR post-training, OpenAI-compatible vLLM serving instructions, and a note that some fine-tuning data includes outputs from third-party models subject to additional terms. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, serves the native 4096-token context, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. It remains onboarding and default-ineligible until live GPU probes produce support and latency evidence.

Fastest verified: B300 in uk-south1
Performance: 1.1s

General

coldOpenAI SDK

OLMo-2 1B Instruct

AllenAI OLMo-2 1B Instruct is a public Apache-2.0 compact instruction-following text-generation model derived from the OLMo-2 0425 1B line and post-trained with supervised fine-tuning, DPO, and RLVR data for chat, math, GSM8K-style reasoning, and IFEval-style instruction following. Hugging Face API metadata refreshed on 2026-05-27T07Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 48d788eca847d4d7548f375ad03d3c9312f6139e, and 1,484,916,736 BF16 safetensors parameters. The artifact config reports the standard Olmo2ForCausalLM architecture, model_type=olmo2, BF16 dtype, 16 layers, 16 attention heads, hidden size 2048, intermediate size 8192, 100,352 vocab entries, and 4,096 max positions with no auto_map. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, serves the native 4,096-token context, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face and vLLM caches under /opt/nim/.cache paths backed by Forge shared /mnt/data cache. The 2026-05-27T07Z live matrix supports B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured chat requests per cell. Keep active/testing and non-default until product review decides whether this open-science small model should be exposed in broader general routing.

Fastest verified: B300 in uk-south1
Performance: 302 ms

General

coldOpenAI SDK

Qwen3 14B

Qwen Team 2025

Qwen3 14B is a public Apache-2.0 dense text-generation model for hybrid reasoning, instruction following, multilingual chat, agent-style prompts, and long-context summarization. Hugging Face API metadata reports private=false, gated=false, disabled=false, 14,768,307,200 BF16 safetensors parameters, architecture Qwen3ForCausalLM, and text-generation usage. The upstream model card reports 14.8B total parameters, 40 layers, grouped-query attention with 40 Q heads and 8 KV heads, native 32,768-token context, and 131,072-token YaRN extension guidance. This hidden Forge profile reuses the already mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db, enables the vLLM qwen3 reasoning parser documented for vLLM 0.10.2, caps served context to 32,768 tokens for bounded probes, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-27T05Z hidden matrix supported H200, L40S, and RTX6000, but B200 and B300 failed during pod startup, so the profile remains hidden/onboarding and default-ineligible pending Blackwell runtime investigation.

Fastest verified: H200 in eu-north2
Performance: 2.5s

Healthcare / Life Science

coldNative inference

BIOMEDICA BMC-CLIP CF

BIOMEDICA BMC-CLIP CF is a biomedical CLIP-style vision-language model released with the CVPR 2025 BIOMEDICA work. The upstream project reports broad biomedical image coverage across pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology, with concept-filtered BMCA-CLIP-CF outperforming BioMedCLIP on most reported general biomedical imaging classification subsets. The Hugging Face artifact is public, ungated, disabled=false, MIT-licensed, and pinned here at commit c882d526f3d5cb9073035435dd66f6a374ca6515 with checkpoint SHA256 fbbf956aab2af68273b26797bdcde4dc61304ab819512b7e4bfe174058919664. Forge did not find an official serving container, so this hidden entry uses a Forge-owned OpenCLIP/FastAPI wrapper that hydrates the 5.13 GB checkpoint into /mnt/data and exposes image-text classification, similarity, and embedding modes. The 2026-05-27T04Z Docker Buildx fallback built and pushed the wrapper to all active Forge regional registries; the follow-up safe-load image digest sha256:c88435e1c80a12508126db6e12fa6f15ec535ab2dab23f21aaa30e3e44baf68d verifies the pinned checkpoint hash and allowlists NumPy scalar checkpoint metadata for PyTorch 2.6+ weights-only loading. The 2026-05-27T05Z hidden full-matrix probe completed on B200, B300, H200, L40S, and RTX6000 with 3 warmups and 10 measured classify requests per cell; all cells returned HTTP 200 with 768-dimensional embeddings and model_time_ms timing. The 2026-05-27T07Z corpus probe added deterministic synthetic non-PHI PNG examples covering histopathology, radiology, microscopy, ophthalmology, dermatology, cell biology, surgery/endoscopy, molecular biology, and parasitology; all five cells supported 30 measured model_time_ms samples. The 2026-05-27T08Z pass adds research-only UX copy and a request acknowledgement field to the manifest. The 2026-05-29T18Z pass rebuilt that source into digest sha256:3715eb9a4e9c8b18899fba64b392b0a0ee75fe7fcbcee6b475662fcae6865186, mirrored it to all four active Forge regional registries, verified the container rejects requests without research_use_acknowledgement=true while accepting acknowledged research requests, live-upserted the hidden row, and persisted B200/B300/H200/L40S support for the new image. The 2026-06-03T09Z RTX6000/us-central1 acknowledgement-image reprobe passed and persisted on the same digest with startup 354131 ms, model_time_ms p50 11 ms, and 2.33 GB VRAM. It is not a clinical diagnostic service and must remain hidden until healthcare product safety review confirms the allowed nonclinical research scope and approves routable use.

Fastest verified: B200 in us-central1
Performance: 9 ms

General

coldOpenAI SDK

Microsoft Phi-4 Mini Reasoning

Microsoft Phi-4 Mini Reasoning is a public MIT-licensed 3.8B-parameter dense text-generation model for multi-step mathematical reasoning, symbolic problem solving, formal proof-style prompts, and compact reasoning use cases where latency and GPU footprint matter. Hugging Face API metadata refreshed on 2026-05-27 reports private=false, gated=false, disabled=false, library_name=transformers, pipeline_tag=text-generation, license=mit, revision 0e3b1e2d02ee478a3743abe3f629e9c0cb722e0a, Phi3ForCausalLM architecture, 131,072 native max positions, and 3,836,021,760 BF16 safetensors parameters. The Hugging Face model page exposes generated vLLM instructions using `vllm serve "microsoft/Phi-4-mini-reasoning"` and OpenAI-compatible `/v1/chat/completions`. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by Forge's shared /mnt/data cache. The 2026-05-27T01Z live matrix supports B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured chat requests per cell. Keep the model active/testing, non-default, and 32K-capped until product review validates whether this math-specialized Phi profile should be exposed in broader general routing or receive a separate 128K long-context profile.

Fastest verified: B300 in uk-south1
Performance: 2.7s

General

coldOpenAI SDK

Qwen3 30B A3B Instruct 2507 FP8

Qwen Team 2025

Qwen3 30B A3B Instruct 2507 FP8 is a public Apache-2.0 non-thinking MoE chat model for general instruction following, coding, multilingual knowledge, long-context understanding, and tool-use style prompts. Hugging Face API metadata refreshed on 2026-05-26 reports private=false, gated=false, disabled=false, commit 5a5a776300a41aaa681dd7ff0106608ef2bc90db, qwen3_moe architecture, Qwen3MoeForCausalLM, 30,533,947,392 safetensors parameters, F8_E4M3 plus BF16 tensors, and text-generation usage. The model card documents 30.5B total parameters, 3.3B activated parameters, 128 experts with 8 active experts, a native 262,144-token context, Apache-2.0 licensing, fine-grained FP8 quantization with 128x128 weight blocks, and OpenAI-compatible vLLM/SGLang serving. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded first probes, clears unnecessary Hugging Face token env for this public ungated artifact, records H200, L40S, and RTX6000 as supported from the 2026-05-26T11Z partial matrix, and adds B300 support from the 2026-05-26T12Z DeepGEMM fallback probe. B200 remains disabled because the 2026-05-26T13Z retained diagnostic reproduced the failure on a fresh B200 node and captured the startup traceback: vLLM reaches Qwen3MoeForCausalLM profile_run, then Torch Inductor executes torch.ops._C.cutlass_scaled_mm.default and aborts with RuntimeError: dispatch_scaled_mm at /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_helper.hpp:17.

Fastest verified: H200 in eu-north2
Performance: 877 ms

General

coldOpenAI SDK

Qwen3 1.7B

Qwen Team 2025

Qwen3 1.7B is a public Apache-2.0 dense text-generation model for lightweight reasoning, instruction following, coding, tool-use prompts, multilingual chat, and long-context smoke workloads. Hugging Face API metadata refreshed on 2026-05-26T09Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision 70d244cc86ccca08cf5af4e1e306ecf908b1ad5e, Qwen3ForCausalLM architecture, 40,960 max positions, and 2,031,739,904 BF16 safetensors parameters. The upstream model card documents native 32,768-token context and vLLM serving for Qwen/Qwen3-1.7B. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-26T07Z live onboarding matrix completed 3 warmups and 10 measured chat requests on all five active GPU cells with client-wall median p50s: B200/us-central1 832 ms, B300/uk-south1 658 ms, H200/eu-north2 778 ms, L40S/eu-north1 1958 ms, and RTX6000/us-central1 1308 ms. B300/SM103 started successfully on the same CUDA 13 digest and selected FlashInfer attention. The 2026-05-26T09Z worker promotes the model to active/testing while keeping default_eligible=false so customers can target the direct slug without changing broader general-model routing policy.

Fastest verified: B300 in uk-south1
Performance: 658 ms

Healthcare / Life Science

coldOpenAI SDK

PubMedBERT Base Embeddings

NeuML PubMedBERT Base Embeddings is an Apache-2.0 sentence-transformers model fine-tuned from PubMedBERT for biomedical sentence and paragraph embeddings. The Hugging Face model card describes 768-dimensional vectors for clustering, semantic search, and RAG over PubMed-style biomedical text, and the Hugging Face API reports public, ungated, enabled metadata with sentence-transformers, safetensors, text-embeddings-inference, and endpoints-compatible tags. This Forge onboarding manifest reuses the already mirrored Hugging Face Text Embeddings Inference CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across all Forge regions, points Hugging Face cache paths at Forge shared /mnt/data storage, and clears inherited Hugging Face token env vars because the model is public and ungated. Hidden probes support B200, H200, L40S, and RTX6000 with 3 warmups plus 10 measured requests per cell. The 2026-05-26T21Z B300 probe scheduled and pulled the same digest-pinned TEI CUDA 1.9.3 image on SM103, but the container exited before readiness with cuda compute cap 103 is not supported, so B300 remains disabled for this pinned runtime. Keep hidden/onboarding and non-default until healthcare product safety review completes; intended use is nonclinical biomedical literature retrieval, not medical advice or patient-specific decision support.

Fastest verified: H200 in eu-north2
Performance: 96 ms

General

coldOpenAI SDK

Qwen3 8B

Qwen Team 2025

Qwen3 8B is a public Apache-2.0 dense text-generation model for hybrid reasoning, instruction following, coding, agent-style prompts, multilingual chat, and long-context summarization. Hugging Face API metadata refreshed on 2026-05-26T05Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, revision b968826d9c46dd6066d109eabc6255188de91218, Qwen3ForCausalLM architecture, 40,960 max positions, and 8,190,735,360 BF16 safetensors parameters. The upstream model card documents native 32,768-token context with 131,072-token YaRN extension guidance plus vLLM serving for Qwen/Qwen3-8B. This Forge profile uses the already mirrored official vLLM 0.22.0 CUDA 13 OpenAI-compatible image digest sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416, caps served context to 32,768 tokens for bounded probes, enables the vLLM qwen3 reasoning parser, clears unnecessary Hugging Face token env for this public ungated artifact, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-06-03T05Z supervisor cycle moved this active testing route off vLLM 0.21.0 because GHSA-94f4-hr76-p5j6 / CVE-2026-48746 is patched in vLLM 0.22.0, verified all four regional mirrors, live-upserted the patched manifest, and ran a focused B300/uk-south1 no-persist smoke before keeping the route active. The historical 2026-05-26T05Z and 2026-05-26T10Z vLLM 0.21.0 onboarding matrices completed 3 warmups and 10 measured chat requests on all five active GPU cells with client-wall median p50s: B200/us-central1 1879 ms, B300/uk-south1 1809 ms, H200/eu-north2 2201 ms, L40S/eu-north1 9224 ms, and RTX6000/us-central1 5706 ms. The route remains active/testing with default_eligible=false so customers can target the direct slug without changing broader general-model default routing policy.

Fastest verified: B300 in uk-south1
Performance: 1.8s

General

coldOpenAI SDK

DeepSeek R1 Distill Qwen 14B

DeepSeek R1 Distill Qwen 14B is a public reasoning-oriented text-generation model derived from Qwen2.5-14B and fine-tuned with DeepSeek-R1 samples. Hugging Face metadata refreshed on 2026-05-25T16Z reports private=false, gated=false, disabled=false, pipeline_tag=text-generation, library_name=transformers, license=mit, revision 1df8507178afcc1bef68cd8c393f61a886323761, Qwen2ForCausalLM architecture, 131072 max positions, and 14,770,033,664 BF16 safetensors parameters. This Forge profile reuses the already mirrored official vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, caps served context to 32,768 tokens, enables the vLLM deepseek_r1 reasoning parser, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. B200/us-central1, B300/uk-south1, H200/us-central1, H200/eu-north1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 have supported probe evidence with at least 10 measured requests per cell. The profile is published as active/testing and kept default-ineligible while customer-facing reasoning quality, cost, and cold-start behavior are reviewed.

Fastest verified: B300 in uk-south1
Performance: 1.8s

Physical AI

coldNative inference

AllenAI MolmoAct2 BimanualYAM

MolmoAct2-BimanualYAM is AllenAI's fine-tuned MolmoAct2 checkpoint for bimanual YAM robot manipulation. It maps a natural-language instruction, three RGB camera observations in top/left/right order, and a 14D bimanual state vector into continuous robot actions using the yam_dual_molmoact2 normalization tag. The candidate has public safetensors artifacts, Apache-2.0 source/model licensing evidence, a Forge-owned CUDA13 wrapper image mirrored to every active Forge region, and successful B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 representative 3-warmup/10-measured-run probes using the upstream ai2-cortex sample frames. B300 cache-cold accounting, static physical-action bounds validation, and robot-policy trajectory preflight are recorded. Publication remains blocked until simulation replay or external robot-policy safety signoff, performance-copy separation, and publication review are complete.

Fastest verified: B200 in us-central1
Performance: 353 ms

General

coldOpenAI SDK

Qwen3 Embedding 4B

Qwen Team 2025

Qwen3 Embedding 4B served by Hugging Face Text Embeddings Inference CUDA 1.9.3 for mid-tier multilingual semantic search, code retrieval, clustering, and RAG. The upstream public Hugging Face model metadata reports Apache-2.0 licensing, no gating, 4,021,774,336 BF16 safetensors parameters, sentence-transformers feature-extraction usage, and text-embeddings-inference compatibility. The model card describes Qwen3 Embedding 4B as a 32k-context, instruction-aware text embedding model with 2,560 output dimensions, and includes both vLLM and TEI deployment examples. This Forge profile exposes the standard OpenAI-compatible dense embedding endpoint, reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears inherited Hugging Face token env vars for this public model, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-27T00Z live probe supports H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured embedding requests per cell. Keep hidden/onboarding and non-default until B200 is probed and B300 has a CUDA 13 or other SM103-compatible embedding runtime; the retained B300 diagnostic pod exits immediately with `cuda compute cap 103 is not supported`.

Fastest verified: H200 in eu-north2
Performance: 100 ms

General

coldNative inference

Qwen3-VL Reranker 2B

Qwen Team 2025

Qwen3-VL Reranker 2B is an Apache-2.0 multimodal reranker for text, image, screenshot, and visual-document retrieval workflows. Hugging Face API metadata checked on 2026-05-25T08:05Z reports private=false, gated=false, disabled=false, text-ranking usage, revision 4bd860ac4f15ad1897a214615cccc700f8f71818, lastModified 2026-04-16T08:55:33Z, 338,562 downloads, and 2,127,532,032 BF16 safetensors parameters. The model card says the 2B reranker supports 30+ languages, 32K context, text/images/screenshots/videos, and multimodal document reranking; this Forge profile starts with text plus image inputs and caps runtime context at 4096 tokens to match the vLLM online example and reduce first-start memory risk. The container decision reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, with Hugging Face and vLLM caches under /opt/nim/.cache backed by Forge shared /mnt/data cache. Live Forge probes support B200/us-central1, B300/uk-south1, H200/eu-north2, H200/us-central1, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured multimodal rerank requests per cell. Request timing is client wall time because vLLM /v1/rerank did not expose model execution timing. Video input remains intentionally unexposed until Forge request mapping, benchmark payload stability, and remote-media policy are validated.

Fastest verified: B300 in uk-south1
Performance: 642 ms

Healthcare / Life Science

coldOpenAI SDK

Llama 3.1 Nemotron Nano 8B Healthcare Text2SQL

NVIDIA's Llama 3.1 Nemotron Nano 8B Healthcare Text2SQL NIM translates natural-language healthcare analytics questions plus DDL into SQL. This Forge candidate is useful for nonclinical self-service analytics and research data exploration over de-identified clinical schemas. It uses digest-pinned regional mirrors and must remain hidden and default-ineligible until GPU probes pass and healthcare product safety review approves nonclinical positioning. It must not be used for medical advice, clinical decision-making, diagnosis, treatment, triage, or patient-specific record interpretation.

Fastest verified: B300 in uk-south1
Performance: 163 ms

General

coldNative inference

Qwen3 Reranker 4B

Qwen Team 2025

Qwen3 Reranker 4B is an Apache-2.0 multilingual text reranking model for retrieval, RAG, code retrieval, and cross-lingual search. Hugging Face API metadata checked on 2026-05-25 reports private=false, gated=false, disabled=false, text-ranking usage, revision 22e683669bc0f0bd69640a1354a6d0aebcfeede5, and 4,021,784,576 BF16 safetensors parameters. The upstream model card states 100+ language support, 4B parameters, 32K context, and instruction-aware reranking; this Forge profile initially caps served request context to 8,192 tokens to match the upstream sample max_length and keep first probes bounded. The container decision reuses the already mirrored official vLLM 0.21.0 CUDA 13 image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9, with Hugging Face and vLLM caches under /opt/nim/.cache backed by Forge's shared /mnt/data cache. vLLM 0.21 scoring docs expose /v1/rerank and document the Qwen3 reranker pooling runner with Qwen3ForSequenceClassification plus no/yes classifier overrides. Live Forge probes now support B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 3 warmups and 10 measured rerank requests per cell. This row is active, testing, non-default, and non-latest while longer-context and production traffic behavior continue to bake.

Fastest verified: B300 in uk-south1
Performance: 72 ms

General

coldOpenAI SDK

Multilingual E5 Large Instruct

intfloat multilingual-e5-large-instruct served by Hugging Face Text Embeddings Inference CUDA 1.9 for multilingual instruction-conditioned dense retrieval and RAG embeddings. Hugging Face API metadata reports private=false, gated=false, disabled=false, MIT license metadata, sentence-transformers library usage, feature-extraction pipeline, xlm-roberta architecture, text-embeddings-inference compatibility, endpoints compatibility, 94 listed languages, and 559,890,432 safetensors parameters. The selected runtime reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a across Forge regions and stores Hugging Face caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. B300 remains disabled for this pinned TEI image because the entrypoint dispatches SM75, SM80-89, SM90, SM100, and SM120 only; B300 reports SM103 and exits before server startup.

Fastest verified: L40S in eu-north1
Performance: 97 ms

General

coldOpenAI SDK

Snowflake Arctic Embed L v2.0

Snowflake Arctic Embed L v2.0 served by Hugging Face Text Embeddings Inference CUDA 1.9 for multilingual semantic search, enterprise RAG, clustering, and retrieval workloads. Hugging Face metadata reports the model is public, ungated, Apache-2.0, sentence-transformers based, text-embeddings-inference compatible, and backed by 567,754,752 safetensors parameters. The model card describes Arctic Embed 2.0 as a multilingual retrieval model that preserves English retrieval quality, supports 74 languages, 1024-dimensional embeddings, Matryoshka-friendly compression, and up to 8192-token inputs. This initial Forge profile reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, serves the standard OpenAI-compatible dense embedding endpoint, clears unnecessary Hugging Face token env for the public artifact, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Keep status onboarding until live Forge matrix probes classify all current GPU cells.

Fastest verified: H200 in eu-north2
Performance: 94 ms

Physical AI

coldNative inference

Hugging Face LeRobot SmolVLA RobotWin

SmolVLA RobotWin is an Apache-2.0 LeRobot SmolVLA policy checkpoint fine-tuned from lerobot/smolvla_base on the pepijn223/robotwin_unified_v3 dataset. The published config.json describes three 256px RGB camera observations and a 6D robot state, but the shipped policy preprocessor renames RobotWin camera keys cam_high, cam_left_wrist, and cam_right_wrist into camera1, camera2, and camera3, while the normalizer safetensors stores 14D observation.state statistics. This hidden Forge candidate reuses the already mirrored SmolVLA wrapper image. The 2026-05-28T21Z visible-cell exact-digest reprobe supports L40S, H200, B300, and B200 with a processor-aligned 14D state payload, and the live hidden support rows were updated. The 2026-05-28T22Z cross-cell static safety check captured full action chunks from retained pods on the same four cells and verified finite 1x50x14 outputs within upstream action bounds. The 2026-05-29T01Z artifact hygiene pass confirmed upstream Apache-2.0 metadata and repaired the missing plain regional tag aliases for the digest-pinned wrapper image outside us-central1. The 2026-05-29T04Z RTX6000/us-central1 probe completed on the same digest with 10 measured query_time_ms samples and upserted the hidden support row. Simulator replay, HIL, or external robot-policy safety signoff remains required before publication.

Fastest verified: B300 in uk-south1
Performance: 145 ms

Healthcare / Life Science

coldOpenAI SDK

Llama3 OpenBioLLM 8B

Llama3 OpenBioLLM 8B is a public biomedical instruction-tuned Llama 3 8B derivative intended for biomedical question answering and research workflows. This Forge onboarding manifest keeps the model hidden and non-default while license and safety review work proceed. Hugging Face API metadata checked on 2026-05-24 reports private=false, gated=false, disabled=false, license=llama3, commit 70d6bb521cab6ca755b675ade38831eedf89d31c, LlamaForCausalLM architecture, bfloat16 weights, and 8192 position embeddings. The selected container is the already mirrored vLLM 0.21.0 CUDA 13 OpenAI-compatible image digest sha256:a230095847e93bd4df9888b33dab956fa9504537b828a23657d2b26fed57b5c9. A 2026-05-24 B200 smoke reached vLLM but the chat-completions request failed because the tokenizer has no chat template, so the manifest now uses /v1/completions with an explicit Llama 3 instruct prompt envelope and stop tokens. B200, B300, H200, L40S, and RTX6000 now have supported probe evidence with non-empty instruct-envelope completions. The default playground prompt is limited to nonclinical biomedical literature triage. The model must not be used for medical advice, clinical decision-making, diagnosis, treatment, triage, or patient-specific interpretation.

Fastest verified: B200 in us-central1
Performance: 168 ms

Healthcare / Life Science

coldNative inference

ESM-2 3B Protein Embeddings

ESM-2 3B is a nonclinical protein language model for sequence representation learning and downstream protein analysis. The Hugging Face model card identifies MIT licensing, a 36-layer 3B-parameter checkpoint, Transformers support, and masked-language-model protein sequence use. This Forge onboarding entry reuses the existing life-science transformer wrapper and existing regional ESM-2 3B wrapper images, with all active regional images digest-pinned on 2026-05-23. Runtime caches are directed to /mnt/data. The hidden onboarding row was upserted and standard probes passed with 10 measured warm model-time samples on B200/us-central1, B300/uk-south1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1. Publication dedupe is resolved: do not expose this staging slug as a separate public model; use it as the digest-pinned replacement source for the existing facebook-esm-2-3b public route or an explicit backwards-compatible alias.

Fastest verified: H200 in eu-north1
Performance: 15 ms

General

coldOpenAI SDK

Qwen3 Embedding 0.6B

Qwen Team 2025

Qwen3 Embedding 0.6B served by Hugging Face Text Embeddings Inference CUDA 1.9.3 for low-cost multilingual semantic search, code retrieval, clustering, and RAG. The upstream public Hugging Face model metadata reports Apache-2.0 licensing, no gating, 595,776,512 BF16 safetensors parameters, sentence-transformers feature-extraction usage, and text-embeddings-inference compatibility. The model card describes the 0.6B embedding model as a 32k-context, instruction-aware, multilingual embedding model with up to 1024 output dimensions. This initial Forge profile exposes the standard OpenAI-compatible dense embedding endpoint, reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, clears inherited Hugging Face token env vars for this public model, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Hidden probes on 2026-05-23 support L40S/eu-north1, B200/us-central1, H200/eu-north2, and RTX6000/us-central1 with 10 measured warm embedding requests per cell. Keep hidden/onboarding and non-default while B300/SM103 remains disabled for this pinned TEI CUDA 1.9.3 runtime.

Fastest verified: H200 in eu-north2
Performance: 97 ms

General

coldOpenAI SDK

SmolLM3 3B

SmolLM3 3B is a public Apache-2.0 small general language model from Hugging Face for instruction following, hybrid reasoning, multilingual chat, tool-use style prompts, and long-context summarization. Hugging Face API metadata reports private=false, gated=false, disabled=false, 3,075,098,624 BF16 safetensors parameters, and text-generation usage. The model card describes SmolLM3 as an instruct model with dual-mode reasoning, a 64k trained context, optional 128k YaRN extrapolation, recommended temperature 0.6 and top_p 0.95, and vLLM deployment with the Hermes tool-call parser. vLLM 0.10.2 supported-model docs list SmolLM3ForCausalLM and HuggingFaceTB/SmolLM3-3B through the Transformers backend. This Forge profile reuses the already mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db, caps served context to 32,768 tokens for bounded first probes, disables thinking by default for predictable short responses, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Focused probes on 2026-05-22T22Z through 2026-05-23T02Z support B200/us-central1, L40S/eu-north1, H200/eu-north2, and RTX6000/us-central1 with 10 measured warm chat requests per supported cell; B300/uk-south1 remains disabled because the same vLLM CUDA 12.8 stack fails during Torch/Triton compile on SM103 with ptxas rejecting gpu-name sm_103a.

Fastest verified: H200 in eu-north2
Performance: 732 ms

Physical AI

coldNative inference

Robotics Diffusion Transformer RDT-170M

RDT-170M is a 170M-parameter vision-language-action diffusion policy for mobile-manipulator control. It maps a language instruction, RGB camera history, control frequency, and 14D Mobile ALOHA proprioception into a 64-step robot action chunk. This hidden candidate now has a digest-pinned Forge wrapper that verifies the upstream pytorch_model.bin hash, converts it to safetensors in the shared cache, and has supported 10-run warm probes across B200, B300, H200, L40S, and RTX6000 plus a completed B300/SM103 dispatch audit; it must not be published until checkpoint conversion compliance accepts the remaining one-time torch.load(weights_only=True) step.

Fastest verified: B200 in us-central1
Performance: 94 ms

Healthcare / Life Science

coldNative inference

BioEmu v1.1

BioEmu v1.1 is a Microsoft Research AI for Science protein monomer conformational ensemble sampler. Upstream documents MIT licensing, Linux Python 3.10+ package support, BioEmu v1.1 as the checkpoint used for the Science paper, and outputs consisting of FASTA, topology PDB, and XTC trajectory files. This Forge entry is deliberately hidden: no official portable serving OCI image was found, so Forge wrapper code now exists under services/models/hcls/microsoft_bioemu_v1_1. The wrapper image has been built, digest-pinned across all active Forge regions, import-smoked in us-central1, hidden-probed on H200/eu-north2 with real sequence.fasta, topology.pdb, samples.xtc, npz, and zip outputs and on L40S/eu-north1 with HTTP 200 BioEmu artifact metadata and 10 measured warm model_time_ms samples. The deployed-image matrix is classified as failing on B200/us-central1, B300/uk-south1, and RTX6000/us-central1 because each returned HTTP 500/Internal Server Error on the first request with no measured benchmark samples. The 2026-05-23T11Z B200 AlphaFold FP32 diagnostic override image passed a full 3-warmup/10-measured probe with p50 7696 ms model_time_ms, proving the B200 blocker is fixable by the FP32 AlphaFold patch; the 12Z cycle mirrored that diagnostic image to every Forge region, but it still needs promotion and a fresh active-image matrix before B200 can be represented as supported in gpu_compatibility. Runtime caches for Hugging Face model files, AlphaFold2/ColabFold weights, embeddings, SO(3) precomputations, and outputs are routed under /mnt/data. The model is for nonclinical protein monomer research and is not intended for clinical decision-making, new protein sequence generation, protein-protein interaction modeling, or small-molecule interaction modeling.

Fastest verified: H200 in eu-north2
Performance: 6.2s

General

coldNative inference

BGE Reranker v2 M3

BAAI bge-reranker-v2-m3 served by Hugging Face Text Embeddings Inference CUDA 1.9 for multilingual retrieval reranking. Hugging Face API metadata reports private=false, gated=false, disabled=false, Apache-2.0 license metadata, text-classification pipeline usage, XLMRobertaForSequenceClassification architecture, and 567,755,777 safetensors parameters. The model card describes bge-reranker-v2-m3 as a lightweight multilingual reranker that scores query-passage pairs directly and is easy to deploy for retrieval pipelines. This onboarding entry reuses the already mirrored TEI CUDA 1.9 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, serves the native /rerank endpoint, clears unnecessary Hugging Face token env for this public artifact, and stores Hugging Face caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Live Forge probes now support B200/us-central1, H200/eu-north2, and L40S/eu-north1; B300/uk-south1 returned pod_failed on 15Z and 19Z, and image audit shows the TEI CUDA 1.9 entrypoint lacks an SM103 dispatch branch, so B300 remains disabled until a B300-capable TEI image or Forge-owned wrapper is validated. RTX6000 was not in the discovered inventory and remains disabled until probed or explicitly marked unavailable.

Fastest verified: L40S in eu-north1
Performance: 94 ms

General

coldOpenAI SDK

Qwen3-Coder 30B A3B Instruct FP8

Qwen Team 2025

Qwen3-Coder 30B A3B Instruct FP8 is a public Apache-2.0 coding-focused mixture-of-experts model for agentic coding, browser-use style tasks, repository-scale prompts, tool-call style workflows, and general code generation. Hugging Face metadata refreshed on 2026-05-31 reports the model as public and ungated with 30,533,947,392 safetensors parameters, Qwen3MoeForCausalLM architecture, and fine-grained FP8 quantization; the model card reports 30.5B total parameters, 3.3B activated parameters, 128 experts, 8 active experts, and a native 262,144-token context window. This Forge onboarding profile uses the mirrored vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db after the vLLM 0.21.0 CUDA 13 image failed on a B200 node with CUDA driver error 803. The runtime caps served context to 32,768 tokens to match the upstream OOM mitigation guidance, passes the model both as the vLLM serve target and explicit engine --model argument, enables the vLLM qwen3_coder tool-call parser observed in the vLLM 0.10.2 API server, sets max-num-seqs to 16 to avoid the Blackwell warmup shape failure seen with 4, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Current probes validate this corrected CUDA 12.8 path with 3 warmups and 10 measured chat requests on B200/us-central1 p50 3916 ms, H200/eu-north2 p50 4292 ms, L40S/eu-north1 p50 7616 ms, and RTX6000/us-central1 p50 7232 ms. B300/uk-south1 remains disabled because the 2026-05-31 13Z SM103 probe pulled the regional image successfully but the model container terminated before readiness with `pod_start_failure:model:terminated:Error`. Keep the row hidden and non-default until B300 is fixed or explicitly classified unsupported and product review decides whether to expose this CUDA 12.8 FP8 Coder profile.

Fastest verified: B200 in us-central1
Performance: 3.9s

General

coldOpenAI SDK

OpenAI gpt-oss 20B

OpenAI gpt-oss 20B is a public Apache-2.0 open-weight text reasoning model for instruction following, coding, tool-use style prompts, structured outputs, and agentic workflows. Hugging Face reports the artifact as public and ungated with 21,511,953,984 safetensors parameters, GptOssForCausalLM architecture, native MXFP4 MoE quantization, and 131,072-token max positions. This Forge profile uses the already mirrored vLLM 0.21.0 CUDA 13 OpenAI-compatible image, caps served context to 32,768 tokens for bounded probes, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The 2026-05-24T08Z B200 canary reached vLLM 0.21.0 engine startup but failed CUDA initialization while the manifest forced VLLM_ENABLE_CUDA_COMPATIBILITY=1; this manifest now leaves the CUDA 13 image default compatibility mode in place. The 2026-05-24T10Z full matrix probe succeeded on B200, B300, H200, and L40S with 3 warmups and 10 measured chat requests per supported cell. A focused 2026-05-24T20Z RTX6000 reprobe resolved the prior ReadError, reached HTTP 200, and completed 10 measured warm chat requests on the same digest. The checked-in manifest is active, testing, and non-default so operators can select the five-GPU profile without replacing any default route.

Fastest verified: B300 in uk-south1
Performance: 287 ms

General

coldOpenAI SDK

BGE-M3

BAAI BGE-M3 served by Hugging Face Text Embeddings Inference CUDA 1.9 for multilingual semantic retrieval, long-document RAG, and dense embedding use. The public Hugging Face model metadata reports MIT licensing, sentence-transformers sentence-similarity usage, text-embeddings-inference compatibility, and no gating. The upstream model card describes dense, sparse, and multi-vector retrieval support across more than 100 languages with inputs up to 8192 tokens and 1024-dimensional embeddings; this Forge TEI variant exposes the standard OpenAI-compatible dense embedding endpoint only. It reuses the already mirrored TEI CUDA 1.9.3 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Live Forge probes on 2026-05-22 support B200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1; B300/uk-south1 returned pod_start_failure:pod_failed and remains disabled pending triage.

Fastest verified: L40S in eu-north1
Performance: 94 ms

General

coldOpenAI SDK

Qwen3 4B Instruct 2507

Qwen Team 2025

Qwen3 4B Instruct 2507 is a public Apache-2.0 text-generation model for instruction following, coding, multilingual knowledge, tool-use style prompts, and long-context understanding. The upstream card reports 4.0B parameters, 36 layers, grouped-query attention, and a native 262,144-token context window; this initial Forge onboarding variant caps vLLM at 32,768 tokens to keep first probes bounded across current GPU cells. It uses the already mirrored official vLLM 0.10.2 CUDA 12.8 OpenAI-compatible image digest sha256:607442e407b0fea97f8a132a78b787c121a996dd4de181fa08e8da06e71ec2db after the Qwen3-Coder thread showed vLLM 0.21.0 CUDA 13 failing on the current B200 driver stack with cudaGetDeviceCount error 803. Hidden probes now support B200/us-central1 with p50 881 ms client wall time, H200/eu-north2 with p50 900 ms client wall time, L40S/eu-north1 with p50 2348 ms client wall time, and RTX6000/us-central1 with p50 1780 ms client wall time. B300/uk-south1 is still disabled because the same vLLM image fails on SM103 during Torch/Triton compile with ptxas rejecting gpu-name sm_103a. The runtime passes both the serve target and explicit --model argument, selected Flash Attention on RTX6000 during the 2026-05-23T14Z probe, and stores Hugging Face and vLLM caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights.

Fastest verified: B200 in us-central1
Performance: 889 ms

General

coldNative inference

GTE Reranker ModernBERT Base

Alibaba-NLP gte-reranker-modernbert-base served by Hugging Face Text Embeddings Inference CUDA 1.9 for English retrieval reranking. The upstream model is Apache-2.0, has 149M parameters, supports 8192-token inputs, and documents TEI deployment through the native /rerank endpoint. This onboarding entry reuses the already mirrored TEI CUDA 1.9 image digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a and keeps Hugging Face caches under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. GPU compatibility remains false until live Forge probes classify B200, B300, H200, L40S, and RTX6000 cells with the native TEI rerank request shape.

Fastest verified: L40S in eu-north1
Performance: 101 ms

Healthcare / Life Science

coldNative inference

ESM-2 650M Protein Embeddings

ESM-2 650M is a non-clinical protein language model for sequence representation learning. The Hugging Face model card identifies MIT licensing, a 33-layer 650M-parameter checkpoint, and masked-language-model protein sequence use. This Forge onboarding entry uses the existing Forge life-science transformer wrapper to return pooled protein embeddings plus sequence metadata. Regional Forge wrapper images were inspectable and digest-pinned on 2026-05-22; runtime caches are directed to /mnt/data. A four-cell onboarding probe on 2026-05-22 validated the wrapper on the live B200, B300, H200, and L40S regions with 10 warm model-time samples per cell. Keep this slug non-routable until Forge decides whether it should replace or alias the existing public facebook-esm-2-650m entry.

Fastest verified: B300 in uk-south1
Performance: 13 ms

General

coldOpenAI SDK

Qwen3 Embedding 8B

Qwen Team 2025

Qwen3 Embedding 8B served by Hugging Face Text Embeddings Inference CUDA 1.9 for multilingual semantic search, code retrieval, clustering, and RAG. The upstream model is Apache-2.0, public, ungated, sentence-transformers based, tagged text-embeddings-inference/endpoints_compatible, and backed by 7,567,295,488 BF16 safetensors parameters. This onboarding entry has verified regional TEI image mirrors at digest sha256:249a0bc87522bfe2f1012b4d194f0225878f47079115ada3aeb0b1ef257b402a, uses TEI MODEL_ID/DTYPE/PORT env overrides, and stores Hugging Face cache under the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. Focused non-public probes support H200/eu-north2, B200/us-central1, L40S/eu-north1, and RTX6000/us-central1 with 10 measured warm embedding requests per supported cell. B300/uk-south1 was probed on 2026-05-22 with the same TEI CUDA 1.9 digest and remains unsupported because the container exits on SM103 with `cuda compute cap 103 is not supported`. Keep non-public and non-default until a Blackwell-compatible TEI/runtime is selected for B300 or product explicitly publishes this four-cell profile with B300 disabled.

Fastest verified: H200 in eu-north2
Performance: 104 ms

Physical AI

coldNative inference

NVIDIA Cosmos 3 Omni (Nano)

NVIDIA 2026

Early-access Cosmos 3 Omni Nano world-generation and action model served through a Forge CUDA 13 wrapper. It supports text-to-image, text-to-video, image-to-video, forward dynamics, inverse dynamics, and policy modes; image/video outputs are rendered in Forge and action outputs return JSON.

Fastest verified: B300 in uk-south1
Performance: 528 ms

General

coldOpenAI SDK

Qwen2.5-VL 7B Instruct

Qwen Team 2025

Qwen2.5-VL 7B Instruct is an Apache-2.0 vision-language chat model for OCR, document and chart understanding, visual question answering, and structured extraction. This active non-default Forge testing profile uses mirrored regional copies of the official vLLM OpenAI-compatible server image with first-run Hugging Face and vLLM caches directed to the shared /opt/nim/.cache hostPath backed by /mnt/data/forge-weights. The served profile caps context at 32k tokens and one image per prompt, leaves the CUDA 13 image default VLLM_ENABLE_CUDA_COMPATIBILITY=0 path in place, allowlists raw.githubusercontent.com for the default image request, and disables media URL redirects. B200/us-central1, B300/uk-south1, H200/eu-north2, and L40S/eu-north1 each have 3 warmups plus 10 measured image-chat requests; a retained B300 diagnostic validated FlashInfer/FlashAttention dispatch; and a B200 policy fallback validated the hardened no-redirect media path. RTX6000 stays disabled because current live inventory returned no matching cell. Upstream supports a 128k position window and video inputs, which should be validated in separate variants before being advertised.

Fastest verified: B300 in uk-south1
Performance: 432 ms

Healthcare / Life Science

coldOpenAI SDK

BioMistral 7B

Labrak et al. 2024

BioMistral 7B is an Apache-2.0 biomedical language model further pre-trained from Mistral 7B on PubMed Central Open Access text. This Forge onboarding manifest uses mirrored regional copies of the public vLLM OpenAI-compatible image and aligns Forge's probe request model with the vLLM served model name. A five-cell Forge matrix probe on 2026-05-22 validated B200, B300, H200, L40S, and RTX6000 support with 10 warm benchmark requests per cell. The default playground prompt is framed for nonclinical biomedical literature triage. The model must stay non-routable until healthcare product safety review approves a nonclinical user surface. It must not be used for medical advice or clinical decision-making.

Fastest verified: B300 in uk-south1
Performance: 537 ms

Physical AI

coldNative inference

NVIDIA Cosmos 3 Reasoner (Nano)

NVIDIA 2026

Hidden staging manifest for the early-access Cosmos 3 Nano Reasoner VLM. It exposes text, image, and video reasoning inputs while the Forge-owned Qwen3-VL runtime container and benchmark matrix are completed.

Fastest verified: B200 in us-central1
Performance: 10.9s

General

coldOpenAI SDK

Mistral Small 3.2 24B Instruct 2506

Mistral AI 2025

Mistral Small 3.2 24B Instruct 2506 packaged as NVIDIA's VLM NIM. It provides Apache-2.0 text and image chat with improved instruction following, lower repetition, and more robust function-calling behavior than Mistral Small 3.1. This onboarding entry has verified regional image mirrors plus focused H200/eu-north2 and B200/us-central1 probes. B300/uk-south1 currently fails during NIM 1.3.1 Triton startup on SM103. RTX6000/us-central1 now has terminal NIM 1.3.1 evidence: the 2026-05-28T02Z reprobe selected vllm-bf16-tp1-pp1 on an RTX PRO 6000 Blackwell GPU with compute capability 12.0, chose Flash Attention, loaded the safetensors shard, then raised RuntimeError: CUDA error: no kernel image is available for execution on the device while /v1/health/ready remained 503. L40S/eu-north1 selects the same vllm-bf16-tp1-pp1 profile and fails before readiness with CUDA OOM on the single 48 GB-class GPU; the 2026-05-25T21Z L40S reprobe persisted raw status other_error, but 2026-05-25T23Z supervisor classification normalizes that kept-pod log evidence to terminal OOM for routing and publication gating. Keep L40S and RTX6000 unsupported unless a lower-memory, quantized, reduced-context, multi-GPU, or newer NIM/vLLM profile is validated.

Fastest verified: B200 in us-central1
Performance: 546 ms

General

coldNative inference

Stable Video Diffusion XT

Stable Video Diffusion XT image-to-video generation through the Forge Diffusers media wrapper. It is an older but still widely referenced image-to-video baseline and gives Forge a lightweight SVD comparison point.

Fastest verified: B200 in us-central1
Performance: 22.0s

General

coldNative inference

Stable Diffusion XL Base 1.0

Stable Diffusion XL Base 1.0 text-to-image generation through the Forge Diffusers media wrapper. SDXL remains a common baseline for image generation comparisons and runs comfortably on one GPU with offload.

Fastest verified: RTX6000 in us-central1
Performance: 8.6s

General

coldNative inference

Kandinsky 5.0 T2I Lite SFT

Kandinsky 5.0 Image Lite SFT text-to-image generation through the Forge Diffusers wrapper. Diffusers documents it as a 6B lightweight image generation model.

Fastest verified: B200 in us-central1
Performance: 40.9s

General

coldNative inference

SkyReels-V2 DF 1.3B 540p

SkyReels-V2 diffusion-forcing text-to-video generation through the Forge Diffusers wrapper. The 1.3B 540p checkpoint is the practical single-GPU entry point for the larger SkyReels family.

Fastest verified: B200 in us-central1
Performance: 2m 1s

General

coldNative inference

HunyuanVideo 1.5 480p T2V

HunyuanVideo 1.5 480p text-to-video generation through the Forge Diffusers wrapper. The 8.3B model is documented as an efficient open video model with memory/offload examples.

Fastest verified: H200 in eu-north2
Performance: 1m 59s

General

coldNative inference

Mochi 1 Preview

Mochi 1 Preview text-to-video generation through the Forge Diffusers wrapper. Diffusers documents a bf16 single-GPU path around the 24GB VRAM class, so it is a practical broad-coverage video candidate.

Fastest verified: B200 in us-central1
Performance: 48.6s

General

coldNative inference

HiDream-I1 Fast

HiDream-I1 Fast text-to-image generation through the Forge Diffusers wrapper. It is a high-impact 17B open image model candidate; probes will determine whether the gated text encoder and memory profile are practical on one Forge GPU.

Fastest verified: B200 in us-central1
Performance: 52.6s

General

coldNative inference

PixArt-Sigma XL 2 1024 MS

PixArt-Sigma XL 1024 text-to-image generation through the Forge Diffusers wrapper. It is included as a low-risk open image baseline with documented low-VRAM inference paths.

Fastest verified: B200 in us-central1
Performance: 12.3s

General

coldNative inference

Z-Image Turbo

Z-Image Turbo text-to-image generation through the Forge Diffusers wrapper. Diffusers documents it as a 6B image model that fits in 16GB VRAM, making it a strong single-GPU candidate.

Fastest verified: RTX6000 in us-central1
Performance: 17.7s

General

coldNative inference

Wan2.2 TI2V 5B

Wan2.2 TI2V 5B text-to-video generation through the reusable Forge Diffusers media wrapper. The official model card documents single-GPU operation with offload on a 24GB-class GPU and 80GB-class faster operation without offload.

Fastest verified: B200 in us-central1
Performance: 27.9s

General

coldNative inference

CogVideoX 2B

CogVideoX 2B text-to-video generation through the reusable Forge Diffusers media wrapper. It is the first custom open video wrapper candidate because the model card documents single-GPU Diffusers deployment and low-VRAM offload paths.

Fastest verified: B200 in us-central1
Performance: 25.0s

General

coldNative inference

SANA Sprint 1.6B

SANA Sprint 1.6B text-to-image generation through a reusable Diffusers wrapper. It is included as a lightweight single-GPU open image model complementing the larger NIM-backed FLUX and Qwen entries.

Fastest verified: RTX6000 in us-central1
Performance: 5.5s

General

coldNative inference

Stable Diffusion 3.5 Large

Stability AI Stable Diffusion 3.5 Large Visual GenAI NIM for high-quality text-to-image generation on a single GPU.

Performance: Benchmark pending

General

coldNative inference

Qwen-Image

Qwen-Image Visual GenAI NIM for high-quality multilingual text-to-image generation, with the container default pinned to the qwen-image-2512 model version.

Fastest verified: B200 in us-central1
Performance: 7.5s

General

startingNative inference

Qwen-Image-Edit

Qwen-Image-Edit Visual GenAI NIM for prompt-driven image editing on a single 80GB GPU.

Fastest verified: B200 in us-central1
Performance: 3.4s

General

startingNative inference

FLUX.2 Klein 4B

Compact FLUX.2 Klein 4B Visual GenAI NIM for efficient text-to-image and image-editing workloads. The NIM is optimized around a hard 1-4 step generation window; Forge exposes the valid prompt, aspect ratio, seed, and step controls for this runtime.

Fastest verified: B200 in us-central1
Performance: 3.3s

General

coldNative inference

FLUX.1 Schnell

Distilled FLUX.1 Schnell image generation NIM for fast single-GPU text-to-image experiments.

Performance: Benchmark pending

General

coldNative inference

FLUX.1 Kontext Dev

FLUX.1 Kontext Dev NIM for prompt-driven in-context image editing on a single GPU.

Performance: Benchmark pending

General

coldNative inference

FLUX.1 Dev

Black Forest Labs FLUX.1 Dev image generation packaged as NVIDIA Visual GenAI NIM. Supports text-to-image and guided image variants; this Forge entry starts with the single-GPU base text-to-image path.

Performance: Benchmark pending

General

coldNative inference

LTX-Video 0.9

Onboarding entry for LTX-Video 0.9, a fast text/image-to-video candidate useful as a low-latency preview tier for synthetic-video workflows.

Fastest verified: B200 in us-central1
Performance: 13.4s

Earth Observation

coldNative inference

SatlasPretrain Aerial SwinB

SatlasPretrain high-resolution aerial imagery backbone for 0.5-2 m/pixel RGB inputs. Use it as the Forge base for building/road/water/agriculture/canopy fine-tuned label heads; the initial wrapper does not claim calibrated labels without EO_FINETUNED_CHECKPOINT.

Fastest verified: H200 in eu-north2
Performance: 1.4s

Healthcare / Life Science

coldNative inference

AntiFold

Open-source AntiFold onboarding manifest for antibody and nanobody inverse folding. It covers the Tamarind antibody/nanobody parameters for chains, antigen context, CDR/region selection, sampling temperature, batch count, and optional verification.

Fastest verified: H200 in eu-north2
Performance: 5.3s

Healthcare / Life Science

coldNative inference

ProteinMPNN Suite

Dauparas et al. 2022

ProteinMPNN inverse-folding and sequence-design suite with design-region, model-type, amino-acid bias, omit, temperature, and verification controls. Added because no current NVIDIA NIM gives the same low-latency backbone-to-sequence design workflow.

Fastest verified: H200 in eu-north2
Performance: 2.9s

Healthcare / Life Science

coldNative inference

RFdiffusion NIM

Watson et al. 2023

RFdiffusion NIM is NVIDIA's packaged IPD/RosettaCommons diffusion model for de novo protein backbone generation, binder design, and motif scaffolding from PDB targets plus contig specifications. Current NVIDIA docs identify RFdiffusion NIM 2.3.0, the /biology/ipd/rfdiffusion/generate endpoint, /v1/health/ready readiness checks, single-GPU execution, a 30 GB image-declared GPU-memory requirement, and tested H100, A100, L40S, A10G, and GB200 configurations. The 2026-05-26T09Z hidden Forge probe validated the fixed PDB-backed default request body on L40S with 3 warmups, 10 measured requests, HTTP 200 PDB output, 21.32 GB VRAM, and 2.068 s median client-wall request time after a cold 19.5 minute image/runtime startup. The 2026-05-26T11Z H200 probe validated the same hidden request with 3 warmups, 10 measured requests, HTTP 200 PDB output, 22.32 GB VRAM, and 1.921 s median client-wall request time after 825.488 s startup. The 2026-05-26T12Z RTX6000 probe validated the same hidden request with 3 warmups, 10 measured requests, HTTP 200 PDB output, 21.86 GB VRAM, and 2.110 s median client-wall request time after 851.791 s startup. The 2026-05-26T15Z warmed B300 probe validated the same hidden request with 3 warmups, 10 measured requests, HTTP 200 PDB output, 22.51 GB VRAM, and 1.902 s median client-wall request time after a 732.539 s startup plus first-warmup TensorRT hydration. The 2026-05-26T10Z B200 probe scheduled on healthy capacity and pulled the 17.58 GB pinned US mirror in 615.428 s, but NIM selected the B200 SM100 profile and failed before readiness when TensorRT rejected a cached simulator.extra_block.0.nonse3.trt plan with a platform tag mismatch. B200 remains disabled until successful rebuilt-cache evidence exists. This is a nonclinical protein-design research workflow, not clinical decision support or wet-lab safety validation.

Fastest verified: B300 in uk-south1
Performance: 1.9s

Healthcare / Life Science

coldNative inference

Boltz-2 NIM

Passaro et al. 2025

NVIDIA NIM packaging of MIT/Recursion Boltz-2 for all-atom biomolecular structure prediction, protein-ligand docking, and affinity-capable ligand predictions. The playground mirrors the Tamarind-style Boltz controls for polymers, ligands, constraints, samples, recycles, sampling steps, and step scale.

Fastest verified: H100 in eu-north1
Performance: 1.0s

Healthcare / Life Science

coldNative inference

OpenFold3 NIM

OpenFold3 Team 2026

OpenFold3 NIM is NVIDIA's packaged OpenFold3 biomolecular complex structure-prediction service. It accepts protein, DNA, RNA, ligand, MSA, and template-guided inputs and returns ranked PDB or CIF structures with confidence metrics. The public NVIDIA docs list NIM 1.4.0, the nvcr.io/nim/openfold/openfold3:latest container, the /biology/openfold/openfold3/predict endpoint, /v1/health/ready health checks, and a single-GPU support matrix that includes H200, B200, L40S, and RTX PRO 6000 Blackwell Workstation Edition. The 2026-05-24T21Z H200 reprobe confirmed Forge can start the stock image from writable /tmp with OUTPUT_DIR=/tmp/openfold-output, reuse cached H200 artifacts, reach readiness, and complete warm PDB predictions with a 10.168 s median client-wall request time. The 2026-05-24T22Z RTX6000 reprobe completed the same 3 warmups plus 10 measured PDB predictions with an 11.728 s median client-wall request time and 2.99 GB VRAM. The 2026-05-25T00Z L40S reprobe completed 3 warmups plus 10 measured PDB predictions with a 13.208 s median client-wall request time and 2.86 GB VRAM. The 2026-05-25T21Z B200 reprobe completed 3 warmups plus 10 measured PDB predictions with a 9.196 s median client-wall request time and 3.03 GB VRAM, so H200, L40S, RTX6000, and B200 are now supported by live Forge evidence. The 2026-05-25T23Z B300 diagnostic showed the stock cueq fallback aborts on SM103, while a manual torch_baseline fallback can serve PDB predictions on B300 with a 13.600 s median client-wall request time. The 2026-05-26T01Z standard probe confirmed that fallback as the separate hidden openfold3-nim-b300-baseline variant, so this stock manifest intentionally keeps gpu_compatibility.B300=false and remains the canonical row for H200, B200, L40S, and RTX6000. The workflow is for nonclinical biomolecular research and computer-aided discovery support, not clinical decision-making or patient-specific guidance.

Fastest verified: B200 in us-central1
Performance: 9.2s

Healthcare / Life Science

coldNative inference

NVIDIA VISTA-3D

NVIDIA VISTA-3D is a medical-imaging NIM for segmenting and annotating human anatomies from 3D CT images provided as NIfTI or NRRD URLs. The service exposes /v1/vista3d/inference and returns a ZIP archive containing segmentation output files. This Forge manifest stages a corrected, digest-pinned onboarding row for nonclinical research and annotation workflow evaluation only. It is not default-eligible because the model is governed by NVIDIA's software and model evaluation license, fresh 10-run matrix evidence is incomplete, and current live Blackwell/RTX6000 evidence shows TensorRT engine failures.

Fastest verified: H200 in eu-north2
Performance: 2.7s

Healthcare / Life Science

coldNative inference

NVIDIA MAISI

NVIDIA MAISI NIM for generating synthetic 3D CT volumes and paired segmentation labels from requested body regions and anatomy lists.

Fastest verified: H200 in eu-north2
Performance: 12.3s

Healthcare / Life Science

coldNative inference

OpenFold2

OpenFold2 NIM for protein structure prediction from amino-acid sequences with optional MSA/template support in the upstream API.

Fastest verified: H200 in eu-north2
Performance: 4.6s

Healthcare / Life Science

coldNative inference

MIT DiffDock

DiffDock NIM for molecular docking pose generation from protein structures and ligand files, returning predicted docking poses with confidence scores.

Fastest verified: H200 in eu-north2
Performance: 2.3s

Healthcare / Life Science

coldNative inference

ColabFold MSA Search

ColabFold MSA Search NIM for fast GPU-accelerated multiple sequence alignment search from biological protein sequences.

Fastest verified: B300 in uk-south1
Performance: 249 ms

General

coldOpenAI SDK

Nvidia Nemotron Nano 12b V2 VL

Nvidia Nemotron Nano 12b V2 VL NIM for vision-language chat; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave7.

Fastest verified: H200 in eu-north2
Performance: 452 ms

General

coldOpenAI SDK

Nvidia Llama 3 1 Nemotron Nano VL 8b V1

Nvidia Llama 3 1 Nemotron Nano VL 8b V1 NIM for vision-language chat; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave7.

Fastest verified: L40S in eu-north1
Performance: 333 ms

General

coldNative inference

Nvidia Llama Nemotron Rerank VL 1b V2

NVIDIA Llama Nemotron Rerank VL 1B v2 NIM for multimodal retrieval reranking; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave7.

Fastest verified: L40S in eu-north1
Performance: 86 ms

General

coldOpenAI SDK

Nvidia Nvclip

NVIDIA NVCLIP NIM for multimodal text/image embeddings; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave7.

Fastest verified: L40S in eu-north1
Performance: 269 ms

General

coldNative inference

Hive Deepfake Image Detection

Hive deepfake image detection NIM for classifying manipulated face images; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave6.

Performance: Benchmark pending

General

coldNative inference

Hive AI-Generated Image Detection

Hive AI-generated image detection NIM for classifying synthetic images; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave6.

Performance: Benchmark pending

General

coldOpenAI SDK

NVIDIA Nemotron Parse

NVIDIA Nemotron Parse NIM for parsing document images into structured text; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave6.

Fastest verified: L40S in eu-north1
Performance: 636 ms

General

coldOpenAI SDK

NVIDIA NeMo Retriever Parse

NVIDIA NeMo Retriever Parse NIM for document parsing and OCR-style extraction; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave6.

Fastest verified: H200 in eu-north2
Performance: 1.9s

General

coldOpenAI SDK

NVIDIA GLiNER PII

NVIDIA GLiNER PII NIM for extracting personally identifiable information spans from text; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave5.

Fastest verified: L40S in eu-north1
Performance: 14 ms

General

coldNative inference

NVIDIA NemoGuard Jailbreak Detect

NVIDIA NemoGuard JailbreakDetect NIM for classifying jailbreak/prompt-injection attempts; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave5.

Fastest verified: B200 in us-central1
Performance: 184 ms

General

coldNative inference

NVIDIA Llama 3.2 NV-RerankQA 1B v2

1B NVIDIA retrieval reranking NIM for question-answer ranking; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave5.

Fastest verified: L40S in eu-north1
Performance: 35 ms

General

coldOpenAI SDK

NVIDIA Llama 3.2 NV-EmbedQA 1B v2

1B NVIDIA retrieval embedding NIM for question-answer retrieval; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave5.

Fastest verified: L40S in eu-north1
Performance: 44 ms

General

coldNative inference

NVIDIA Llama Nemotron Rerank 1B v2

NVIDIA Llama Nemotron Rerank 1B v2 is a multilingual retrieval reranker that scores query/passage relevance for RAG pipelines. The NVIDIA model card describes a 1B-parameter transformer reranker with 8192-token optimized profiles and NVIDIA Open Model License plus Llama 3.2 Community License terms. The upstream NIM 1.10.0 image is usable but writes startup files under /opt/nim/etc and /opt/nim/tmp, so Forge wraps it with a minimal derivative image that redirects those paths to /tmp while keeping the NIM runtime and /v1/ranking API unchanged. The wrapper digest sha256:a30bb55e8e8133dfd91798a1f89853d7666839f785df15ddfc39675ea770eb12 is mirrored into all four Forge regional registries. Live Forge probes support B200/us-central1, H200/us-central1, H200/eu-north2, L40S/eu-north1, and RTX6000/us-central1 with 10 warm measured ranking requests per successful cell. B300/uk-south1 starts the NIM server but repeated /v1/ranking probes through 2026-05-27T21Z return HTTP 500 from ONNX Runtime with cudaErrorSymbolNotFound. A 2026-05-27T22Z authenticated mirror retry confirmed NVIDIA's documented 1.11.0 tag is still unavailable, so B300 stays disabled and the manifest remains publication-blocked until a real SM103-capable runtime is mirrored and benchmarked.

Fastest verified: L40S in eu-north1
Performance: 97 ms

General

coldOpenAI SDK

NVIDIA Llama Nemotron Embed 1B v2

1B NVIDIA NeMo Retriever text embedding NIM for semantic retrieval/RAG; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave4.

Fastest verified: L40S in eu-north1
Performance: 50 ms

Physical AI

coldOpenAI SDK

NVIDIA Cosmos Embed1

NVIDIA 2025

Joint video-text embedding NIM for retrieving, clustering, deduplicating, and curating autonomous-driving clips before and after synthetic weather generation.

Fastest verified: B300 in uk-south1
Performance: 79 ms

Physical AI

coldNative inference

NVIDIA Cosmos Predict1 7B Text2World

NVIDIA 2025

Text-conditioned world-generation NIM for creating fresh autonomous-driving video clips when no seed video exists yet.

Fastest verified: H200 in eu-north2
Performance: 4m 48s

General

coldOpenAI SDK

NVIDIA Llama 3.2 NeMo Retriever 300M Embed v1

Earlier 300M multilingual text embedding NVIDIA NIM for retrieval version comparison; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave4.

Fastest verified: L40S in eu-north1
Performance: 35 ms

Physical AI

coldNative inference

NVIDIA Cosmos Predict1 7B Video2World

NVIDIA 2025

Video-conditioned world-generation NIM for extending seed clips into new physically plausible autonomous-driving scenario variations.

Performance: Benchmark pending

General

coldOpenAI SDK

NVIDIA Llama 3.2 NeMo Retriever 300M Embed v2

Multilingual, cross-lingual text embedding NVIDIA NIM for long-document QA retrieval; mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave4.

Fastest verified: L40S in eu-north1
Performance: 74 ms

Physical AI

coldNative inference

NVIDIA Cosmos Transfer 2.5 2B

NVIDIA 2025

Primary autonomous-driving synthetic-data model for changing weather, lighting, and visual conditions in an existing video while preserving scene geometry through edge, segmentation, depth, and visual controls.

Fastest verified: B200 in us-central1
Performance: 1m 5s

Physical AI

coldNative inference

NVIDIA Cosmos Policy ALOHA Predict2 2B (CUDA 13, B300)

NVIDIA 2025

B300-oriented Cosmos Policy ALOHA serving build. The image keeps the upstream 50-step ALOHA policy horizon and 10 action denoising steps, patches SM103 NATTEN admission, and keeps SM103 on the CUDA 13 cuDNN/Flash/Efficient SDPA path.

Fastest verified: B200 in us-central1
Performance: 578 ms

Healthcare / Life Science

coldNative inference

GenMol

NVIDIA GenMol NIM for fragment-based molecular generation with SMILES or SAFE input. Forge keeps the image small and uses the shared /mnt/data-backed NIM cache for hydrated artifacts.

Fastest verified: H200 in eu-north2
Performance: 1.2s

Healthcare / Life Science

coldNative inference

AlphaFold2

DeepMind AlphaFold2 NVIDIA NIM for protein structure prediction from amino acid sequence. This manifest remains onboarding-only until the pod-start failure observed on H200/eu-north2 is diagnosed and a successful Forge probe is recorded.

Performance: Benchmark pending

General

coldOpenAI SDK

NVIDIA Nemotron Nano 9B v2

NVIDIA Nemotron Nano 9B v2 packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

Fastest verified: B200 in us-central1
Performance: 2.2s

General

coldOpenAI SDK

NVIDIA Llama 3.1 NemoGuard 8B Topic Control

NVIDIA Llama 3.1 NemoGuard 8B Topic Control packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

Fastest verified: B200 in us-central1
Performance: 679 ms

General

coldOpenAI SDK

NVIDIA Llama 3.1 NemoGuard 8B Content Safety

NVIDIA Llama 3.1 NemoGuard 8B Content Safety packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

Fastest verified: H200 in eu-north2
Performance: 489 ms

General

coldOpenAI SDK

BigCode StarCoder2 7B

BigCode StarCoder2 7B packaged as an NVIDIA NIM and mirrored into Forge regional registries as part of all-accessible NVIDIA NIM wave2.

Fastest verified: B300 in uk-south1
Performance: 1.1s

General

coldOpenAI SDK

NVIDIA Llama 3.1 Nemotron Nano 8B v1

NVIDIA Llama 3.1 Nemotron Nano 8B v1 packaged as an NVIDIA NIM and mirrored into Forge regional registries.

Fastest verified: H200 in eu-north2
Performance: 701 ms

General

coldOpenAI SDK

Mistral 7B Instruct v0.3

Mistral 7B Instruct v0.3 packaged as an NVIDIA NIM and mirrored into Forge regional registries.

Fastest verified: H200 in eu-north2
Performance: 1.7s

General

coldOpenAI SDK

Microsoft Phi-4 Mini Instruct

Microsoft Phi-4 Mini Instruct is a public MIT-licensed 3.8B-parameter small language model for instruction following, multilingual chat, code-oriented prompts, and tool/function-calling formats. Microsoft and Hugging Face metadata identify the artifact as text-generation, safetensors, and MIT licensed, while NVIDIA NIM documents the self-hosted nvcr.io/nim/microsoft/phi-4-mini-instruct:latest container and OpenAI-compatible /v1/chat/completions usage. This Forge onboarding entry reuses the already mirrored NVIDIA NIM image digest sha256:e5ea112a599102ddd0dba60aea603a5e631ed16d2b7e4c467fb4c33c3e30ff7d in all four Forge regional registries, routes the NIM cache to the /opt/nim/.cache hostPath backed by /mnt/data/forge-weights, and has live 10-run warm probes supporting B200, H200, L40S, and RTX6000. B300/uk-south1 remains disabled after 2026-05-23T11Z triage showed the pinned NIM 1.12.0 image is not SM103-ready: default startup fails on PyTorch/TensorRT-LLM arch detection for 10.3+PTX, and a TORCH_CUDA_ARCH_LIST fallback progresses to model load but aborts in Triton codegen for sm_103.

Fastest verified: B200 in us-central1
Performance: 915 ms

General

coldOpenAI SDK

Microsoft Phi-3 Mini 4K Instruct

Microsoft Phi-3 Mini 4K Instruct packaged as an NVIDIA NIM and mirrored into Forge regional registries.

Fastest verified: H200 in eu-north2
Performance: 2.4s

Healthcare / Life Science

coldNative inference

BioMedLM 2.7B

Stanford CRFM 2022

Stanford CRFM BioMedLM 2.7B biomedical text generation wrapper. Not for medical advice or clinical decision-making.

Fastest verified: B200 in us-central1
Performance: 996 ms

Healthcare / Life Science

coldNative inference

ESMFold

Meta ESMFold predicts protein structures directly from a single amino-acid sequence without requiring MSA database lookup. The Hugging Face model card identifies MIT licensing and Transformers support, and describes the model as ESM-2-backed end-to-end protein folding. This HCLS manifest promotes the existing customer-facing ESMFold workload from a legacy docs manifest into the HCLS onboarding manifest set with immutable regional image refs, /opt/nim/.cache-backed shared cache env, artifact revision evidence, and the cache-fixed five-cell corrected PDB export probe matrix. The 2026-06-03T02Z pass replaced the failed /mnt/data-cache CVE-safe image with a PyTorch 2.10 / Transformers 5.3 cache-fixed digest and proved B300 support; the 2026-06-03T03Z pass added B200, H200, L40S, and RTX6000 3-warmup/10-run no-persist benchmark evidence with non-null PDB previews and mean pLDDT responses; the 2026-06-03T04Z pass reran the same declared cells without --no-persist and upserted model_gpu_support as supported for B200, B300, H200, L40S, and RTX6000. This remains a nonclinical research structure-prediction workflow and must not be positioned as diagnostic, therapeutic, safety, efficacy, or patient-specific guidance.

Fastest verified: B300 in uk-south1
Performance: 578 ms

Healthcare / Life Science

coldNative inference

ESM-2 650M

Meta ESM-2 650M protein language model served as a Forge protein-sequence embedding endpoint.

Fastest verified: B200 in us-central1
Performance: 15 ms

General

coldOpenAI SDK

Qwen 2.5 7B Instruct

Qwen 2.5 7B instruction model packaged as an NVIDIA NIM, useful for multilingual and coding-adjacent chat workloads.

Fastest verified: H200 in eu-north2
Performance: 1.6s

General

coldOpenAI SDK

Meta Llama 3.1 8B Instruct

General-purpose Llama 3.1 8B chat model for instruction following, summarization, and lightweight agent workflows.

Fastest verified: B300 in uk-south1
Performance: 814 ms

General

coldOpenAI SDK

Meta Llama 3.2 3B Instruct

Compact Llama 3.2 instruction model with a stronger quality/latency tradeoff than the 1B variant.

Fastest verified: B200 in us-central1
Performance: 805 ms

General

coldOpenAI SDK

Meta Llama 3.2 1B Instruct

Small, low-latency Llama 3.2 chat model packaged as an NVIDIA NIM for fast instruction-following examples.

Fastest verified: H200 in eu-north2
Performance: 1.1s

General

coldOpenAI SDK

Meta Llama 3.1 70B Instruct

Flagship general-purpose chat model for instruction following and structured generation.

Fastest verified: B300 in uk-south1
Performance: 3.9s

Healthcare / Life Science

coldOpenAI SDK

NV EmbedQA E5

Domain-tuned embedding model suitable for retrieval, semantic search, and document ranking.

Fastest verified: H200 in eu-north2
Performance: 211 ms

Physical AI

coldNative inference

NVIDIA Cosmos Policy LIBERO Predict2 2B

NVIDIA 2025

Cosmos Policy LIBERO checkpoint wrapped as a Forge custom container. Uses a bundled sample observation by default and returns the predicted action chunk as JSON.

Fastest verified: B200 in us-central1
Performance: 340 ms

Physical AI

coldOpenAI SDK

NVIDIA Cosmos Reason 1

NVIDIA 2025

Deprecated but still self-hostable NVIDIA Physical AI reasoning VLM for robotics and world understanding.

Fastest verified: B200 in us-central1
Performance: 1.6s

Physical AI

coldOpenAI SDK

NVIDIA Cosmos Reason 2

NVIDIA 2025

Physical AI reasoning model for robotics and world understanding, packaged as a self-hosted NVIDIA NIM.

Fastest verified: B200 in us-central1
Performance: 637 ms