Fine-tune FLUX.1 Schnell as a LoRA from your own image bucket. Compare measured GPU performance, open a preloaded Serverless Job, choose your dataset and output buckets, and start training in your Nebius account.
Fastest measured GPU: B200.
GPU types with results or in progress.
Per-GPU fine-tune throughput and time-per-step, measured on Forge GPUs. Pick a target before you start; cells still being measured show as in progress.
| GPU | Region | Workload | Status | Throughput | Time / step | FLOP util. |
|---|---|---|---|---|---|---|
| B200 (fastest) | us-central1 | flux-lora | Measured | 1.51 img/s | 661 ms | 5.0% |
| B300 | uk-south1 | flux-lora | Measured | 1.46 img/s | 683 ms | 4.3% |
| H100 | eu-north1 | flux-lora | Measured |
Serverless training
Open Nebius with the training image, GPU preset, and command preloaded. Then choose your dataset bucket and output bucket in your account.
1. Upload data. Put images or records in your Object Storage bucket.
2. Open job. Use the preloaded Serverless Job form.
3. Start training. Select your dataset and output bucket, then run it.
License: Apache-2.0, but this is a distilled fast model with gated Hugging Face checkpoint access. Pass HF_TOKEN after accepting the model terms; prefer FLUX.2 klein 4B Base for most new LoRAs.
Image folder with .jpg/.jpeg/.png files and optional same-name .txt captions:

    s3://my-bucket/flux-subject-lora/
      customtoken_001.jpg
      customtoken_001.txt   # "photo of customtoken person, studio portrait, natural skin texture"
      customtoken_002.png
      customtoken_002.txt   # "customtoken person in everyday clothing, outdoor natural light"
      customtoken_003.jpeg

Captions are optional for image LoRA. If filenames start with a custom token, the training command can infer it automatically.
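
Before uploading, the folder layout described above can be checked locally. The following is a minimal sketch, not part of the Forge tooling: the function name `check_dataset` and the local folder path are hypothetical, and it only inspects top-level files (the training command on this page searches up to two levels deep).

```shell
#!/bin/sh
# Count images in a local dataset folder and report images that have no
# same-name .txt caption. Captions are optional, so this is informational.
# check_dataset is a hypothetical helper name used for illustration only.
check_dataset() {
  dir="$1"
  images=0
  missing=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    name="$(basename "$f")"
    # Compare the extension case-insensitively (jpg, jpeg, png).
    ext="$(printf '%s' "${name##*.}" | tr '[:upper:]' '[:lower:]')"
    case "$ext" in
      jpg|jpeg|png) ;;
      *) continue ;;
    esac
    images=$((images + 1))
    if [ ! -f "${f%.*}.txt" ]; then
      missing=$((missing + 1))
      echo "no caption: $name"
    fi
  done
  echo "images=$images missing_captions=$missing"
}

# Example: check_dataset ./flux-subject-lora
```

For the example layout above, this would report `customtoken_003.jpeg` as uncaptioned, which is acceptable for an image LoRA.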
WANDB_API_KEY (secret): Optional W&B API key. Store it in MysteryBox and pass it with --env-secret.
WANDB_PROJECT: Optional W&B project name for training progress and sample tracking.
WANDB_RUN_NAME: Optional W&B run name, e.g. flux-klein-subject-lora.
FORGE_TRAIN_TRIGGER_WORD: Optional FLUX trigger token override. When unset, the trainer infers it from image filenames such as customtoken_001.jpg.
HF_TOKEN (secret): Required for gated BFL checkpoints after you accept the model terms on Hugging Face.
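
The filename-based inference described for FORGE_TRAIN_TRIGGER_WORD can be previewed locally before a job is created. The sketch below mirrors the logic of the training command on this page (strip the extension, drop a trailing numeric suffix, lowercase and sanitize, fall back to `subject`). The function name `infer_trigger_word` is introduced here for illustration; the real command applies this to the first image found after sorting.

```shell
#!/bin/sh
# Lowercase the input, replace runs of characters outside [a-z0-9_-] with "-",
# and trim leading/trailing "-" (same pipeline as the training command).
sanitize_trigger_word() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9_-]+/-/g; s/^-+|-+$//g'
}

# Hypothetical helper: derive a trigger token from a single image filename.
infer_trigger_word() {
  stem="$(basename "$1")"
  stem="${stem%.*}"                                            # drop the extension
  stem="$(printf '%s' "$stem" | sed -E 's/([_-]?[0-9]+)$//')"  # drop a trailing _001 / -12 / 0001
  word="$(sanitize_trigger_word "$stem")"
  printf '%s\n' "${word:-subject}"                             # fall back to "subject"
}

infer_trigger_word "customtoken_001.jpg"   # prints: customtoken
infer_trigger_word "My Dog-12.PNG"         # prints: my-dog
infer_trigger_word "0001.png"              # prints: subject
```

If the inferred token is not the one you want, setting FORGE_TRAIN_TRIGGER_WORD explicitly overrides it.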
Serverless job URL
Ready: Nebius Jobs create link is generated with training image, GPU platform, preset, command, and dataset mount defaults.
Serverless endpoint URL
Verify after run: Endpoint create link preloads the serving image and output mount; after training, attach the produced adapter/checkpoint and run a health check plus a representative sample request.
Input data guidance
Ready: Dataset format, accepted input methods, and an example are present: Image folder with .jpg/.jpeg/.png files and optional same-name .txt captions.
Agent handoff
Ready: Agent steps cover job creation, monitoring, output verification, endpoint smoke test, and user-facing closeout.
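
The output verification step could be sketched as a check over a local copy (or mount) of the output bucket. This is an illustration under assumptions, not the agent's actual procedure: it assumes the trainer writes LoRA checkpoints as `.safetensors` files somewhere under the output path, and the function name `verify_outputs` is hypothetical.

```shell
#!/bin/sh
# Print every .safetensors file under a directory, or fail if there are none
# or if any of them is empty. verify_outputs is a hypothetical helper name.
verify_outputs() {
  dir="$1"
  list="$(find "$dir" -type f -name '*.safetensors' | sort)"
  if [ -z "$list" ]; then
    echo "no .safetensors files under $dir" >&2
    return 1
  fi
  # -size 0 matches zero-byte files, which indicate an interrupted write.
  empty="$(find "$dir" -type f -name '*.safetensors' -size 0 | sort)"
  if [ -n "$empty" ]; then
    echo "empty checkpoint files:" >&2
    printf '%s\n' "$empty" >&2
    return 1
  fi
  printf '%s\n' "$list"
}

# Example: verify_outputs ./outputs
```

A passing check only shows that files exist; the endpoint health check and sample request remain the real test of the adapter.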
# Runs in YOUR Nebius account, on YOUR data — you own the weights and
# you pay for the GPUs. Forge does not run this job; this just starts it.
# Uses Nebius AI Jobs CLI (`nebius ai job create`).
# Fill in these customer-owned values before running:
# FORGE_NEBIUS_PROJECT_ID: your Nebius project / parent ID.
# FORGE_TRAIN_PLATFORM/FORGE_TRAIN_PRESET: pick GPU resources available in your project.
# FORGE_TRAIN_DATASET_URI: point this at your image folder, e.g. s3://my-bucket/flux-subject-lora/.
# FORGE_TRAIN_OUTPUT_URI: bucket path where trained weights are written.
# Verify the command starts a user-data fine-tune, not a benchmark/probe.
# After completion: verify output artifacts, create the Serverless Endpoint,
# then run endpoint health and one representative sample request.
# Optional training environment:
# --env-secret WANDB_API_KEY=... Optional W&B API key. Store it in MysteryBox and pass it with --env-secret.
# --env WANDB_PROJECT=... Optional W&B project name for training progress and sample tracking.
# --env WANDB_RUN_NAME=... Optional W&B run name, e.g. flux-klein-subject-lora.
# --env FORGE_TRAIN_TRIGGER_WORD=... Optional FLUX trigger token override. When unset, the trainer infers it from image filenames such as customtoken_001.jpg.
# --env-secret HF_TOKEN=... Required for gated BFL checkpoints after you accept the model terms on Hugging Face.
export FORGE_NEBIUS_PROJECT_ID="YOUR_PROJECT_ID"
export FORGE_TRAIN_PLATFORM="YOUR_GPU_PLATFORM"
export FORGE_TRAIN_PRESET="YOUR_GPU_PRESET"
export FORGE_TRAIN_JOB_NAME="forge-fine-tune"
export FORGE_TRAIN_DATASET_URI="s3://my-bucket/flux-subject-lora/"
export FORGE_TRAIN_OUTPUT_URI="s3://my-bucket/outputs/"
FORGE_TRAIN_COMMAND='set -eu
mkdir -p /workspace/config /workspace/dataset /workspace/output
# Dataset source: '"$FORGE_TRAIN_DATASET_URI"' (mounted at /workspace/dataset by the Jobs CLI).
# Output destination: '"$FORGE_TRAIN_OUTPUT_URI"' (mounted at /workspace/output by the Jobs CLI).
DATASET_IMAGE_ROOT="/workspace/dataset"
if [ -d /workspace/dataset/target ]; then DATASET_IMAGE_ROOT="/workspace/dataset/target"; fi
sanitize_trigger_word() {
  printf '\''%s'\'' "$1" | tr '\''[:upper:]'\'' '\''[:lower:]'\'' | sed -E '\''s/[^a-z0-9_-]+/-/g; s/^-+|-+$//g'\''
}
TRIGGER_WORD="$(sanitize_trigger_word "${FORGE_TRAIN_TRIGGER_WORD:-}")"
if [ -z "$TRIGGER_WORD" ]; then
  FIRST_IMAGE="$(find "$DATASET_IMAGE_ROOT" -maxdepth 2 -type f \( -iname '\''*.jpg'\'' -o -iname '\''*.jpeg'\'' -o -iname '\''*.png'\'' \) | sort | head -n 1 || true)"
  if [ -n "$FIRST_IMAGE" ]; then
    STEM="$(basename "$FIRST_IMAGE")"
    STEM="${STEM%.*}"
    STEM="$(printf '\''%s'\'' "$STEM" | sed -E '\''s/([_-]?[0-9]+)$//'\'')"
    TRIGGER_WORD="$(sanitize_trigger_word "$STEM")"
  fi
fi
if [ -z "$TRIGGER_WORD" ]; then TRIGGER_WORD="subject"; fi
export FORGE_TRAIN_TRIGGER_WORD="$TRIGGER_WORD"
echo "Using FLUX trigger token: ${FORGE_TRAIN_TRIGGER_WORD}"
for candidate in /app/ai-toolkit /workspace/ai-toolkit /root/ai-toolkit /ai-toolkit /app /workspace; do
  if [ -f "$candidate/run.py" ]; then cd "$candidate"; break; fi
done
test -f run.py
cat > /workspace/config/forge-flux-lora.yaml <<YAML
---
job: extension
config:
  name: "forge_black_forest_labs_flux_1_schnell_lora"
  process:
    - type: '\''sd_trainer'\''
      training_folder: "/workspace/output"
      performance_log_every: 50
      device: cuda:0
      trigger_word: "${FORGE_TRAIN_TRIGGER_WORD}"
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16
        save_every: 250
        max_step_saves_to_keep: 4
        push_to_hub: false
      datasets:
        - folder_path: "/workspace/dataset"
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          shuffle_tokens: false
          cache_latents_to_disk: true
          resolution: [512, 768, 1024]
      train:
        batch_size: 1
        steps: 1600
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        lr: 1e-4
        ema_config:
          use_ema: true
          ema_decay: 0.99
        dtype: bf16
      model:
        name_or_path: "black-forest-labs/FLUX.1-schnell"
        is_flux: true
        quantize: true
        assistant_lora_path: "ostris/FLUX.1-schnell-training-adapter"
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 1024
        height: 1024
        prompts:
          - "photo of ${FORGE_TRAIN_TRIGGER_WORD} person, studio portrait, natural skin texture"
          - "${FORGE_TRAIN_TRIGGER_WORD} person in everyday clothing, outdoor natural light"
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 1
        sample_steps: 4
meta:
  name: "FLUX.1 schnell LoRA"
  version: '\''1.0'\''
YAML
python run.py /workspace/config/forge-flux-lora.yaml'
nebius ai job create \
--parent-id "$FORGE_NEBIUS_PROJECT_ID" \
--name "$FORGE_TRAIN_JOB_NAME" \
--platform "$FORGE_TRAIN_PLATFORM" \
--preset "$FORGE_TRAIN_PRESET" \
--image 'docker.io/ostris/aitoolkit@sha256:220d85e443589c6b52521c594a2d9f052d733afe360966d24bb8a5fe853745f7' \
--volume "$FORGE_TRAIN_DATASET_URI":/workspace/dataset:ro \
--volume "$FORGE_TRAIN_OUTPUT_URI":/workspace/output:rw \
--container-command "/bin/sh" \
--args "-lc \"$FORGE_TRAIN_COMMAND\""
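
Because the script above ships with YOUR_* placeholders, a small pre-flight check before `nebius ai job create` can catch values that were never filled in. This is a sketch under stated assumptions: the function name `preflight` is hypothetical, and it checks only the six variables exported above, not project quotas or bucket access.

```shell
#!/bin/sh
# Return non-zero if any required variable is unset, empty, or still a
# YOUR_* placeholder. preflight is a hypothetical helper name.
preflight() {
  status=0
  for name in FORGE_NEBIUS_PROJECT_ID FORGE_TRAIN_PLATFORM FORGE_TRAIN_PRESET \
              FORGE_TRAIN_JOB_NAME FORGE_TRAIN_DATASET_URI FORGE_TRAIN_OUTPUT_URI; do
    # Indirect lookup of the variable named by $name (names are fixed above).
    eval "value=\${$name:-}"
    case "$value" in
      ""|YOUR_*)
        echo "set $name before creating the job" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}

# Example: preflight && echo "ready to run nebius ai job create"
```

Running the check first keeps a misconfigured job from being created and billed in your account.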
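| GPU | Region | Workload | Status | Throughput | Time / step | FLOP util. |
|---|---|---|---|---|---|---|
| H100 | eu-north1 | flux-lora | Measured | 1.23 img/s | 815 ms | 9.1% |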
| H200 | eu-north1 | flux-lora | Measured | 1.30 img/s | 770 ms | 9.7% |
| H200 | eu-north2 | flux-lora | Measured | 1.29 img/s | 773 ms | 9.6% |
| H200 | us-central1 | flux-lora | Measured | 1.28 img/s | 782 ms | 9.5% |
| L40S | eu-north1 | flux-lora | Measured | 0.80 img/s | 1.25 s | 16.4% |
| RTX6000 | us-central1 | flux-lora | Failed | — | — | — |
Throughput is reported in the model's relevant unit: tokens/sec for text fine-tunes and images/sec for image LoRA runs. Time/step is the wall-clock time for one optimizer step. FLOP utilization compares the benchmark's estimated achieved TFLOP/s with the GPU's peak TFLOP/s for that precision mode. All values are measured on Forge GPUs for the listed workload (LoRA, SFT, FLUX LoRA, …). Each model is onboarded only after a real benchmark run; cells marked In progress are still being measured and show no number until verified. The "fastest" label marks the GPU with the highest measured throughput.
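
The two measured columns can be related with simple arithmetic. The table's numbers are consistent with throughput = batch size / time per step at `batch_size: 1` from the training config; that relation is an observation from the listed values, not a formula stated by the benchmark. The sketch below uses hypothetical helper names and estimates pure step time for the configured 1600 steps, excluding model loading, latent caching, sampling, and checkpoint saves.

```shell
#!/bin/sh
# Images per second from time per step in milliseconds (batch size defaults to 1).
throughput() {
  awk -v ms="$1" -v bs="${2:-1}" 'BEGIN { printf "%.2f\n", bs / (ms / 1000) }'
}

# Minutes spent in optimizer steps alone for a given step count.
run_minutes() {
  awk -v ms="$1" -v steps="$2" 'BEGIN { printf "%.1f\n", ms * steps / 60000 }'
}

throughput 661          # B200: prints 1.51
throughput 1250         # L40S: prints 0.80
run_minutes 661 1600    # B200, 1600 steps: prints 17.6
run_minutes 1250 1600   # L40S, 1600 steps: prints 33.3
```

Actual job duration will be longer than these figures because of the excluded phases, so treat them as a lower bound when comparing GPUs.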