Runbook · exact commands + overnight automation

Training runbook: the commands we run to get from "today's dataset" to "submission-ready weights." There are two paths: a manual one (you type each command and watch each phase) and an automated one (overnight_autotrainer.py runs everything unattended, backs up and promotes weights, and writes a status file by morning).

Manual · ~7.5 hr, you watch each phase · early days, first runs
Automated · python overnight_autotrainer.py nightly · unattended overnight
Backup · every run snapshots models/latest/ · one-command rollback
Promote · only if the benchmark improves by ≥ 0.5% · regression gate
Read Winning Playbook first. The why lives there: effort budget, data-pipeline moat, reliability math, sim-to-real. This doc is the how.

§ 01 · Manual path — exact commands

Use this path the first few times, or when tuning a single knob and watching output live. Every command is copy-pasteable.

0 · Precheck (30 s)

python submit_check.py
python -c "import torch; print(torch.cuda.get_device_name(0))"
nvidia-smi

Expect: 0 failures, "NVIDIA GeForce RTX 5080" (or your GPU), and nvidia-smi showing the card.
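If you want to see what the precheck is gating on, here is a minimal Python sketch of the same two checks. The function names are illustrative, the 30 GB threshold mirrors min_free_gb in overnight_config.yaml, and submit_check.py's real checks are broader than this:

```python
import shutil

MIN_FREE_GB = 30  # mirrors min_free_gb in overnight_config.yaml

def disk_ok(path=".", min_free_gb=MIN_FREE_GB):
    """Return (ok, free_gb) for the filesystem holding `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb, round(free_gb, 1)

def cuda_ok():
    """True only if torch is installed and sees at least one CUDA device."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print(disk_ok(), cuda_ok())
```

If either prints False here, fix the environment before burning a training night on it.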

1 · Phase 1 — Detector (YOLO11n) · ~2 hr

python train_apex.py detector \
  --dataset dataset_gates_mega \
  --epochs 200

Output: models/apex_detector_best.pt, models/apex_detector_best.onnx.

2 · Phase 2 — Keypoints (YOLO11n-pose) · ~1.5 hr

python train_apex.py keypoints \
  --dataset dataset_gates_mega \
  --epochs 150

Output: models/apex_keypoints_best.pt, models/apex_keypoints_best.onnx.

3 · Phase 3 — Policy (submission-ready obs) · ~4 hr

python train_apex.py policy \
  --steps 10000000 \
  --observation-mode detector_telemetry

Output: output/apex_policy/apex_policy_best.zip, output/apex_policy/apex_policy.onnx.

Never ship --observation-mode privileged weights. The privileged 24D obs uses NED gate bearings that don't exist in the real AIGP sim. It's for dev iteration only. detector_telemetry is what transfers.
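A cheap belt-and-braces guard before packaging: refuse to continue if the promoted weights record a privileged obs mode. This sketch assumes manifest.json carries an observation_mode field, which is not confirmed anywhere in this doc; adapt it to whatever metadata your trainer actually writes:

```python
import json
from pathlib import Path

def assert_submittable(manifest_path):
    """Abort packaging if the manifest records a privileged obs mode.

    NOTE: the `observation_mode` field is an assumption, not a documented
    part of manifest.json.
    """
    meta = json.loads(Path(manifest_path).read_text())
    mode = meta.get("observation_mode", "unknown")
    if mode != "detector_telemetry":
        raise RuntimeError(f"refusing to package: observation_mode={mode}")
    return mode
```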

4 · Export ONNX + package · ~1 min

python train_apex.py export

5 · Benchmark against previous best · ~2 min

python benchmark_models.py

Appends to benchmark_results.json. If the new detector regresses, don't promote.

6 · Package for submission (when ready)

python submit_check.py
python submit_check.py package

Output: submission.zip with code + weights.

§ 02 · Automated path — overnight_autotrainer.py

One command runs everything above, backs up current weights, benchmarks, and promotes if improved.

# Simplest form (uses overnight_config.yaml)
python overnight_autotrainer.py nightly

Subcommands

Command · What it does
nightly · Full pipeline: precheck → backup → train 1 → train 2 → train 3 → export → benchmark → promote-or-hold. Default.
precheck · Environment sanity only. Safe to run anytime.
once --phase detector · Run one phase. Useful for debugging one knob change.
once --phase policy --observation-mode privileged · Dev-only legacy obs for A/B comparisons.
bench · Benchmark only, no training.
status · Print the last run's status.json.
promote --run <RUN_ID> · Manually promote a specific run's weights.
rollback · Restore the most recent backup into models/latest/.

What each run produces

output/overnight_runs/
  2026-04-22_00-00-12/
    status.json              # machine-readable summary
    summary.md               # human-readable pass/fail table
    1_detector.log           # full training log (streamed)
    2_keypoints.log
    3_policy_detector-telemetry.log
    4_export.log
    5_benchmark.log

models/
  latest/                    # promoted weights (used by submit_check)
    apex_detector_best.pt
    apex_detector_best.onnx
    apex_keypoints_best.pt
    apex_keypoints_best.onnx
    apex_policy_best.zip
    apex_policy.onnx
    manifest.json            # which run promoted these
  backup/
    2026-04-22_00-00-12/     # pre-run snapshot (rollback target)

Example status.json

{
  "run_id": "2026-04-22_00-00-12",
  "overall": "DONE",
  "finished_at": "2026-04-22T07:42:18",
  "total_duration_s": 27453,
  "phases": [
    {"name": "1_detector",  "success": true, "duration_s": 7180, ...},
    {"name": "2_keypoints", "success": true, "duration_s": 5412, ...},
    {"name": "3_policy_detector-telemetry", "success": true, "duration_s": 14322, ...},
    {"name": "4_export",    "success": true, "duration_s": 31, ...},
    {"name": "5_benchmark", "success": true, "duration_s": 108, ...}
  ],
  "promotion": {
    "promoted": true,
    "decision": "PROMOTED",
    "reason": "improvement +0.0170 ≥ +0.0050",
    "prev_metric": 0.951,
    "new_metric": 0.968,
    "files": ["apex_detector_best.pt", "apex_detector_best.onnx", ...]
  }
}
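A morning script or CI step can consume status.json directly. A minimal sketch using only the fields shown above (loading an inline string here instead of the real file):

```python
import json

# Stand-in for reading output/overnight_runs/<RUN_ID>/status.json
raw = """{
  "overall": "DONE",
  "phases": [
    {"name": "1_detector", "success": true},
    {"name": "5_benchmark", "success": true}
  ],
  "promotion": {"promoted": true, "decision": "PROMOTED"}
}"""

status = json.loads(raw)
all_green = status["overall"] == "DONE" and all(p["success"] for p in status["phases"])
print(all_green, status["promotion"]["decision"])  # → True PROMOTED
```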

§ 03 · Config knobs (overnight_config.yaml)

min_free_gb: 30
require_clean_git: false

detector:
  dataset: dataset_gates_mega
  epochs: 200
  timeout_s: 14400
  min_output_mb: 5

keypoints:
  dataset: dataset_gates_mega
  epochs: 150
  timeout_s: 10800
  min_output_mb: 5

policy:
  steps: 10000000
  modes: [detector_telemetry]      # add "privileged" for A/B
  timeout_s: 21600
  min_output_mb: 0.5

benchmark:
  timeout_s: 600
  laps: 2

promotion:
  metric: detection_rate            # or mAP@50, total_time_s, gates_passed
  min_improvement: 0.005            # +0.5% absolute

Sweep discipline. Change one knob per overnight run, and compare the next morning's benchmark against the previous one. The promotion.min_improvement threshold prevents promoting noise.
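If you ever need to reason about what a partial overnight_config.yaml resolves to, the semantics are presumably "user config overlaid on built-in defaults." An illustrative sketch of that merge (the DEFAULTS values mirror the config above; the merge function itself is an assumption, not the trainer's actual code):

```python
DEFAULTS = {
    "detector": {"epochs": 200, "timeout_s": 14400, "min_output_mb": 5},
    "promotion": {"metric": "detection_rate", "min_improvement": 0.005},
}

def merge(defaults, overrides):
    """Recursively overlay user config on defaults (overrides win)."""
    out = dict(defaults)
    for k, v in overrides.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = merge(out[k], v)
        else:
            out[k] = v
    return out

# A one-knob sweep: only detector.epochs changes, everything else holds.
cfg = merge(DEFAULTS, {"detector": {"epochs": 100}})
print(cfg["detector"]["epochs"], cfg["promotion"]["min_improvement"])  # → 100 0.005
```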

§ 04 · Schedule it (run every night)

Windows Task Scheduler

PRIMARY
  1. Open Task Scheduler → Create Task
  2. Name: AIGP Overnight Autotrain
  3. Trigger: Daily · 23:00 · enabled
  4. Action: Start a program
    • Program: C:\Users\pc\aigp\Scripts\python.exe
    • Arguments: overnight_autotrainer.py nightly
    • Start in: C:\Users\pc\Downloads\grandprix-latest
  5. Conditions → uncheck "Start only on AC power"
  6. Settings → "Do not start new instance if already running"
  7. Save, test once: right-click task → Run

Linux cron

ALT
# crontab -e
0 23 * * * cd /home/pc/grandprix && \
  /home/pc/aigp/bin/python overnight_autotrainer.py \
  nightly >> logs/cron.log 2>&1

Make sure the cron job points at the venv's Python (not the system Python), and verify that cron's environment inherits CUDA_HOME and PATH correctly.

Morning check (phone-friendly)

python overnight_autotrainer.py status | jq -r '.overall, .promotion.decision'

Serve the run-dir over HTTP (any static server) if you want to view summary.md from a browser on a different machine.
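If jq isn't on the box, the same morning check fits in a few lines of Python. The run-directory layout matches §02; the function name is illustrative:

```python
import json
from pathlib import Path

def morning_check(run_root="output/overnight_runs"):
    """Return (overall, promotion decision) for the newest run directory.

    Run IDs are timestamps, so lexicographic sort == chronological sort.
    """
    newest = sorted(p for p in Path(run_root).iterdir() if p.is_dir())[-1]
    status = json.loads((newest / "status.json").read_text())
    return status["overall"], status.get("promotion", {}).get("decision", "HELD")
```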

§ 05 · What to run when (calendar)

Phase of the year · Nightly config · Notes
Pre-sim (now → May) · full nightly with existing dataset_gates_mega · baseline detector + policy; build the data pipeline
VQ1 sim open (May) · nightly + sim-frame dataset extension · fine-tune detector on captured VQ1 frames; run PPO with detector_telemetry
VQ1 → VQ2 gap (May–Jun) · nightly with growing dataset · every attempt logs frames; weekly retrain catches improvements
VQ2 open (Jun–Jul) · nightly + adversarial augmentation · lighting/obstacles, hard negatives, chaos attempts
VQ2 cutoff → Physical (Aug) · stop nightly, start real-flight capture · sim-to-real residual dynamics; DIY rig tuning
Between VQ2 cutoff and Physical, do not keep burning overnight runs on sim tuning. That's when real-flight data capture + residual-dynamics work wins the Physical slot. See playbook §04.

§ 06 · Promotion + rollback

Each nightly run either promotes the new weights to models/latest/ or holds, based on the benchmark delta. Promotion criteria:

  1. All 5 phases (detector, keypoints, policy, export, benchmark) succeed.
  2. The promotion.metric (default detection_rate) in this run's benchmark ≥ the previous best + min_improvement.

If either fails, the run is HELD — previous models/latest/ stays in place.
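The two gates above can be expressed as a small decision function. This is a sketch of the documented behaviour, not the actual code in overnight_autotrainer.py:

```python
def promote_or_hold(phases, prev_metric, new_metric, min_improvement=0.005):
    """Gate 1: every phase succeeded. Gate 2: metric improved by enough."""
    if not all(p["success"] for p in phases):
        return "HELD", "phase failure"
    delta = new_metric - prev_metric
    if delta >= min_improvement:
        return "PROMOTED", f"improvement +{delta:.4f} >= +{min_improvement:.4f}"
    return "HELD", f"improvement +{delta:.4f} < +{min_improvement:.4f}"

# Mirrors the example status.json: 0.951 -> 0.968 clears the 0.005 bar.
print(promote_or_hold([{"success": True}] * 5, 0.951, 0.968))
```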

Manual promote / rollback

# Promote a specific run
python overnight_autotrainer.py promote --run 2026-04-22_00-00-12

# Restore most-recent backup
python overnight_autotrainer.py rollback

# Restore a specific backup
python overnight_autotrainer.py rollback --to 2026-04-21_00-00-00
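Under the hood, rollback amounts to copying the newest snapshot from models/backup/ over models/latest/. A sketch of that logic (the real subcommand may do more, e.g. verify the manifest first):

```python
import shutil
from pathlib import Path

def rollback(backup_root="models/backup", target="models/latest"):
    """Replace `target` with the newest snapshot under `backup_root`.

    Snapshot names are timestamps, so the lexicographic max is the newest.
    """
    snapshots = sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
    if not snapshots:
        raise RuntimeError("no backups to restore")
    shutil.rmtree(target, ignore_errors=True)
    shutil.copytree(snapshots[-1], target)
    return snapshots[-1].name
```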

§ 07 · Troubleshooting

Symptom · Likely cause · Fix
Precheck fails: "CUDA not available" · GPU driver or wrong torch build · re-install torch with the cu128 index URL
Precheck fails: "Low disk" · overnight runs accumulate logs + backups · prune output/overnight_runs/ entries > 30 days old and models/backup/
Phase 3 policy hangs at frame 43 · VRAM thrash (16 GB ceiling) · drop n_envs in train_apex.py from 4 to 2
Phase times out without finishing · timeout too tight for dataset size · raise timeout_s in config
Promotion never fires · benchmark metric not actually improving, or threshold too tight · check benchmark_results.json history; lower min_improvement temporarily
Status JSON shows phase "success" but weights missing · validator min_output_mb too low (silent failure) · raise min_output_mb for that phase
Task Scheduler fires but nothing happens · "Start in" dir wrong or Python path wrong · test manually first: right-click → Run
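For the "Low disk" row, a small prune helper keeps output/overnight_runs/ under control (illustrative; prune_old_runs is not an existing autotrainer command):

```python
import shutil
import time
from pathlib import Path

def prune_old_runs(root, max_age_days=30):
    """Delete run directories whose mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for d in Path(root).iterdir():
        if d.is_dir() and d.stat().st_mtime < cutoff:
            shutil.rmtree(d)
            removed.append(d.name)
    return removed
```

Run it against output/overnight_runs/ and models/backup/ separately; never point it at models/latest/.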

§ 08 · Related

Winning Playbook · STRATEGY · the why: effort budget, reliability math, data pipeline moat, sim-to-real
Local GPU Training · HARDWARE · RTX 5080 install + venv + raw train_apex.py usage
APEX Pipeline · REFERENCE · per-phase details: models, obs schemas, reward terms
Tuning Reference · REF · every gain, threshold, and timeout in the runtime and the trainer

TRAINING-RUNBOOK · v1.0 2026-04-21 · ← Index · Playbook