Runbook · exact commands + overnight automation

Training runbook: the commands we run to get from "today's dataset" to "submission-ready weights." There are two paths: a manual one (you type each command and watch each phase) and an automated one (overnight_autotrainer.py runs everything unattended, backs up and promotes weights, and writes a status file by morning).

Manual · ~7.5 hr, you watch each phase · early days, first runs
Automated · python overnight_autotrainer.py nightly · unattended overnight
Backup · every run snapshots models/latest/ · one-command rollback
Promote · only if the benchmark improves by ≥ 0.5% · regression gate
Read Winning Playbook first. The why lives there: effort budget, data-pipeline moat, reliability math, sim-to-real. This doc is the how.

§ 01 · Manual path — exact commands

Use this path the first few times, or when tuning a single knob and watching output live. Every command is copy-pasteable.

0 · Precheck (30 s)

python submit_check.py
python -c "import torch; print(torch.cuda.get_device_name(0))"
nvidia-smi

Expect: 0 failures, "NVIDIA GeForce RTX 5080" (or your GPU), and nvidia-smi showing the card.
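If you want to see what the precheck is gating on, here is a minimal Python sketch of the same two checks. The function names are illustrative, the 30 GB threshold mirrors min_free_gb in overnight_config.yaml, and submit_check.py's real checks are broader than this:

```python
import shutil

MIN_FREE_GB = 30  # mirrors min_free_gb in overnight_config.yaml

def disk_ok(path=".", min_free_gb=MIN_FREE_GB):
    """Return (ok, free_gb) for the filesystem holding `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb, round(free_gb, 1)

def cuda_ok():
    """True only if torch is installed and sees at least one CUDA device."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print(disk_ok(), cuda_ok())
```

If either prints False here, fix the environment before burning a training night on it.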

1 · Phase 1 — Detector (YOLO11n) · ~2 hr

python train_apex.py detector \
  --dataset dataset_gates_mega \
  --epochs 200

Output: models/apex_detector_best.pt, models/apex_detector_best.onnx.

2 · Phase 2 — Keypoints (YOLO11n-pose) · ~1.5 hr

python train_apex.py keypoints \
  --dataset dataset_gates_mega \
  --epochs 150

Output: models/apex_keypoints_best.pt, models/apex_keypoints_best.onnx.

3 · Phase 3 — Policy (submission-ready obs) · ~4 hr

python train_apex.py policy \
  --steps 10000000 \
  --observation-mode detector_telemetry

Output: output/apex_policy/apex_policy_best.zip, output/apex_policy/apex_policy.onnx.

Never ship --observation-mode privileged weights. The privileged 24D obs uses NED gate bearings that don't exist in the real AIGP sim. It's for dev iteration only. detector_telemetry is what transfers.
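A cheap belt-and-braces guard before packaging: refuse to continue if the promoted weights record a privileged obs mode. This sketch assumes manifest.json carries an observation_mode field, which is not confirmed anywhere in this doc; adapt it to whatever metadata your trainer actually writes:

```python
import json
from pathlib import Path

def assert_submittable(manifest_path):
    """Abort packaging if the manifest records a privileged obs mode.

    NOTE: the `observation_mode` field is an assumption, not a documented
    part of manifest.json.
    """
    meta = json.loads(Path(manifest_path).read_text())
    mode = meta.get("observation_mode", "unknown")
    if mode != "detector_telemetry":
        raise RuntimeError(f"refusing to package: observation_mode={mode}")
    return mode
```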

4 · Export ONNX + package · ~1 min

python train_apex.py export

5 · Benchmark against previous best · ~2 min

python benchmark_models.py

Appends to benchmark_results.json. If the new detector regresses, don't promote.

6 · Package for submission (when ready)

python submit_check.py
python submit_check.py package

Output: submission.zip with code + weights.

§ 02 · Automated path — overnight_autotrainer.py

One command runs everything above, backs up current weights, benchmarks, and promotes if improved.

# Simplest form (uses overnight_config.yaml)
python overnight_autotrainer.py nightly

Subcommands

Command · What it does
nightly · Full pipeline: precheck → backup → train 1 → train 2 → train 3 → export → benchmark → promote-or-hold. Default.
precheck · Environment sanity only. Safe to run anytime.
once --phase detector · Run one phase. Useful for debugging one knob change.
once --phase policy --observation-mode privileged · Dev-only legacy obs for A/B comparisons.
bench · Benchmark only, no training.
status · Print the last run's status.json.
promote --run <RUN_ID> · Manually promote a specific run's weights.
rollback · Restore the most recent backup into models/latest/.

What each run produces

output/overnight_runs/
  2026-04-22_00-00-12/
    status.json              # machine-readable summary
    summary.md               # human-readable pass/fail table
    1_detector.log           # full training log (streamed)
    2_keypoints.log
    3_policy_detector-telemetry.log
    4_export.log
    5_benchmark.log

models/
  latest/                    # promoted weights (used by submit_check)
    apex_detector_best.pt
    apex_detector_best.onnx
    apex_keypoints_best.pt
    apex_keypoints_best.onnx
    apex_policy_best.zip
    apex_policy.onnx
    manifest.json            # which run promoted these
  backup/
    2026-04-22_00-00-12/     # pre-run snapshot (rollback target)

Example status.json

{
  "run_id": "2026-04-22_00-00-12",
  "overall": "DONE",
  "finished_at": "2026-04-22T07:42:18",
  "total_duration_s": 27453,
  "phases": [
    {"name": "1_detector",  "success": true, "duration_s": 7180, ...},
    {"name": "2_keypoints", "success": true, "duration_s": 5412, ...},
    {"name": "3_policy_detector-telemetry", "success": true, "duration_s": 14322, ...},
    {"name": "4_export",    "success": true, "duration_s": 31, ...},
    {"name": "5_benchmark", "success": true, "duration_s": 108, ...}
  ],
  "promotion": {
    "promoted": true,
    "decision": "PROMOTED",
    "reason": "improvement +0.0170 ≥ +0.0050",
    "prev_metric": 0.951,
    "new_metric": 0.968,
    "files": ["apex_detector_best.pt", "apex_detector_best.onnx", ...]
  }
}
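A morning script or CI step can consume status.json directly. A minimal sketch using only the fields shown above (loading an inline string here instead of the real file):

```python
import json

# Stand-in for reading output/overnight_runs/<RUN_ID>/status.json
raw = """{
  "overall": "DONE",
  "phases": [
    {"name": "1_detector", "success": true},
    {"name": "5_benchmark", "success": true}
  ],
  "promotion": {"promoted": true, "decision": "PROMOTED"}
}"""

status = json.loads(raw)
all_green = status["overall"] == "DONE" and all(p["success"] for p in status["phases"])
print(all_green, status["promotion"]["decision"])  # → True PROMOTED
```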

§ 03 · Config knobs (overnight_config.yaml)

min_free_gb: 30
require_clean_git: false

detector:
  dataset: dataset_gates_mega
  epochs: 200
  timeout_s: 14400
  min_output_mb: 5

keypoints:
  dataset: dataset_gates_mega
  epochs: 150
  timeout_s: 10800
  min_output_mb: 5

policy:
  steps: 10000000
  modes: [detector_telemetry]      # add "privileged" for A/B
  timeout_s: 21600
  min_output_mb: 0.5

benchmark:
  timeout_s: 600
  laps: 2

promotion:
  metric: detection_rate            # or mAP@50, total_time_s, gates_passed
  min_improvement: 0.005            # +0.5% absolute

Sweep discipline. Change one knob per overnight run, and compare the next morning's benchmark against the previous one. The promotion.min_improvement threshold prevents promoting noise.
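If you ever need to reason about what a partial overnight_config.yaml resolves to, the semantics are presumably "user config overlaid on built-in defaults." An illustrative sketch of that merge (the DEFAULTS values mirror the config above; the merge function itself is an assumption, not the trainer's actual code):

```python
DEFAULTS = {
    "detector": {"epochs": 200, "timeout_s": 14400, "min_output_mb": 5},
    "promotion": {"metric": "detection_rate", "min_improvement": 0.005},
}

def merge(defaults, overrides):
    """Recursively overlay user config on defaults (overrides win)."""
    out = dict(defaults)
    for k, v in overrides.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = merge(out[k], v)
        else:
            out[k] = v
    return out

# A one-knob sweep: only detector.epochs changes, everything else holds.
cfg = merge(DEFAULTS, {"detector": {"epochs": 100}})
print(cfg["detector"]["epochs"], cfg["promotion"]["min_improvement"])  # → 100 0.005
```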

§ 04 · Schedule it (run every night)

Windows Task Scheduler

PRIMARY
  1. Open Task Scheduler → Create Task
  2. Name: AIGP Overnight Autotrain
  3. Trigger: Daily · 23:00 · enabled
  4. Action: Start a program
    • Program: C:\Users\pc\aigp\Scripts\python.exe
    • Arguments: overnight_autotrainer.py nightly
    • Start in: C:\Users\pc\Downloads\grandprix-latest
  5. Conditions → uncheck "Start only on AC power"
  6. Settings → "Do not start new instance if already running"
  7. Save, test once: right-click task → Run

Linux cron

ALT
# crontab -e
0 23 * * * cd /home/pc/grandprix && \
  /home/pc/aigp/bin/python overnight_autotrainer.py \
  nightly >> logs/cron.log 2>&1

Make sure the cron job points at the venv's Python (not the system Python), and verify that cron's environment inherits CUDA_HOME and PATH correctly.

Morning check (phone-friendly)

python overnight_autotrainer.py status | jq -r '.overall, .promotion.decision'

Serve the run-dir over HTTP (any static server) if you want to view summary.md from a browser on a different machine.
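If jq isn't on the box, the same morning check fits in a few lines of Python. The run-directory layout matches §02; the function name is illustrative:

```python
import json
from pathlib import Path

def morning_check(run_root="output/overnight_runs"):
    """Return (overall, promotion decision) for the newest run directory.

    Run IDs are timestamps, so lexicographic sort == chronological sort.
    """
    newest = sorted(p for p in Path(run_root).iterdir() if p.is_dir())[-1]
    status = json.loads((newest / "status.json").read_text())
    return status["overall"], status.get("promotion", {}).get("decision", "HELD")
```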

§ 05 · What to run when (calendar)

Phase of the year · Nightly config · Notes
Pre-sim (now → May) · full nightly with existing dataset_gates_mega · baseline detector + policy; build the data pipeline
VQ1 sim open (May) · nightly + sim-frame dataset extension · fine-tune detector on captured VQ1 frames; run PPO with detector_telemetry
VQ1 → VQ2 gap (May–Jun) · nightly with growing dataset · every attempt logs frames; weekly retrain catches improvements
VQ2 open (Jun–Jul) · nightly + adversarial augmentation · lighting/obstacles, hard negatives, chaos attempts
VQ2 cutoff → Physical (Aug) · stop nightly, start real-flight capture · sim-to-real residual dynamics; DIY rig tuning
Between VQ2 cutoff and Physical, do not keep burning overnight runs on sim tuning. That's when real-flight data capture + residual-dynamics work wins the Physical slot. See playbook §04.

§ 06 · Promotion + rollback

Each nightly run either promotes the new weights to models/latest/ or holds, based on the benchmark delta. Promotion criteria:

  1. All 5 phases (detector, keypoints, policy, export, benchmark) succeed.
  2. The promotion.metric (default detection_rate) in this run's benchmark ≥ the previous best + min_improvement.

If either fails, the run is HELD — previous models/latest/ stays in place.
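The two gates above can be expressed as a small decision function. This is a sketch of the documented behaviour, not the actual code in overnight_autotrainer.py:

```python
def promote_or_hold(phases, prev_metric, new_metric, min_improvement=0.005):
    """Gate 1: every phase succeeded. Gate 2: metric improved by enough."""
    if not all(p["success"] for p in phases):
        return "HELD", "phase failure"
    delta = new_metric - prev_metric
    if delta >= min_improvement:
        return "PROMOTED", f"improvement +{delta:.4f} >= +{min_improvement:.4f}"
    return "HELD", f"improvement +{delta:.4f} < +{min_improvement:.4f}"

# Mirrors the example status.json: 0.951 -> 0.968 clears the 0.005 bar.
print(promote_or_hold([{"success": True}] * 5, 0.951, 0.968))
```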

Manual promote / rollback

# Promote a specific run
python overnight_autotrainer.py promote --run 2026-04-22_00-00-12

# Restore most-recent backup
python overnight_autotrainer.py rollback

# Restore a specific backup
python overnight_autotrainer.py rollback --to 2026-04-21_00-00-00
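Under the hood, rollback amounts to copying the newest snapshot from models/backup/ over models/latest/. A sketch of that logic (the real subcommand may do more, e.g. verify the manifest first):

```python
import shutil
from pathlib import Path

def rollback(backup_root="models/backup", target="models/latest"):
    """Replace `target` with the newest snapshot under `backup_root`.

    Snapshot names are timestamps, so the lexicographic max is the newest.
    """
    snapshots = sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
    if not snapshots:
        raise RuntimeError("no backups to restore")
    shutil.rmtree(target, ignore_errors=True)
    shutil.copytree(snapshots[-1], target)
    return snapshots[-1].name
```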

§ 07 · Troubleshooting

Symptom · Likely cause · Fix
Precheck fails: "CUDA not available" · GPU driver or wrong torch build · re-install torch with the cu128 index URL
Precheck fails: "Low disk" · overnight runs accumulate logs + backups · prune output/overnight_runs/ entries > 30 days old and models/backup/
Phase 3 policy hangs at frame 43 · VRAM thrash (16 GB ceiling) · drop n_envs in train_apex.py from 4 to 2
Phase times out without finishing · timeout too tight for dataset size · raise timeout_s in config
Promotion never fires · benchmark metric not actually improving, or threshold too tight · check benchmark_results.json history; lower min_improvement temporarily
Status JSON shows phase "success" but weights missing · validator min_output_mb too low (silent failure) · raise min_output_mb for that phase
Task Scheduler fires but nothing happens · "Start in" dir wrong or Python path wrong · test manually first: right-click → Run
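For the "Low disk" row, a small prune helper keeps output/overnight_runs/ under control (illustrative; prune_old_runs is not an existing autotrainer command):

```python
import shutil
import time
from pathlib import Path

def prune_old_runs(root, max_age_days=30):
    """Delete run directories whose mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for d in Path(root).iterdir():
        if d.is_dir() and d.stat().st_mtime < cutoff:
            shutil.rmtree(d)
            removed.append(d.name)
    return removed
```

Run it against output/overnight_runs/ and models/backup/ separately; never point it at models/latest/.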

§ 08 · Related

Winning Playbook · STRATEGY · the why: effort budget, reliability math, data pipeline moat, sim-to-real
Local GPU Training · HARDWARE · RTX 5080 install + venv + raw train_apex.py usage
APEX Pipeline · REFERENCE · per-phase details: models, obs schemas, reward terms
Tuning Reference · REF · every gain, threshold, and timeout in the runtime and the trainer

TRAINING-RUNBOOK · v1.0 2026-04-21 · ← Index · Playbook