The commands we run to get from "today's dataset" to "submission-ready weights": one manual path (you type each command) and one automated path (overnight_autotrainer.py runs everything unattended, backs up and promotes weights, and writes a status file by morning).
The automated path is a single command, python overnight_autotrainer.py nightly, and its promoted weights land in models/latest/. Use the manual path the first few times, or when tuning a single knob and watching output live. Every command is copy-pasteable.
python submit_check.py
python -c "import torch; print(torch.cuda.get_device_name(0))"
nvidia-smi
Expect: 0 failures, "NVIDIA GeForce RTX 5080" (or your GPU), and nvidia-smi showing the card.
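If you want the same three checks in one scriptable helper, a minimal sketch is below. The function name is hypothetical and the real submit_check.py likely verifies much more; only the disk check and the optional torch/CUDA probe are shown.

```python
import shutil

def precheck(min_free_gb: float = 30.0, root: str = ".") -> dict:
    """Hypothetical sketch of a pre-training sanity check.

    Returns a dict of check-name -> bool. The disk threshold mirrors
    min_free_gb from overnight_config.yaml.
    """
    free_gb = shutil.disk_usage(root).free / 1e9
    checks = {"disk_free": free_gb >= min_free_gb}
    try:
        import torch  # only probed if installed
        checks["cuda"] = torch.cuda.is_available()
    except ImportError:
        checks["cuda"] = False
    return checks
```

A failing "cuda" here maps to the "CUDA not available" row in the troubleshooting table.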
python train_apex.py detector \
--dataset dataset_gates_mega \
--epochs 200
Output: models/apex_detector_best.pt, models/apex_detector_best.onnx.
python train_apex.py keypoints \
--dataset dataset_gates_mega \
--epochs 150
Output: models/apex_keypoints_best.pt, models/apex_keypoints_best.onnx.
python train_apex.py policy \
--steps 10000000 \
--observation-mode detector_telemetry
Output: output/apex_policy/apex_policy_best.zip, output/apex_policy/apex_policy.onnx.
Do not submit --observation-mode privileged weights. The privileged 24D obs uses NED gate bearings that don't exist in the real AIGP sim; it's for dev iteration only. detector_telemetry is what transfers.
python train_apex.py export
python benchmark_models.py
Appends to benchmark_results.json. If the new detector regresses, don't promote.
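The hold-if-regressed rule can be sketched as a small check over the benchmark history. The schema assumed here (a JSON list of run dicts, newest last) is an assumption, not the documented benchmark_results.json format:

```python
import json
from pathlib import Path

def should_promote(history_path: str,
                   metric: str = "detection_rate",
                   min_improvement: float = 0.005) -> bool:
    """Return True if the newest benchmark beats the previous best
    by at least min_improvement (the promotion threshold)."""
    runs = json.loads(Path(history_path).read_text())
    if len(runs) < 2:
        return True  # nothing to compare against yet
    prev_best = max(r[metric] for r in runs[:-1])
    return runs[-1][metric] >= prev_best + min_improvement
```

The metric and threshold names mirror the promotion section of overnight_config.yaml.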
python submit_check.py
python submit_check.py package
Output: submission.zip with code + weights.
overnight_autotrainer.py: one command runs everything above, backs up current weights, benchmarks, and promotes if improved.
# Simplest form (uses overnight_config.yaml)
python overnight_autotrainer.py nightly
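Internally, each phase boils down to "run a command under a timeout, then sanity-check the artifact size." A sketch under that assumption (the knob names timeout_s and min_output_mb come from the config; the actual implementation may differ):

```python
import subprocess
import time
from pathlib import Path

def run_phase(cmd: list[str], output_file: str,
              timeout_s: int, min_output_mb: float) -> dict:
    """Run one training phase and validate its output artifact.

    The return shape loosely follows the phase entries in status.json.
    """
    start = time.time()
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        ran_ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        ran_ok = False  # phase exceeded its timeout_s budget
    out = Path(output_file)
    big_enough = out.exists() and out.stat().st_size >= min_output_mb * 1e6
    return {"success": ran_ok and big_enough,
            "duration_s": round(time.time() - start, 1)}
```

The min_output_mb check is what catches "training exited cleanly but wrote a stub file" (see the troubleshooting table).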
| Command | What it does |
|---|---|
| nightly | Full pipeline: precheck → backup → train 1 → train 2 → train 3 → export → benchmark → promote-or-hold. Default. |
| precheck | Environment sanity only. Safe to run anytime. |
| once --phase detector | Run one phase. Useful for debugging one knob change. |
| once --phase policy --observation-mode privileged | Dev-only legacy obs for A/B comparisons. |
| bench | Benchmark only, no training. |
| status | Print last run's status.json. |
| promote --run <RUN_ID> | Manually promote a specific run's weights. |
| rollback | Restore the most recent backup into models/latest/. |
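The subcommands in the table map naturally onto argparse subparsers. A hypothetical skeleton (the real overnight_autotrainer.py may declare different flags or defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Skeleton CLI matching the command table above."""
    p = argparse.ArgumentParser(prog="overnight_autotrainer.py")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("nightly")    # full pipeline, the default mode
    sub.add_parser("precheck")   # environment sanity only
    once = sub.add_parser("once")
    once.add_argument("--phase", required=True,
                      choices=["detector", "keypoints", "policy"])
    once.add_argument("--observation-mode", default="detector_telemetry")
    sub.add_parser("bench")
    sub.add_parser("status")
    promote = sub.add_parser("promote")
    promote.add_argument("--run", required=True)
    rollback = sub.add_parser("rollback")
    rollback.add_argument("--to", default=None)  # default: newest backup
    return p
```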
output/overnight_runs/
    2026-04-22_00-00-12/
        status.json                        # machine-readable summary
        summary.md                         # human-readable pass/fail table
        1_detector.log                     # full training log (streamed)
        2_keypoints.log
        3_policy_detector-telemetry.log
        4_export.log
        5_benchmark.log
models/
    latest/                                # promoted weights (used by submit_check)
        apex_detector_best.pt
        apex_detector_best.onnx
        apex_keypoints_best.pt
        apex_keypoints_best.onnx
        apex_policy_best.zip
        apex_policy.onnx
        manifest.json                      # which run promoted these
    backup/
        2026-04-22_00-00-12/               # pre-run snapshot (rollback target)
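The pre-run backup is essentially a dated copy of models/latest/ into models/backup/. A sketch with shutil (directory names come from the layout above; the copy logic itself is assumed):

```python
import shutil
from pathlib import Path

def backup_latest(models_dir: str, run_id: str) -> Path:
    """Snapshot models/latest/ into models/backup/<run_id>/ before a run.

    Rollback is the same copy in the opposite direction.
    """
    src = Path(models_dir) / "latest"
    dst = Path(models_dir) / "backup" / run_id
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```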
{
  "run_id": "2026-04-22_00-00-12",
  "overall": "DONE",
  "finished_at": "2026-04-22T07:42:18",
  "total_duration_s": 27453,
  "phases": [
    {"name": "1_detector", "success": true, "duration_s": 7180, ...},
    {"name": "2_keypoints", "success": true, "duration_s": 5412, ...},
    {"name": "3_policy_detector-telemetry", "success": true, "duration_s": 14322, ...},
    {"name": "4_export", "success": true, "duration_s": 31, ...},
    {"name": "5_benchmark", "success": true, "duration_s": 108, ...}
  ],
  "promotion": {
    "promoted": true,
    "decision": "PROMOTED",
    "reason": "improvement +0.0162 ≥ +0.0050",
    "prev_metric": 0.951,
    "new_metric": 0.968,
    "files": ["apex_detector_best.pt", "apex_detector_best.onnx", ...]
  }
}
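A status.json shaped like the example above is easy to summarize without jq; a sketch (the one-line output format here is my own, not something the tool emits):

```python
def summarize_status(status: dict) -> str:
    """One-line summary of a status.json dict (shape as shown above)."""
    failed = [p["name"] for p in status["phases"] if not p["success"]]
    decision = status.get("promotion", {}).get("decision", "N/A")
    if failed:
        return f"{status['overall']} | failed: {', '.join(failed)}"
    return f"{status['overall']} | {decision}"
```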
Config reference (overnight_config.yaml):

min_free_gb: 30
require_clean_git: false

detector:
  dataset: dataset_gates_mega
  epochs: 200
  timeout_s: 14400
  min_output_mb: 5

keypoints:
  dataset: dataset_gates_mega
  epochs: 150
  timeout_s: 10800
  min_output_mb: 5

policy:
  steps: 10000000
  modes: [detector_telemetry]    # add "privileged" for A/B
  timeout_s: 21600
  min_output_mb: 0.5

benchmark:
  timeout_s: 600
  laps: 2

promotion:
  metric: detection_rate         # or mAP@50, total_time_s, gates_passed
  min_improvement: 0.005         # +0.5% absolute
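It's worth validating the parsed config before burning a night on it. A minimal check over the loaded dict (key names come from the file above; the validation rules are my own sketch, not part of the tool):

```python
def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    for phase in ("detector", "keypoints", "policy"):
        if phase not in cfg:
            problems.append(f"missing section: {phase}")
        elif cfg[phase].get("timeout_s", 0) <= 0:
            problems.append(f"{phase}.timeout_s must be positive")
    mi = cfg.get("promotion", {}).get("min_improvement")
    if mi is None or mi < 0:
        problems.append("promotion.min_improvement must be >= 0")
    return problems
```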
The promotion.min_improvement threshold prevents promoting noise.
Windows (Task Scheduler): create a task "AIGP Overnight Autotrain" with the action "Start a program":

Program/script: C:\Users\pc\aigp\Scripts\python.exe
Arguments: overnight_autotrainer.py nightly
Start in: C:\Users\pc\Downloads\grandprix-latest

Linux (cron):

# crontab -e
0 23 * * * cd /home/pc/grandprix && \
/home/pc/aigp/bin/python overnight_autotrainer.py \
nightly >> logs/cron.log 2>&1
Make sure the venv Python is pinned (not the system Python), and verify that cron inherits CUDA_HOME and PATH correctly.
python overnight_autotrainer.py status | jq -r '.overall, .promotion.decision'
Serve the run-dir over HTTP (any static server) if you want to view summary.md from a browser on a different machine.
| Phase of the year | Nightly config | Notes |
|---|---|---|
| Pre-sim (now → May) | Full nightly with existing dataset_gates_mega | Baseline detector + policy. Build the data pipeline. |
| VQ1 sim open (May) | Nightly + sim-frame dataset extension | Fine-tune detector on captured VQ1 frames · run PPO with detector_telemetry |
| VQ1 → VQ2 gap (May–Jun) | Nightly with growing dataset | Every attempt logs frames · weekly retrain catches improvements |
| VQ2 open (Jun–Jul) | Nightly + adversarial augmentation | Lighting / obstacles · hard negatives · chaos attempts |
| VQ2 cutoff → Physical (Aug) | Stop nightly. Start real-flight capture. | Sim-to-real residual dynamics · DIY rig tuning |
Each nightly run either promotes the new weights to models/latest/ or holds, based on the benchmark delta. Promotion criteria:
All phases report success, and promotion.metric (default detection_rate) in this run's benchmark is ≥ the previous best + min_improvement. If either condition fails, the run is HELD and the previous models/latest/ stays in place.
# Promote a specific run
python overnight_autotrainer.py promote --run 2026-04-22_00-00-12
# Restore most-recent backup
python overnight_autotrainer.py rollback
# Restore a specific backup
python overnight_autotrainer.py rollback --to 2026-04-21_00-00-00
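Rollback is the backup copy in reverse: restore a snapshot from models/backup/ into models/latest/. A sketch (picking the lexicographically newest backup works because the run IDs are sortable timestamps; the logic itself is assumed, not lifted from the tool):

```python
import shutil
from pathlib import Path
from typing import Optional

def rollback(models_dir: str, to: Optional[str] = None) -> str:
    """Restore a backup snapshot into models/latest/.

    With to=None, restores the most recent backup (mirrors the bare
    `rollback` command); with to set, mirrors `rollback --to <RUN_ID>`.
    """
    backups = sorted((Path(models_dir) / "backup").iterdir())
    src = Path(models_dir) / "backup" / to if to else backups[-1]
    dst = Path(models_dir) / "latest"
    shutil.rmtree(dst, ignore_errors=True)
    shutil.copytree(src, dst)
    return src.name
```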
| Symptom | Likely cause | Fix |
|---|---|---|
| Precheck fails: "CUDA not available" | GPU driver / wrong torch build | Re-install torch with cu128 index URL |
| Precheck fails: "Low disk" | Overnight runs accumulate logs + backups | Prune output/overnight_runs/ > 30 days old · prune models/backup/ |
| Phase 3 policy hangs at frame 43 | VRAM thrash · 16 GB ceiling | Drop n_envs in train_apex.py from 4 → 2 |
| Phase times out without finishing | Timeout too tight for dataset size | Raise timeout_s in config |
| Promotion never fires | Benchmark metric not actually improving · threshold too tight | Check benchmark_results.json history · lower min_improvement temporarily |
| Status JSON shows phase "success" but weights missing | Validator min_output_mb too low, silent failure | Raise min_output_mb for that phase |
| Task Scheduler fires but nothing happens | "Start in" dir wrong · Python path wrong | Test manually first: right-click → Run |
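The "Low disk" fix above (prune runs older than 30 days) is easy to make cron-able. A sketch using directory mtimes (the 30-day cutoff comes from the table; the helper itself is hypothetical):

```python
import shutil
import time
from pathlib import Path

def prune_old_runs(runs_dir: str, max_age_days: int = 30) -> list[str]:
    """Delete run directories older than max_age_days, judged by mtime.

    Point it at output/overnight_runs/ or models/backup/.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for d in Path(runs_dir).iterdir():
        if d.is_dir() and d.stat().st_mtime < cutoff:
            shutil.rmtree(d)
            removed.append(d.name)
    return sorted(removed)
```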
The why: effort budget, reliability math, data pipeline moat, sim-to-real.
RTX 5080 install + venv + raw train_apex.py usage.
Per-phase details: models, obs schemas, reward terms.
Every gain, threshold, timeout in the runtime and the trainer.