AI Grand Prix — Local GPU Training Guide

Train all AI models on your MSI RTX 5080 16G • $0 cost • ~2 hours total

Your GPU: MSI Gaming RTX 5080 Shadow 3X OC — 16 GB GDDR7, Blackwell architecture, 10,752 CUDA cores, ~2.6 GHz boost clock. This is a beast for training. Faster than a cloud A100 for single-GPU workloads, and you own it. No cloud account, no hourly billing, no SSH. Just clone, run one script, get weights.

GPU Compatibility

| GPU | VRAM | U-Net | YOLO | RL (2M steps) | Total Time |
|---|---|---|---|---|---|
| RTX 5080 (yours) | 16 GB GDDR7 | ~8 min | ~25 min | ~1.2 hr | ~2 hr |
| RTX 5090 | 32 GB GDDR7 | ~6 min | ~18 min | ~50 min | ~1.3 hr |
| RTX 4090 | 24 GB | ~12 min | ~35 min | ~1.5 hr | ~2.5 hr |
| RTX 4080 | 16 GB | ~18 min | ~50 min | ~2.5 hr | ~3.5 hr |
| RTX 4070 Ti | 12 GB | ~22 min | ~1 hr | ~3 hr | ~4 hr |
| RTX 3080 | 10 GB | ~30 min | ~1.5 hr | ~4 hr | ~6 hr |
| RTX 3060 | 12 GB | ~40 min | ~2 hr | ~5 hr | ~7.5 hr |
Your RTX 5080 is Blackwell architecture with GDDR7 memory — significantly faster than Ada Lovelace (40-series). The 16 GB VRAM handles all model sizes easily. Training all three models takes about 2 hours total and costs $0. This beats a $1.39/hr cloud A100 for single-GPU jobs.

One-Shot Setup & Training ~10 min setup + ~2 hr training

Prerequisites — NVIDIA drivers + CUDA skip if already installed

Check if CUDA is already working:

nvidia-smi

If you see your GPU listed, skip ahead to Install Python + PyTorch. If not:

Windows (RTX 5080)

# RTX 5080 requires driver 570+ and CUDA 12.8+
# Download latest Game Ready driver from nvidia.com/drivers
# (search: RTX 5080, Windows 11)
# Install, restart, then verify:
nvidia-smi
# Should show: NVIDIA GeForce RTX 5080, Driver 570.xx+, CUDA 12.8

Linux (Ubuntu)

sudo apt update
sudo apt install -y nvidia-driver-570
sudo reboot

# After reboot:
nvidia-smi
# Should show: RTX 5080, Driver 570+, CUDA 12.8

Install Python + PyTorch 5 min

Windows (PowerShell or CMD)

# Install Python 3.11+ from python.org if needed

# Create venv
python -m venv aigp
aigp\Scripts\activate

# PyTorch with CUDA 12.8 (required for RTX 5080 Blackwell)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Everything else
pip install opencv-python numpy scipy pyyaml ultralytics stable-baselines3 gymnasium onnxruntime mavsdk

Linux

# Create venv
python3 -m venv aigp
source aigp/bin/activate

# PyTorch with CUDA 12.8 (required for RTX 5080 Blackwell)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Everything else
pip install opencv-python numpy scipy pyyaml ultralytics stable-baselines3 gymnasium onnxruntime mavsdk

Verify GPU is visible to PyTorch:

python -c "import torch; print(torch.cuda.get_device_name(0))"
# Should print: NVIDIA GeForce RTX 5080
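If that one-liner fails or prints the wrong thing, the most common causes are a missing PyTorch install, a CPU-only wheel, or a driver problem. This hypothetical helper (not part of the repo) distinguishes them:

```python
def cuda_diagnosis():
    """Report why CUDA might be unavailable to PyTorch. Hypothetical helper."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this venv"
    if not torch.cuda.is_available():
        return ("PyTorch found, but no CUDA device visible; likely a CPU-only "
                "wheel (reinstall with the cu128 index URL above) or a driver "
                "problem (check nvidia-smi)")
    return "OK: " + torch.cuda.get_device_name(0)

print(cuda_diagnosis())
```

On a correctly set-up 5080 box it should print `OK: NVIDIA GeForce RTX 5080`; otherwise the message points at the fix.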

Clone the repo 1 min

git clone https://github.com/blakefarabi/grandprix.git
cd grandprix

Run the training script ~2 hours on RTX 5080

Windows

# Git Bash or WSL recommended:
bash train_all.sh

# Or run each model manually in CMD:
python gate_segmentation.py train --data dataset_gates_seg
python -c "from ultralytics import YOLO; m=YOLO('yolov8n.pt'); m.train(data='dataset_gates_yolo/data.yaml', epochs=50, imgsz=640, device=0)"
python rl_train.py train --steps 2000000

Linux

bash train_all.sh

That's it. The script handles everything: creates synthetic data, trains U-Net, YOLO, and RL policy, exports to ONNX.

Windows users without Git Bash: The train_all.sh script is a bash script. Install Git for Windows (includes Git Bash) or use WSL. Alternatively, run the three Python commands individually in CMD/PowerShell.
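If you'd rather not install Git Bash or WSL, the three commands above can also be chained from a single Python file. This is a sketch of a hypothetical `train_all.py` — it is not in the repo, it just wraps the same commands with subprocess so a failure in one step stops the run:

```python
"""Hypothetical cross-platform stand-in for train_all.sh (not in the repo)."""
import subprocess
import sys

STEPS = [
    ("U-Net segmentation",
     [sys.executable, "gate_segmentation.py", "train", "--data", "dataset_gates_seg"]),
    ("YOLO detector",
     [sys.executable, "-c",
      "from ultralytics import YOLO; m=YOLO('yolov8n.pt'); "
      "m.train(data='dataset_gates_yolo/data.yaml', epochs=50, imgsz=640, device=0)"]),
    ("RL policy (PPO)",
     [sys.executable, "rl_train.py", "train", "--steps", "2000000"]),
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; stop at the first non-zero exit code."""
    done = []
    for name, cmd in steps:
        print(f"=== {name}")
        if runner(cmd).returncode != 0:
            raise RuntimeError(f"{name} failed")
        done.append(name)
    return done

# To actually train: run_steps(STEPS)
```

The `runner` parameter exists only to make the sequencing logic testable without launching real training jobs.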

Monitor training progress check occasionally

The script prints progress for each model. You can also monitor GPU usage:

# In a separate terminal, watch GPU utilization:
nvidia-smi -l 5

# Expected during training:
#   GPU Util: 80-100%
#   Memory:   6-12 GB (varies by model)
#   Power:    200-360W (RTX 5080)
#   Temp:     60-80°C (normal)
If GPU temp exceeds 85°C: Check case airflow. Training is compute-heavy and will push the GPU hard for hours. Ensure fans are running and case is ventilated.
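The same numbers can be captured programmatically: `nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv,noheader,nounits` prints one CSV line per GPU, e.g. `97, 8123, 72`. A small parser for that output (hypothetical helper, not in the repo), with the 85 °C threshold from the note above:

```python
def parse_gpu_line(line):
    """Parse one nvidia-smi CSV line: 'util, memory MiB, temp C'."""
    util, mem_mib, temp_c = (int(x) for x in line.split(","))
    return {"util_pct": util, "mem_mib": mem_mib, "temp_c": temp_c}

def is_healthy(stats, max_temp_c=85):
    """True while the GPU stays at or below the thermal threshold."""
    return stats["temp_c"] <= max_temp_c

sample = parse_gpu_line("97, 8123, 72")
print(sample, is_healthy(sample))
```

Pipe the nvidia-smi query into this in a loop and you have a minimal thermal watchdog for long RL runs.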

Collect trained weights 1 min

After training completes, you'll have these files:

| File | Model | Size | Purpose |
|---|---|---|---|
| gate_seg.onnx | U-Net | ~15 MB | Gate segmentation (primary detector) |
| dataset_gates_seg/weights/best.pt | U-Net | ~15 MB | PyTorch weights (for further training) |
| yolo_runs/gates/weights/best.onnx | YOLO | ~6 MB | Gate detection (backup detector) |
| yolo_runs/gates/weights/best.pt | YOLO | ~6 MB | PyTorch weights |
| policy.onnx | RL (PPO) | ~0.4 MB | Learned racing policy |

Copy these to your Mac for deployment:

# From your Mac (replace PC_IP with your Windows machine's IP):
scp user@PC_IP:~/grandprix/gate_seg.onnx .
scp user@PC_IP:~/grandprix/yolo_runs/gates/weights/best.onnx .

# Or simpler — just push to GitHub from the PC:
cd grandprix
git add gate_seg.onnx yolo_runs/gates/weights/best.onnx
git commit -m "Add trained model weights (RTX 5080)"
git push

# Then pull on Mac:
git pull
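Whichever transfer route you use, it's cheap to confirm the weights arrived intact by comparing SHA-256 digests on both machines. A stdlib-only sketch (the file names are the ones listed above; any that are missing are skipped):

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large .onnx/.pt weights
    never need to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Run on both the PC and the Mac; the hex digests must match exactly.
for name in ["gate_seg.onnx", "policy.onnx"]:
    if Path(name).exists():
        print(name, sha256_of(name))
```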

Training Each Model Individually

If you want to train just one model instead of all three:

U-Net Only (~8 min)

python gate_segmentation.py train \
  --data dataset_gates_seg

# Export to ONNX
python gate_segmentation.py export \
  --weights dataset_gates_seg/weights/best.pt

Trains the pixel-level gate segmentation model. Produces the most accurate gate corners, which feed PnP depth estimation via RANSAC.
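The "corners → depth" idea reduces to the pinhole camera model: once segmentation gives the gate's apparent pixel width, depth ≈ focal_length_px × real_width_m / pixel_width. A minimal sketch — the 800 px focal length and 1.4 m gate width are illustrative values, not repo constants, and the real pipeline uses full PnP on all four corners with RANSAC:

```python
def gate_depth_m(pixel_width, focal_px=800.0, gate_width_m=1.4):
    """Pinhole-model estimate of distance to a gate from its apparent width."""
    if pixel_width <= 0:
        raise ValueError("gate not visible")
    return focal_px * gate_width_m / pixel_width

print(gate_depth_m(160))  # 800 * 1.4 / 160 = 7.0 m
```

As the drone closes in, `pixel_width` grows and the estimated depth shrinks, which is what the racing controller steers on.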

YOLO Only (~25 min)

# Auto-label from VQ1 footage (if available)
python yolo-auto-label.py

# Or train on synthetic data
python yolo-train.py train

# Export to ONNX / TensorRT
python yolo-train.py export

Trains the YOLOv8n bounding-box detector. Faster inference but less accurate corners.

RL Policy Only (~1.2 hr)

# Train PPO (2M steps)
python rl_train.py train --steps 2000000

# Export to ONNX
python rl_train.py export

Trains a neural racing controller that replaces the classical proportional pursuit. Potentially faster laps but needs more tuning.
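For reference, the classical proportional pursuit baseline the RL policy replaces fits in a few lines: steer toward the gate center with commands proportional to the offset. The gains and the normalized-offset convention here are illustrative assumptions, not the repo's actual controller:

```python
def pursuit_command(offset_x, offset_y, k_yaw=1.2, k_climb=0.8):
    """Map normalized gate-center offsets (-1..1) to (yaw_rate, climb_rate).

    offset_x > 0 means the gate center is right of the image center, so
    yaw right; offset_y > 0 means it is below center, so descend.
    Both outputs are clamped to [-1, 1].
    """
    clamp = lambda v: max(-1.0, min(1.0, v))
    return clamp(k_yaw * offset_x), clamp(-k_climb * offset_y)

print(pursuit_command(0.5, -0.25))  # (0.6, 0.2)
```

A learned policy can beat this by cutting corners and carrying speed between gates, which is why the RL model is worth the extra training time.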

Custom Training

# Fewer steps (faster, less converged)
python rl_train.py train --steps 500000

# More steps (slower, better policy)
python rl_train.py train --steps 10000000

# Resume from checkpoint
python rl_train.py train --resume \
  --checkpoint rl_checkpoints/best_model

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| torch.cuda.is_available() returns False | CUDA not installed or wrong PyTorch build | Reinstall PyTorch with CUDA 12.8: pip install torch --index-url https://download.pytorch.org/whl/cu128 |
| CUDA out of memory | Batch size too large for VRAM | Reduce batch size: edit gate_segmentation.py batch_size=4 (default 8) |
| bash: train_all.sh: not found | Not in repo directory | cd grandprix first |
| Windows: bash not recognized | No Git Bash / WSL | Install Git for Windows or run the Python commands individually |
| Training is very slow | Running on CPU instead of GPU | Verify: python -c "import torch; print(torch.cuda.get_device_name(0))" |
| GPU temp >85°C | Sustained compute load | Improve case airflow, or add --batch 4 to reduce load |
| ModuleNotFoundError | Missing dependency | pip install <module-name> |

SSH from Mac to PC (optional, train remotely)

If you want to kick off training from your Mac without touching the PC:

On the PC (one-time setup)

# Windows: Enable OpenSSH Server
# Settings → Apps → Optional Features
# → Add: OpenSSH Server → Install
# Then in PowerShell (admin):
Start-Service sshd
Set-Service -Name sshd -StartupType Automatic

# Find your PC's IP:
ipconfig
# Look for IPv4 Address (e.g., 192.168.1.42)

From your Mac

# SSH into the PC
ssh your-username@192.168.1.42

# Clone and train
git clone https://github.com/blakefarabi/grandprix.git
cd grandprix
bash train_all.sh

# Or run in background (disconnect-safe):
nohup bash train_all.sh > training.log 2>&1 &
# Check progress later:
tail -f training.log
Pro tip: Use nohup so training continues even if you close the SSH session. Check back with tail -f training.log anytime.

Local GPU vs Cloud — Cost Comparison

| | RTX 5080 (Yours) | RunPod A4000 | RunPod A100 |
|---|---|---|---|
| Cost per session | $0 (free) | $1.15 | $4.18 |
| U-Net time | ~8 min | ~30 min | ~15 min |
| YOLO time | ~25 min | ~1.5 hr | ~45 min |
| RL time | ~1.2 hr | ~4 hr | ~2 hr |
| Total | ~2 hr | ~6 hr | ~3 hr |
| Setup time | 10 min (first time) | 5 min | 5 min |
| Data transfer | Instant (local) | Download weights | Download weights |
| Best for | Everything (fastest + free) | No local GPU | Multi-GPU only |
Bottom line: Your RTX 5080 is faster than a cloud A100 for single-GPU training and costs $0. It's your best option for everything. Use cloud GPUs only when you need multi-GPU distributed training or can't access your PC. The 5080's GDDR7 bandwidth and Blackwell CUDA cores crush all cloud options at this price point (free).
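The cost column is just hourly rate × wall-clock hours. A quick sanity check using the $1.39/hr A100 figure quoted earlier in this guide (treat the rates as assumptions — cloud pricing changes):

```python
def session_cost(rate_per_hr, hours):
    """Cloud cost of one training session, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# A100 at $1.39/hr for the ~3 hr total above:
print(session_cost(1.39, 3.0))  # 4.17, in line with the ~$4.18 table figure
# Local RTX 5080: electricity aside, the marginal cost is zero.
print(session_cost(0.0, 2.0))
```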

Back to Documentation Hub