Known issues · verified fixes

Things that break
and how we fix them.

Verified issues across install, training, inference, and sim integration. Ordered by how often they bite. New symptoms welcome — add them as you find them.

Scope: install · train · infer · sim (end-to-end)

Verified: on RTX 5080 · Windows 11 · Py 3.10/3.12 (our dev box)

Unverified: anti-cheat handshake · real AIGP sim (gated on sim release)

§ 01 Install / environment

Symptom	Cause	Fix
`torch.cuda.is_available() == False`	CPU-only Torch	`pip install torch --index-url https://download.pytorch.org/whl/cu128`
`UnicodeDecodeError: 'charmap' codec` reading Python files	Windows cp1252 default encoding	Open with `encoding='utf-8'`, or set `PYTHONUTF8=1`
NVIDIA driver vs CUDA toolkit mismatch	RTX 5080 needs 570+ driver, CUDA 12.8	Update from nvidia.com/drivers
`ModuleNotFoundError: No module named 'ultralytics'`	Dependency missing	`pip install ultralytics`
Conda env creation silently aborts mid-transaction	Bash subprocess killed before commit finalises	Remove the half-created env dir, then retry with a longer timeout (at least 5 min)
Conda channel ToS prompt	Fresh Miniconda, Anaconda default channels	`conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main` (and `/r`, `/msys2`)
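
Most rows in this table can be checked in one pass before you start debugging. A minimal sketch, assuming only the standard library plus whatever is already installed; `check_environment` and `read_source` are hypothetical helper names, not part of the repo:

```python
import importlib.util
import sys


def check_environment():
    """Collect the facts the install table asks about, without raising on missing deps."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],
        # UTF-8 mode is on when PYTHONUTF8=1 (or -X utf8) was set at startup.
        "utf8_mode": bool(sys.flags.utf8_mode),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "ultralytics_installed": importlib.util.find_spec("ultralytics") is not None,
        "cuda_available": None,  # stays None when Torch is not installed
    }
    if report["torch_installed"]:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None on CPU-only wheels.
        report["torch_cuda_build"] = torch.version.cuda
    return report


def read_source(path):
    """Read a text file as UTF-8 regardless of the Windows code page."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `torch_installed` is true but `torch_cuda_build` is `None`, you are on the CPU-only wheel and the first row's fix applies.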

§ 02 Training (APEX)

Symptom	Cause	Fix
CUDA illegal memory access during PPO	VRAM exhausted (15+ GB / 16 GB used)	Drop parallel envs 4 → 2, drop `camera_num_iterations` to 1, or close other GPU apps
PPO crashes at step ~43 reproducibly	KV cache blow-up in Phase 3 with expensive obs	Stay on `--observation-mode=privileged` for dev iteration; only flip to `detector_telemetry` when submitting
YOLO training fails with `ImportError: libGL.so.1`	Full `opencv-python` needs the system libGL, which headless Linux lacks	`apt install libgl1`, or swap to `pip install opencv-python-headless`
Policy converges to hovering or crashing	Reward terms weighted incorrectly	Check the crash penalty is `-500`, not `-50`; check the perception-aware term is wired in
ONNX export fails on PPO policy	stable_baselines3 saves as zip, not ONNX	Use `train_apex.py export` which handles the SB3 → ONNX conversion
Training proceeds but validation mAP stuck	Dataset too small / no hard negatives	Generate more with `generate_training_sets.py`; add `dataset_gates_hardneg`

§ 03 Inference / pilot

Symptom	Cause	Fix
Pilot commands all zero	Detector returns None (stub or bad weights)	Pass real `--detector` / `--keypoints` paths; verify `ApexDetectorChain._loaded`
PnP depth wildly wrong	Gate dimensions off (1.5m assumption)	Set `gate_width_m` / `gate_height_m` to actual value from sim spec
Controller saturates at ±1 permanently	PID gains too hot for telemetry latency	Drop `kp_*` 20–40%. See tuning reference
Flight oscillates on straights	Derivative kick from noisy body rates	Low-pass filter `telemetry.body_rates` before the PID; reduce `kd`
Sudden 180° yaw flip	Target-gate tracker picked a rear-facing detection	Keep only detections whose gate-forward vector has a positive dot product with camera-forward
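
The last two fixes are small enough to sketch. This is an illustration under assumptions, not the pilot's actual code: `LowPass` and `forward_facing` are hypothetical names, the detection is assumed to be a dict with a `gate_forward` vector, and the cutoff and sample rates are placeholders to tune.

```python
import math


class LowPass:
    """First-order (exponential) low-pass filter for a multi-axis signal."""

    def __init__(self, cutoff_hz, sample_hz):
        # Standard RC-filter discretisation: alpha = dt / (RC + dt).
        dt = 1.0 / sample_hz
        rc = 1.0 / (2.0 * math.pi * cutoff_hz)
        self.alpha = dt / (rc + dt)
        self.state = None

    def update(self, sample):
        if self.state is None:
            # Seed with the first sample so the filter does not ramp up from zero.
            self.state = tuple(sample)
        else:
            a = self.alpha
            self.state = tuple(s + a * (x - s) for s, x in zip(self.state, sample))
        return self.state


def forward_facing(detections, camera_forward):
    """Keep detections whose gate-forward vector points the same way as the camera."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Assumed detection shape: {"gate_forward": (x, y, z), ...}
    return [d for d in detections if dot(d["gate_forward"], camera_forward) > 0.0]


if __name__ == "__main__":
    lp = LowPass(cutoff_hz=10.0, sample_hz=100.0)
    for rates in [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]:
        print(lp.update(rates))
```

Feed the filtered rates into the derivative term only; filtering the proportional path as well adds lag you probably do not want.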

§ 04 AIGP sim integration (expected)

Symptom	Likely cause	Mitigation
Anti-cheat handshake fails	Firewall / proxy	Run on direct-net connection for first install
Telemetry payload fields don't match	Attitude as Euler instead of quaternion	Adapter in pilot: `euler_to_quat()` helper
Command rate mismatch	Sim expects 50 Hz, we push 10 Hz	Repeat the last command until a new frame arrives, or increase the capture rate
Parallel instance port conflict	Default port 6000 used by all instances	Pass `port=6000 + instance_id` to `AIGPSimEnv`
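
Since the real sim is not out yet, the mitigations above are guesses, but the adapters themselves are simple. A sketch under assumptions: the Euler angles are taken as ZYX (yaw, pitch, roll) in radians and the quaternion is ordered `(w, x, y, z)`, both of which must be checked against the sim spec; `hold_last_command` and `instance_port` are hypothetical helper names.

```python
import math


def euler_to_quat(roll, pitch, yaw):
    """Convert ZYX Euler angles (radians) to a unit quaternion (w, x, y, z)."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    )


def hold_last_command(commands, sim_hz=50, pilot_hz=10):
    """Zero-order hold: repeat each pilot command to fill the sim's faster tick rate.

    Assumes sim_hz is an integer multiple of pilot_hz.
    """
    repeats = sim_hz // pilot_hz
    for cmd in commands:
        for _ in range(repeats):
            yield cmd


def instance_port(instance_id, base_port=6000):
    """One port per parallel instance, offset from the default."""
    return base_port + instance_id


if __name__ == "__main__":
    print(euler_to_quat(0.0, 0.0, math.pi / 2))
    print(list(hold_last_command(["cmd_a", "cmd_b"])))
    print(instance_port(3))
```

The hold here works on a recorded command sequence; a live loop would keep the last command in a variable and resend it on every sim tick until the pilot produces a new one.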

§ 05 Submission

Symptom	Cause	Fix
`submit_check.py` fails on hover_thrust	Default 0.5 without calibration	Calibrate on SimDrone: `hover_thrust = mass·g / (4·k_f·rpm_max²)`; expect <0.5
`submission.zip` missing weights	Weights gitignored	Explicitly copy `models/*.onnx` into zip via `submit_check.py package --include-weights`
Package contains stray `course_map_test.json`	Leaks privileged data to reviewer	Ensure `course_map_*.json` is in the package-exclude list
Reviewer asks for repro	Deps not pinned	Ship `reproduce.sh` + pinned `requirements.txt`; see submission guide
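
The hover-thrust formula and the two packaging rows can be checked by hand before running `submit_check.py`. A rough sketch, not a replacement for that script: `audit_package` is a hypothetical helper, and the mass, `k_f`, and `rpm_max` values in the demo are made-up placeholders (with `k_f` assumed to be in N per rpm²), so substitute the numbers calibrated on SimDrone.

```python
import fnmatch
import zipfile

G = 9.81  # gravitational acceleration, m/s^2


def hover_thrust(mass_kg, k_f, rpm_max, rotors=4):
    """Normalised hover throttle: weight divided by maximum total thrust."""
    return (mass_kg * G) / (rotors * k_f * rpm_max ** 2)


def audit_package(zip_path, exclude_glob="course_map_*.json", weights_glob="*.onnx"):
    """Return (stray_files, has_weights) for a submission archive."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Match on the file name only so nested paths are caught too.
    basenames = [n.rsplit("/", 1)[-1] for n in names]
    stray = [n for n, b in zip(names, basenames) if fnmatch.fnmatch(b, exclude_glob)]
    has_weights = any(fnmatch.fnmatch(b, weights_glob) for b in basenames)
    return stray, has_weights


if __name__ == "__main__":
    # Placeholder airframe numbers, for illustration only.
    value = hover_thrust(mass_kg=1.0, k_f=2e-8, rpm_max=20000)
    print(f"hover_thrust = {value:.3f}")
    assert value < 0.5, "expected a calibrated hover_thrust below 0.5"
```

Call `audit_package("submission.zip")` on the built archive; a non-empty `stray` list or `has_weights == False` corresponds to the second and third rows above.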

If you hit a symptom not listed here, log it in troubleshooting.html with the exact error, cause, and fix. No "re-run and it worked" entries — those rot fast.

TROUBLESHOOTING · v2.0 · 2026-04-21
