Known issues · verified fixes

Things that break and how we fix them.

Verified issues across install, training, inference, and sim integration, ordered by how often they bite. New symptoms are welcome: add them as you find them.

Scope: Install · train · infer · sim (end-to-end)
Verified: On RTX 5080 · Windows 11 · Py 3.10/3.12 (our dev box)
Unverified: Anti-cheat handshake · real AIGP sim (gated on sim release)

§ 01 Install / environment

| Symptom | Cause | Fix |
| --- | --- | --- |
| torch.cuda.is_available() == False | CPU-only Torch | pip install torch --index-url https://download.pytorch.org/whl/cu128 |
| UnicodeDecodeError: 'charmap' codec reading Python files | Windows cp1252 default encoding | Open with encoding='utf-8', or set PYTHONUTF8=1 |
| NVIDIA driver vs CUDA toolkit mismatch | RTX 5080 needs 570+ driver, CUDA 12.8 | Update from nvidia.com/drivers |
| ModuleNotFoundError: ultralytics | Dependency missing | pip install ultralytics |
| Conda env creation silently aborts mid-transaction | Bash subprocess killed before commit finalises | Remove half-created env dir, retry with a longer timeout (at least 5 min) |
| Conda channel ToS prompt | Fresh Miniconda, Anaconda default channels | conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main (and /r, /msys2) |
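The cp1252 row above comes down to never relying on the platform default encoding. A minimal sketch of that fix; the helper names read_source and roundtrip_demo are hypothetical, and only the encoding='utf-8' argument comes from the table:

```python
import tempfile
from pathlib import Path


def read_source(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    # On Windows the default is often cp1252, which can raise
    # UnicodeDecodeError on UTF-8 bytes. Passing encoding explicitly avoids that.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def roundtrip_demo():
    """Write a non-ASCII snippet as UTF-8 and read it back (illustrative only)."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.py"
        sample.write_text("label = 'yaw 180° flip'\n", encoding="utf-8")
        return read_source(sample)
```

Setting PYTHONUTF8=1 in the environment enables Python's UTF-8 mode process-wide (Python 3.7+), which also covers third-party code that calls open() without an encoding argument.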

§ 02 Training (APEX)

| Symptom | Cause | Fix |
| --- | --- | --- |
| CUDA illegal memory access during PPO | VRAM exhausted (15+ GB / 16 GB used) | Drop parallel envs 4 → 2, drop camera_num_iterations to 1, or close other GPU apps |
| PPO crashes at step ~43 reproducibly | KV cache blow-up in Phase 3 with expensive observations | Stay on --observation-mode=privileged for dev iteration; only flip to detector_telemetry when submitting |
| YOLO training says ImportError: libGL | Full opencv-python installed on Linux without libGL (headless vs full build) | apt install libgl1, or pip install opencv-python-headless |
| Policy converges to hovering or crashing | Reward not weighted correctly | Check that the crash penalty is -500, not -50; check that the perception-aware term is wired in |
| ONNX export fails on PPO policy | stable_baselines3 saves as zip, not ONNX | Use train_apex.py export, which handles the SB3 → ONNX conversion |
| Training proceeds but validation mAP stuck | Dataset too small / no hard negatives | Generate more with generate_training_sets.py; add dataset_gates_hardneg |
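The reward-weighting row above lends itself to a pre-flight check before a long PPO run. A rough sketch, assuming the reward settings live in a plain dict; the keys crash_penalty and perception_weight and the function name are hypothetical and not taken from the APEX code. Only the -500 value comes from the table:

```python
EXPECTED_CRASH_PENALTY = -500.0  # value from the table above


def reward_config_problems(config):
    """Return a list of human-readable problems found in a reward config dict.

    The dict keys used here are assumptions for illustration only.
    """
    problems = []

    crash = config.get("crash_penalty")
    if crash is None:
        problems.append("crash_penalty is missing")
    elif crash > EXPECTED_CRASH_PENALTY:
        # -50 is greater than -500, i.e. a weaker penalty than intended.
        problems.append(
            f"crash_penalty {crash} is weaker than {EXPECTED_CRASH_PENALTY}"
        )

    # A zero or absent weight means the perception-aware term has no effect.
    if config.get("perception_weight", 0.0) <= 0.0:
        problems.append("perception-aware term has no weight (not wired in)")

    return problems
```

Calling this at the start of training and aborting on a non-empty list is cheaper than discovering a hover-only policy several hours in.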

§ 03 Inference / pilot

| Symptom | Cause | Fix |
| --- | --- | --- |
| Pilot commands all zero | Detector returns None (stub or bad weights) | Pass real --detector / --keypoints paths; verify ApexDetectorChain._loaded |
| PnP depth wildly wrong | Gate dimensions off (1.5 m assumption) | Set gate_width_m / gate_height_m to actual value from sim spec |
| Controller saturates at ±1 permanently | PID gains too hot for telemetry latency | Drop kp_* by 20–40%; see the tuning reference |
| Flight oscillates on straights | Derivative kick from noisy body rates | Add a low-pass filter on telemetry.body_rates before the PID; reduce kd |
| Sudden 180° yaw flip | Target-gate tracker picked a rear-facing detection | Keep only detections where dot(gate-forward, camera-forward) > 0 |
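The last two fixes in the table above are small enough to sketch. This is an illustration under assumptions, not the pilot's actual code: the class and function names are hypothetical, body rates are taken to be a 3-tuple, and the gate-forward vector is assumed to point along the direction of travel through the gate, expressed in the same frame as the camera-forward vector.

```python
class LowPass:
    """First-order exponential low-pass filter for a 3-axis rate vector."""

    def __init__(self, alpha):
        # alpha = 1.0 passes samples through unchanged; smaller values smooth more.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state = None

    def update(self, sample):
        """Blend the new sample into the running state and return the state."""
        if self.state is None:
            # Seed with the first sample to avoid a start-up transient.
            self.state = tuple(sample)
        else:
            a = self.alpha
            self.state = tuple(
                a * x + (1.0 - a) * s for x, s in zip(sample, self.state)
            )
        return self.state


def faces_camera(gate_forward, camera_forward):
    """True when the gate-forward and camera-forward vectors have a positive dot product."""
    dot = sum(g * c for g, c in zip(gate_forward, camera_forward))
    return dot > 0.0


def keep_forward_gates(detections, camera_forward):
    """Drop detections that fail the dot-product test.

    Each detection is assumed to be a dict with a 'forward' 3-vector.
    """
    return [d for d in detections if faces_camera(d["forward"], camera_forward)]
```

Feeding the filtered rates, rather than the raw telemetry, into the derivative term is what removes the kick; a smaller alpha smooths more but adds lag, so it trades off against the latency issue in the row above it.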

§ 04 AIGP sim integration (expected)

| Symptom | Likely cause | Mitigation |
| --- | --- | --- |
| Anti-cheat handshake fails | Firewall / proxy | Run on a direct network connection for the first install |
| Telemetry payload fields don't match | Attitude as Euler instead of quaternion | Add an adapter in the pilot: euler_to_quat() helper |
| Command rate mismatch | Sim expects 50 Hz, we push 10 Hz | Repeat the last command until a new frame arrives, or increase the capture rate |
| Parallel instance port conflict | Default port 6000 used by all instances | Pass port=6000 + instance_id to AIGPSimEnv |
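Two of the mitigations above can be sketched ahead of the sim release. Since the real AIGP payload is unverified, everything about the format here is an assumption: angles in radians, intrinsic Z-Y-X (yaw, pitch, roll) rotation order, and quaternion output ordered (w, x, y, z). The euler_to_quat signature and the instance_port helper are hypothetical.

```python
import math


def euler_to_quat(roll, pitch, yaw):
    """Convert roll/pitch/yaw in radians to a unit quaternion (w, x, y, z).

    Assumes the common aerospace Z-Y-X convention; check it against the
    sim spec once that is available.
    """
    cr, sr = math.cos(roll / 2.0), math.sin(roll / 2.0)
    cp, sp = math.cos(pitch / 2.0), math.sin(pitch / 2.0)
    cy, sy = math.cos(yaw / 2.0), math.sin(yaw / 2.0)

    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return (w, x, y, z)


def instance_port(instance_id, base_port=6000):
    """Give each parallel sim instance its own port: base_port + instance_id."""
    if instance_id < 0:
        raise ValueError("instance_id must be non-negative")
    return base_port + instance_id
```

If the sim turns out to use degrees or a different axis order, only the adapter needs to change, which is the reason for keeping the conversion in one helper.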

§ 05 Submission

| Symptom | Cause | Fix |
| --- | --- | --- |
| submit_check.py fails on hover_thrust | Default 0.5 without calibration | Calibrate on SimDrone: hover_thrust = mass·g / (4·k_f·rpm_max²); expect < 0.5 |
| submission.zip missing weights | Weights gitignored | Explicitly copy models/*.onnx into zip via submit_check.py package --include-weights |
| Package contains stray course_map_test.json | Leaks privileged data to reviewer | Ensure course_map_*.json is in the package-exclude list |
| Reviewer asks for repro | Dependencies not pinned | Ship reproduce.sh + pinned requirements.txt; see submission guide |
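The hover_thrust formula in the table above is a one-line calculation. A minimal sketch of it; the function signature is hypothetical and the numbers in the usage note are made up for illustration, not SimDrone values:

```python
G = 9.81  # gravitational acceleration in m/s^2


def hover_thrust(mass_kg, k_f, rpm_max, num_rotors=4):
    """Fraction of maximum total thrust needed to hover.

    Implements hover_thrust = mass * g / (num_rotors * k_f * rpm_max^2),
    where k_f is the per-rotor thrust coefficient in N/rpm^2.
    """
    max_total_thrust = num_rotors * k_f * rpm_max ** 2
    if max_total_thrust <= 0.0:
        raise ValueError("k_f and rpm_max must be positive")
    return mass_kg * G / max_total_thrust
```

For example, hover_thrust(0.8, 2.0e-8, 25000) describes a hypothetical 0.8 kg quadrotor whose rotors each produce 12.5 N at full speed. This treats the command as a fraction of maximum total thrust, which is what the table's formula implies; a result at or above 0.5 would suggest the mass, k_f, or rpm_max inputs are wrong.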
If you hit a symptom not listed here, log it in troubleshooting.html with the exact error, cause, and fix. No "re-run and it worked" entries: those rot fast.
TROUBLESHOOTING · v2.0 · 2026-04-21