Known issues · verified fixes
Things that break and how we fix them.
Verified issues across install, training, inference, and sim integration. Ordered by how often they bite. New symptoms welcome — add them as you find them.
Scope: Install · train · infer · sim (end-to-end)
Verified: On RTX 5080 · Windows 11 · Py 3.10/3.12 (our dev box)
Unverified: Anti-cheat handshake · real AIGP sim (gated on sim release)
§ 01 Install / environment
| Symptom | Cause | Fix |
| torch.cuda.is_available() == False | CPU-only Torch | pip install torch --index-url https://download.pytorch.org/whl/cu128 |
| UnicodeDecodeError: 'charmap' codec reading Python files | Windows cp1252 default encoding | Open with encoding='utf-8', or set PYTHONUTF8=1 |
| NVIDIA driver vs CUDA toolkit mismatch | RTX 5080 needs 570+ driver, CUDA 12.8 | Update from nvidia.com/drivers |
| ModuleNotFoundError: ultralytics | Dep missing | pip install ultralytics |
| Conda env creation silently aborts mid-transaction | Bash subprocess killed before commit finalises | Remove half-created env dir, retry with longer timeout (min 5 min) |
| Conda channel ToS prompt | Fresh Miniconda, Anaconda default channels | conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main (and /r, /msys2) |
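
The encoding fix in the UnicodeDecodeError row can be sketched as below. This is a minimal illustration using only the standard library; `read_source` and the demo file are hypothetical names, not part of the project.

```python
import os
import tempfile
from pathlib import Path


def read_source(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    # Without an explicit encoding, Windows may fall back to cp1252, which
    # can raise UnicodeDecodeError or silently garble non-ASCII characters.
    return Path(path).read_text(encoding="utf-8")


# Demo (hypothetical file): a source line containing a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.py")
    Path(demo_path).write_text("# tolerance: ±0.5 m\n", encoding="utf-8")
    demo_text = read_source(demo_path)
```

Setting PYTHONUTF8=1 has the same effect process-wide, which helps when the failing `open()` call is inside a dependency you do not control.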
§ 02 Training (APEX)
| Symptom | Cause | Fix |
| CUDA illegal memory access during PPO | VRAM exhausted (15+ GB / 16 GB used) | Drop parallel envs 4 → 2, drop camera_num_iterations to 1, or close other GPU apps |
| PPO crashes at step ~43 reproducibly | KV cache blow-up in Phase 3 with expensive obs | Stay on --observation-mode=privileged for dev iteration; only flip to detector_telemetry when submitting |
| YOLO training fails with ImportError: libGL | Full opencv-python needs the system libGL library, which is missing on Linux | apt install libgl1, or switch to the headless build with pip install opencv-python-headless |
| Policy converges to hovering or crashing | Reward not weighted correctly | Check crash penalty: -500, not -50; check the perception-aware term is wired in |
| ONNX export fails on PPO policy | stable_baselines3 saves as zip, not ONNX | Use train_apex.py export which handles the SB3 → ONNX conversion |
| Training proceeds but validation mAP stuck | Dataset too small / no hard negatives | Generate more with generate_training_sets.py; add dataset_gates_hardneg |
§ 03 Inference / pilot
| Symptom | Cause | Fix |
| Pilot commands all zero | Detector returns None (stub or bad weights) | Pass real --detector / --keypoints paths; verify ApexDetectorChain._loaded |
| PnP depth wildly wrong | Gate dimensions off (1.5 m assumption) | Set gate_width_m / gate_height_m to the actual values from the sim spec |
| Controller saturates at ±1 permanently | PID gains too hot for telemetry latency | Drop kp_* 20–40%. See tuning reference |
| Flight oscillates on straights | Derivative kick from noisy body rates | Add a low-pass filter on telemetry.body_rates before the PID; reduce kd |
| Sudden 180° yaw flip | Target-gate tracker picked a rear-facing detection | Filter detections by gate-forward dot with camera-forward > 0 |
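
Two of the fixes above (low-pass on body rates, forward-facing detection filter) can be sketched as follows. This is an illustration under assumptions, not the pilot's actual code: `LowPass` and `facing_camera` are hypothetical names, the filter is a plain first-order exponential smoother, and the cutoff and sample rate are placeholder values to tune.

```python
import math


class LowPass:
    """First-order (exponential) low-pass filter for a 3-axis rate vector."""

    def __init__(self, cutoff_hz, sample_hz):
        # Standard RC discretisation: alpha = dt / (RC + dt).
        dt = 1.0 / sample_hz
        rc = 1.0 / (2.0 * math.pi * cutoff_hz)
        self.alpha = dt / (rc + dt)
        self.state = None

    def update(self, rates):
        if self.state is None:
            # Seed the filter with the first sample to avoid a start-up step.
            self.state = tuple(rates)
        else:
            a = self.alpha
            self.state = tuple(s + a * (r - s) for s, r in zip(self.state, rates))
        return self.state


def facing_camera(gate_forward, camera_forward):
    """Keep a detection only if gate-forward dot camera-forward is > 0."""
    return sum(g * c for g, c in zip(gate_forward, camera_forward)) > 0.0


# Placeholder rates: 20 Hz cutoff on a 100 Hz telemetry stream (assumed).
lp = LowPass(cutoff_hz=20.0, sample_hz=100.0)
lp.update((0.0, 0.0, 0.0))
smoothed = lp.update((1.0, 0.0, 0.0))  # a step input is attenuated, not passed through
```

A lower cutoff smooths more but adds phase lag, so it trades derivative noise against the same latency that makes hot gains saturate.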
§ 04 AIGP sim integration (expected)
| Symptom | Likely cause | Mitigation |
| Anti-cheat handshake fails | Firewall / proxy | Run on direct-net connection for first install |
| Telemetry payload fields don't match | Attitude as Euler instead of quaternion | Adapter in pilot: euler_to_quat() helper |
| Command rate mismatch | Sim expects 50 Hz, we push 10 Hz | Duplicate last command until new frame; or increase capture rate |
| Parallel instance port conflict | Default port 6000 used by all instances | Pass port=6000 + instance_id to AIGPSimEnv |
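
A possible shape for the `euler_to_quat()` adapter and the per-instance port rule is sketched below. The sim's real conventions are unverified, so the ZYX (yaw-pitch-roll) rotation order, radians as input, and the (w, x, y, z) output order are all assumptions; `instance_port` is a hypothetical helper.

```python
import math


def euler_to_quat(roll, pitch, yaw):
    """Convert ZYX Euler angles (radians) to a unit quaternion (w, x, y, z).

    Assumes the aerospace yaw-pitch-roll convention; confirm against the
    sim spec before relying on it.
    """
    cr, sr = math.cos(roll / 2.0), math.sin(roll / 2.0)
    cp, sp = math.cos(pitch / 2.0), math.sin(pitch / 2.0)
    cy, sy = math.cos(yaw / 2.0), math.sin(yaw / 2.0)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return (w, x, y, z)


def instance_port(instance_id, base_port=6000):
    """Give each parallel sim instance its own port: base_port + instance_id."""
    return base_port + instance_id


identity = euler_to_quat(0.0, 0.0, 0.0)
quarter_yaw = euler_to_quat(0.0, 0.0, math.pi / 2.0)  # 90 degrees of yaw only
```

If the sim turns out to use a different axis order or degrees, only this adapter needs to change; the rest of the pilot can keep consuming quaternions.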
§ 05 Submission
| Symptom | Cause | Fix |
| submit_check.py fails on hover_thrust | Default 0.5 without calibration | Calibrate on SimDrone: hover_thrust = mass·g / (4·k_f·rpm_max²); expect <0.5 |
| submission.zip missing weights | Weights gitignored | Explicitly copy models/*.onnx into zip via submit_check.py package --include-weights |
| Package contains stray course_map_test.json | Leaks privileged data to reviewer | Ensure course_map_*.json is in the package-exclude list |
| Reviewer asks for repro | Deps not pinned | Ship reproduce.sh + pinned requirements.txt; see submission guide |
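
The hover_thrust calibration formula from the first row can be worked through as below. The numbers are made-up placeholders, not SimDrone's real parameters; `hover_thrust` here is a standalone illustration of the arithmetic, not the project's calibration routine.

```python
def hover_thrust(mass_kg, k_f, rpm_max, g=9.81):
    """Normalised hover thrust: mass*g / (4 * k_f * rpm_max**2)."""
    # Total thrust of four motors at full throttle, in newtons.
    max_total_thrust = 4.0 * k_f * rpm_max ** 2
    if max_total_thrust <= 0.0:
        raise ValueError("k_f and rpm_max must be positive")
    return mass_kg * g / max_total_thrust


# Hypothetical values: 1.0 kg drone, k_f = 1e-8 N/rpm^2, 25,000 rpm maximum.
# Each motor peaks at 6.25 N, so 25 N total against 9.81 N of weight.
example = hover_thrust(mass_kg=1.0, k_f=1e-8, rpm_max=25_000)  # about 0.39
```

A result at or above 0.5 suggests the mass, k_f, or rpm_max value is off, since the table expects a calibrated value below 0.5.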
If you hit a symptom not listed here, log it in troubleshooting.html with the exact error, cause, and fix. No "re-run and it worked" entries — those rot fast.
TROUBLESHOOTING · v2.0
2026-04-21