Evaluation harness

Benchmark every detector against the same course.

An apples-to-apples comparison of the APEX Phase 1 detectors (YOLO11n, RF-DETR, U-Net, Color) on the same SimDrone course. Outputs gate-pass count, lap time, per-frame latency, detection rate, and false-positive rate. All results are recorded to benchmark_results.json for trend tracking.

Harness    benchmark_models.py · one command
Duration   Full: ~2 min · Quick: ~1 min (SimDrone proxy)
Output     benchmark_results.json (append-only history)
Target     mAP@50 > 99% · <5 ms GPU (VQ2-ready)

§ 01 Run the benchmark

python benchmark_models.py                # full (2 laps)
python benchmark_models.py --quick        # 1 lap
python benchmark_models.py --modes color yolo unet  # specific detectors
python benchmark_models.py --export html  # also write model-eval-dashboard.html
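The flag surface above maps naturally onto argparse. This is a minimal sketch of that CLI, not the harness's actual parser: the flag names come from the commands shown, but the defaults and help text here are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Sketch of the benchmark CLI (hypothetical internals)."""
    p = argparse.ArgumentParser(description="APEX detector benchmark")
    p.add_argument("--quick", action="store_true",
                   help="run 1 lap instead of the full 2-lap course")
    p.add_argument("--modes", nargs="+",
                   default=["color", "unet", "yolo11n", "rfdetr-nano"],
                   help="subset of detectors to benchmark")
    p.add_argument("--export", choices=["html"],
                   help="also write model-eval-dashboard.html")
    return p.parse_args(argv)

args = parse_args(["--quick", "--modes", "color", "yolo"])
print(args.quick, args.modes)  # True ['color', 'yolo']
```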

§ 02 What it measures

  Metric               Meaning                                   Good value
  ─────────────────────────────────────────────────────────────────────────
  Gates passed         Gates cleared in order during the run     All
  Total time           Wall-clock seconds to finish              <60 s competitive
  Avg gate time        Time between consecutive gate passages    <3 s
  Detection rate       % of frames with ≥1 gate detected         >80%
  Vision latency       Per-frame inference ms (median)           <10 ms VQ1 · <5 ms VQ2
  Vision FPS           1 / vision_latency                        >100 Hz
  False-positive rate  Detections on non-gate frames             <1%
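The vision metrics in the table reduce to simple aggregates over per-frame records. A minimal sketch, assuming each frame yields a latency, a detection count, and a gate-in-view flag (this record shape is an assumption, not the harness's schema):

```python
from statistics import median

def vision_metrics(frames):
    """Compute the table's vision metrics from per-frame records (sketch).

    Each record is (latency_ms, n_detections, gate_in_view) — assumed shape.
    """
    latency = median(f[0] for f in frames)                     # Vision latency (median ms)
    det_rate = sum(f[1] >= 1 for f in frames) / len(frames)    # Detection rate
    # False positives: detections on frames with no gate in view
    non_gate = [f for f in frames if not f[2]]
    fp_rate = sum(f[1] >= 1 for f in non_gate) / len(non_gate) if non_gate else 0.0
    fps = 1000.0 / latency if latency else float("inf")        # Vision FPS = 1 / latency
    return {"latency_ms": latency, "det_rate": det_rate,
            "fp_rate": fp_rate, "fps": fps}

frames = [(5.0, 1, True), (5.2, 1, True), (4.8, 0, True), (5.0, 0, False)]
m = vision_metrics(frames)
# → {'latency_ms': 5.0, 'det_rate': 0.5, 'fp_rate': 0.0, 'fps': 200.0}
```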

§ 03 Example output

  Mode       Gates     Time   Avg Gate   Vision  Det %    FPS   FP %
  ─────────────────────────────────────────────────────────────────
  color         22    45.2s      2.05s     0.3ms   92%  3333   2.1
  unet          22    42.8s      1.95s     4.8ms   96%   208   0.4
  yolo11n       22    44.1s      2.01s     5.2ms   99%   192   0.2
  rfdetr-nano   22    41.6s      1.89s     2.3ms  100%   435   0.1

  RECOMMENDED: rfdetr-nano
  Set vision.mode: "rfdetr_nano" in race_config.py

The benchmark recommends the best mode based on gate-pass count (primary) and total time (tie-break). Results append to benchmark_results.json so detector regressions stay visible over time.
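The selection rule and the append-only history can be sketched in a few lines. Field names and file layout here are assumptions; only the ranking logic (gates first, time as tie-break) and the filename come from the text above:

```python
import json
import time
from pathlib import Path

def recommend(results):
    """Best mode: most gates passed, then fastest total time (sketch)."""
    return max(results, key=lambda r: (r["gates"], -r["total_time_s"]))["mode"]

def append_history(results, path="benchmark_results.json"):
    """Append this run to the history file so trends stay trackable (sketch)."""
    p = Path(path)
    history = json.loads(p.read_text()) if p.exists() else []
    history.append({"timestamp": time.time(), "results": results})
    p.write_text(json.dumps(history, indent=2))

results = [
    {"mode": "color",       "gates": 22, "total_time_s": 45.2},
    {"mode": "rfdetr-nano", "gates": 22, "total_time_s": 41.6},
]
print(recommend(results))  # rfdetr-nano (same gates, faster time wins)
```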

§ 04 When to run

  Situation                                     What to run
  ─────────────────────────────────────────────────────────────────────────
  After training a new detector                 benchmark_models.py --modes new_detector vs current champion
  Before every submission                       Full benchmark; capture benchmark_results.json
  Dataset growth milestone (every +50K frames)  Full benchmark; check for regression
  After a code change outside vision            Quick benchmark; confirm no stealth breakage

§ 05 Integration with data-capture loop

Benchmark runs on SimDrone today. Once the AIGP sim drops (May 2026), the same harness points at VQ1-sim proxy frames. Every captured run becomes a benchmark point:

python benchmark_models.py \
  --frames recordings/vq1_captured/frames \
  --telemetry recordings/vq1_captured/telemetry.jsonl
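Offline replay of a captured run amounts to pairing each saved frame with its telemetry line. A sketch under assumptions: frame files sort by name into timestamp order, and the telemetry JSONL is written in lockstep, one record per frame (neither is confirmed by the harness):

```python
import json
from pathlib import Path

def iter_run(frames_dir, telemetry_path):
    """Yield (frame_path, telemetry_record) pairs for a captured run (sketch)."""
    frames = sorted(Path(frames_dir).glob("*.png"))  # assumes name order == time order
    with open(telemetry_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    # Assumes one telemetry record per frame, written in lockstep
    yield from zip(frames, records)

# for frame, telem in iter_run("recordings/vq1_captured/frames",
#                              "recordings/vq1_captured/telemetry.jsonl"):
#     detections = detector.infer(frame)   # hypothetical detector API
```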

This closes the loop: detector improves → benchmark confirms → capture more frames → detector improves. See playbook §03.

MODEL-EVAL · v2.0 2026-04-21 · ← Index · Vision