Evaluation harness

Benchmark every detector against the same course.

An apples-to-apples comparison of the APEX Phase 1 detectors (YOLO11n, RF-DETR, U-Net, Color) on the same SimDrone course. Outputs gate-pass count, lap time, per-frame latency, detection rate, and false-positive rate. All results are recorded to benchmark_results.json for trend tracking.

Harness    benchmark_models.py · one command
Duration   Full: ~2 min · Quick: ~1 min (SimDrone proxy)
Output     benchmark_results.json (append-only history)
Target     mAP@50 > 99% · <5 ms GPU (VQ2-ready)

§ 01 Run the benchmark

python benchmark_models.py                # full (2 laps)
python benchmark_models.py --quick        # 1 lap
python benchmark_models.py --modes color yolo unet  # specific detectors
python benchmark_models.py --export html  # also write model-eval-dashboard.html
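The flag surface above maps naturally onto argparse. This is a minimal sketch of that CLI, not the harness's actual parser: the flag names come from the commands shown, but the defaults and help text here are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Sketch of the benchmark CLI (hypothetical internals)."""
    p = argparse.ArgumentParser(description="APEX detector benchmark")
    p.add_argument("--quick", action="store_true",
                   help="run 1 lap instead of the full 2-lap course")
    p.add_argument("--modes", nargs="+",
                   default=["color", "unet", "yolo11n", "rfdetr-nano"],
                   help="subset of detectors to benchmark")
    p.add_argument("--export", choices=["html"],
                   help="also write model-eval-dashboard.html")
    return p.parse_args(argv)

args = parse_args(["--quick", "--modes", "color", "yolo"])
print(args.quick, args.modes)  # True ['color', 'yolo']
```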

§ 02 What it measures

  Metric               Meaning                                   Good value
  ─────────────────────────────────────────────────────────────────────────
  Gates passed         Gates cleared in order during the run     All
  Total time           Wall-clock seconds to finish              <60 s competitive
  Avg gate time        Time between consecutive gate passages    <3 s
  Detection rate       % of frames with ≥1 gate detected         >80%
  Vision latency       Per-frame inference ms (median)           <10 ms VQ1 · <5 ms VQ2
  Vision FPS           1 / vision_latency                        >100 Hz
  False-positive rate  Detections on non-gate frames             <1%
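The vision metrics in the table reduce to simple aggregates over per-frame records. A minimal sketch, assuming each frame yields a latency, a detection count, and a gate-in-view flag (this record shape is an assumption, not the harness's schema):

```python
from statistics import median

def vision_metrics(frames):
    """Compute the table's vision metrics from per-frame records (sketch).

    Each record is (latency_ms, n_detections, gate_in_view) — assumed shape.
    """
    latency = median(f[0] for f in frames)                     # Vision latency (median ms)
    det_rate = sum(f[1] >= 1 for f in frames) / len(frames)    # Detection rate
    # False positives: detections on frames with no gate in view
    non_gate = [f for f in frames if not f[2]]
    fp_rate = sum(f[1] >= 1 for f in non_gate) / len(non_gate) if non_gate else 0.0
    fps = 1000.0 / latency if latency else float("inf")        # Vision FPS = 1 / latency
    return {"latency_ms": latency, "det_rate": det_rate,
            "fp_rate": fp_rate, "fps": fps}

frames = [(5.0, 1, True), (5.2, 1, True), (4.8, 0, True), (5.0, 0, False)]
m = vision_metrics(frames)
# → {'latency_ms': 5.0, 'det_rate': 0.5, 'fp_rate': 0.0, 'fps': 200.0}
```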

§ 03 Example output

  Mode       Gates     Time   Avg Gate   Vision  Det %    FPS   FP %
  ─────────────────────────────────────────────────────────────────
  color         22    45.2s      2.05s     0.3ms   92%  3333   2.1
  unet          22    42.8s      1.95s     4.8ms   96%   208   0.4
  yolo11n       22    44.1s      2.01s     5.2ms   99%   192   0.2
  rfdetr-nano   22    41.6s      1.89s     2.3ms  100%   435   0.1

  RECOMMENDED: rfdetr-nano
  Set vision.mode: "rfdetr_nano" in race_config.py

The benchmark recommends the best mode based on gate-pass count (primary) and total time (tie-break). Results append to benchmark_results.json so detector regressions stay visible over time.
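The selection rule and the append-only history can be sketched in a few lines. Field names and file layout here are assumptions; only the ranking logic (gates first, time as tie-break) and the filename come from the text above:

```python
import json
import time
from pathlib import Path

def recommend(results):
    """Best mode: most gates passed, then fastest total time (sketch)."""
    return max(results, key=lambda r: (r["gates"], -r["total_time_s"]))["mode"]

def append_history(results, path="benchmark_results.json"):
    """Append this run to the history file so trends stay trackable (sketch)."""
    p = Path(path)
    history = json.loads(p.read_text()) if p.exists() else []
    history.append({"timestamp": time.time(), "results": results})
    p.write_text(json.dumps(history, indent=2))

results = [
    {"mode": "color",       "gates": 22, "total_time_s": 45.2},
    {"mode": "rfdetr-nano", "gates": 22, "total_time_s": 41.6},
]
print(recommend(results))  # rfdetr-nano (same gates, faster time wins)
```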

§ 04 When to run

  Situation                                     What to run
  ─────────────────────────────────────────────────────────────────────────
  After training a new detector                 benchmark_models.py --modes new_detector vs current champion
  Before every submission                       Full benchmark; capture benchmark_results.json
  Dataset growth milestone (every +50K frames)  Full benchmark; check for regression
  After a code change outside vision            Quick benchmark; confirm no stealth breakage

§ 05 Integration with data-capture loop

Benchmark runs on SimDrone today. Once the AIGP sim drops (May 2026), the same harness points at VQ1-sim proxy frames. Every captured run becomes a benchmark point:

python benchmark_models.py \
  --frames recordings/vq1_captured/frames \
  --telemetry recordings/vq1_captured/telemetry.jsonl
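Offline replay of a captured run amounts to pairing each saved frame with its telemetry line. A sketch under assumptions: frame files sort by name into timestamp order, and the telemetry JSONL is written in lockstep, one record per frame (neither is confirmed by the harness):

```python
import json
from pathlib import Path

def iter_run(frames_dir, telemetry_path):
    """Yield (frame_path, telemetry_record) pairs for a captured run (sketch)."""
    frames = sorted(Path(frames_dir).glob("*.png"))  # assumes name order == time order
    with open(telemetry_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    # Assumes one telemetry record per frame, written in lockstep
    yield from zip(frames, records)

# for frame, telem in iter_run("recordings/vq1_captured/frames",
#                              "recordings/vq1_captured/telemetry.jsonl"):
#     detections = detector.infer(frame)   # hypothetical detector API
```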

This closes the loop: detector improves → benchmark confirms → capture more frames → detector improves. See playbook §03.

MODEL-EVAL · v2.0 2026-04-21 · ← Index · Vision