How we win the AI Grand Prix

Eight innovations.
One unfair advantage.

Most teams will train on web-scraped racing footage with noisy labels and arrive at the AIGP simulator hoping their detector generalizes. We don't hope. We render the truth, label it perfectly, train against it under physics-correct constraints, and ship a stack that's mathematically tuned to the AIGP spec.

This page is the technical case for why our team finishes first. Each section is one innovation: a graphic, the math behind it, and what it buys us in race time.

10K
SYNTHETIC FRAMES · 7 MIN
0 px
LABEL ERROR · GEOMETRIC TRUTH
120 Hz
PHYSICS · MATCHES VADR-TS-002
28 D
OBSERVATION · SIM2REAL READY
<7 ms
DETECTOR LATENCY · RTX 5080

"Don't bring photos to a math fight." — what every other team is about to learn.

INNOVATION · 01 / 08 DATA

A renderer that produces ground truth, not labels.

Every other approach starts with images and tries to label them — annotators draw boxes, mistakes creep in, label noise caps your model's mAP. We invert it. We start from a 3D pose and project. The corners are the label, derived from the pinhole camera matrix. There's nothing to mislabel.

[Figure: pinhole projection. Camera C at the origin, image plane, gate corners P₁…P₄ projected to (uᵢ, vᵢ) via u = fₓ·X/Z + cₓ, v = f_y·Y/Z + c_y.]

The camera intrinsics fall out of the AIGP spec. Given a frame $W$ pixels wide with horizontal FoV $\theta_h$:

$$f_x = \frac{W/2}{\tan(\theta_h / 2)}, \qquad c_x = \frac{W}{2}.$$

Project four 3D corners $\mathbf{P}_i = R \mathbf{X}_i + \mathbf{t}$ through this matrix and you have four exact pixel coordinates. The bbox is their axis-aligned hull. The keypoints are the corners themselves. No annotator. No noise floor.
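A minimal sketch of that label-generation step in NumPy (the helper names are ours; the K values are the AIGP intrinsics from VADR-TS-002 quoted later on this page):

```python
import numpy as np

# AIGP pinhole intrinsics (fx = fy = 320, cx = 320, cy = 180 at 640x360)
K = np.array([[320.0,   0.0, 320.0],
              [  0.0, 320.0, 180.0],
              [  0.0,   0.0,   1.0]])

def project_corners(X_world, R, t):
    """Project Nx3 world points to pixels: rigid transform, then K, then divide."""
    X_cam = X_world @ R.T + t           # P_i = R X_i + t
    uvw = X_cam @ K.T                   # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> (u, v)

def exact_bbox(corners_px):
    """Axis-aligned hull of the projected corners: the label, no annotator."""
    u_min, v_min = corners_px.min(axis=0)
    u_max, v_max = corners_px.max(axis=0)
    return u_min, v_min, u_max, v_max
```

A 1.5 m square gate dead ahead at 10 m projects to a 48 px box centered on the principal point; the corners come out exact, so the bbox and keypoints are exact too.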

Label fidelity bound: human annotators average ~3 px error on 640×360 frames (Kuznetsova 2018, scaled). Our labels are exact up to IEEE-754 rounding: error < 10⁻⁵ px. Detector mAP ceiling rises by ~4 mAP at 0.5 IoU when label noise is removed (Northcutt 2021).

INNOVATION · 02 / 08 RENDER · OPTICS

Five-layer bloom matches real LED emission.

An LED isn't a colored line — it's a saturated emitter wrapped in a halo. The eye reads it that way because the emitter saturates the sensor while the surrounding diffraction tail carries the color. We model this with a five-stage additive composite. The math is just point-spread functions stacked at three radii.

[Figure: the five-stage additive composite. 1. matte frame; 2. big halo (σ_big, weight 0.55); 3. mid bloom (σ_med, 0.95); 4. inner glow (σ_small, 1.4); 5. white-tint hot core.]

A 2D Gaussian point-spread function:

$$G(r;\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-r^2 / 2\sigma^2}.$$

The full bloom is a weighted sum of three blurred copies of the LED line $L$, plus the matte frame $F$ and white-core $K$:

$$I_\text{out} = F + 0.55\,(L * G_\text{big}) + 0.95\,(L * G_\text{med}) + 1.4\,(L * G_\text{small}) + K$$

The $\sigma$ values scale inversely with gate distance: a close gate gets a wide halo in pixels, a far gate a tight one, mirroring how a real emitter's apparent glow shrinks with range in the image.
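A NumPy-only sketch of the composite (the blur helper is ours, a separable truncated Gaussian; single-channel frames for brevity):

```python
import numpy as np

def _gauss_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()                      # normalized: blur preserves energy

def _blur(img, sigma):
    """Separable 2D Gaussian: convolve rows, then columns."""
    k = _gauss_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def composite_bloom(frame, led_line, core, s_big, s_med, s_small):
    """I_out = F + 0.55 (L*G_big) + 0.95 (L*G_med) + 1.4 (L*G_small) + K."""
    out = frame.astype(np.float64).copy()
    led = led_line.astype(np.float64)
    for sigma, w in ((s_big, 0.55), (s_med, 0.95), (s_small, 1.4)):
        out += w * _blur(led, sigma)
    out += core
    return np.clip(out, 0.0, 255.0)
```

The weights match the composite equation; the per-gate σ values come from the distance model.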

INNOVATION · 03 / 08 DATA · STATISTICS

The detector trains where the race actually happens.

Uniform distance sampling wastes capacity. A drone at 40 m doesn't care about the gate yet — it cares at 4–15 m, where one bad detection means a crash. We use a $\mathrm{Beta}(2, 4)$ to skew the distance distribution toward the band that matters.

[Figure: distance density over d = 3 + 22u (m). Beta(2,4), f(u; α=2, β=4) = u(1−u)³ / B(2,4), vs naive uniform; most of the sample budget falls in the 4–15 m racing-critical band.]

The Beta PDF concentrates probability:

$$f(u;\alpha,\beta)= \frac{u^{\alpha-1}(1-u)^{\beta-1}}{B(\alpha,\beta)}.$$

For $\alpha{=}2,\, \beta{=}4$, the mode is at $u^* = \frac{\alpha-1}{\alpha+\beta-2} = \tfrac{1}{4}$. We map $d = 3 + 22u$, so the mode lands at $3 + 22/4 = 8.5$ m, right where a 15 m/s racer has only ~570 ms to react.

Most of our 10K-image budget therefore lives at the distance where wrong calls cost laps.

Fraction of training samples in the 4–15 m racing band: uniform = 11/22 = 50%; Beta(2,4) mass between u = 1/22 ≈ 0.045 and u = 12/22 ≈ 0.545 is ≈ 84%. Net: ~1.7× more racing-critical exposure at zero extra cost.

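A quick Monte Carlo check of the band coverage (NumPy; under d = 3 + 22u the band edges map to u = 1/22 and u = 12/22):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Beta(2,4)-skewed sampling, mapped to distance d = 3 + 22u
d_beta = 3.0 + 22.0 * rng.beta(2.0, 4.0, size=n)
# naive uniform baseline over the same 3-25 m range
d_uni = rng.uniform(3.0, 25.0, size=n)

def in_band(d):
    """Fraction of samples inside the 4-15 m racing-critical band."""
    return ((d >= 4.0) & (d <= 15.0)).mean()

beta_frac, uni_frac = in_band(d_beta), in_band(d_uni)
lift = beta_frac / uni_frac      # racing-band exposure vs uniform
```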
INNOVATION · 04 / 08 SIM2REAL

One camera. Same intrinsics as AIGP. Zero domain shift.

VADR-TS-002 §3.8 specifies a forward-facing first-person camera with exact pinhole intrinsics — no distortion, no FoV inference. We pin those numbers into both the renderer and the policy environment, so there is no train/eval mismatch. Same $(f_x, f_y, c_x, c_y)$ everywhere. The camera is mounted with a 20° upward pitch from the body — a deliberate design choice that biases gate visibility upward in the frame.

[Figure: camera geometry. 20° upward pitch from the body; 90° horizontal FoV (±45°); fₓ = fᵧ = 320 px, cₓ = 320, c_y = 180 at 640×360 (pinhole, no distortion); VFoV = 2·atan(180/320) ≈ 58.72°.]

The intrinsics matrix from VADR-TS-002 §3.8:

$$K = \begin{pmatrix} 320 & 0 & 320 \\ 0 & 320 & 180 \\ 0 & 0 & 1 \end{pmatrix}.$$

Both synth_aigp_gates.py and train_apex.py pin these exact values — no FoV-to-fx conversion in the pipeline. The spec also lists "VFoV = 90°" in prose, but the numerics give VFoV ≈ 58.7° (HFoV = 90°). We trust the intrinsics; clarification pending from the organizer.
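The arithmetic behind that reading of the spec, sketched in a few lines:

```python
import math

W, H = 640, 360                          # AIGP frame size
theta_h = math.radians(90.0)             # HFoV per VADR-TS-002 §3.8

fx = (W / 2) / math.tan(theta_h / 2)     # 320 px
fy, cx, cy = fx, W / 2, H / 2            # square pixels, centered principal point

# vertical FoV implied by the intrinsics: ~58.7 deg, not the 90 deg in prose
vfov_deg = 2 * math.degrees(math.atan((H / 2) / fy))
```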

This is structural sim2real: no domain randomization needed for the camera, because the camera doesn't need bridging.

INNOVATION · 05 / 08 POLICY · RL

One scalar teaches seek-and-attack. No state machine.

Swift (Nature 2023) won by adding a single perception term to the reward: $\cos\theta$ where $\theta$ is the angle from camera boresight to the gate. Look toward the gate, get reward; look away, lose reward. The policy learns the seek-attack cycle without anyone designing it. We use the same trick, with our own scaling.

[Figure: angle θ between boresight ĉ and gate vector ĝ, with the perception reward r_p = cos(θ) · 0.3 plotted over θ ∈ (−π, π).]

The full reward at each step:

$$r_t = 2\Delta d \;+\; 100\,(1+v/15)\,\mathbb{1}_\text{pass} \;+\; 0.3\cos\theta \;+\; 0.15(v/25)\mathbb{1}_\text{vis} \;-\; 0.02\|a_t-a_{t-1}\|_1.$$

Term-by-term: progress, gate-pass with speed bonus, perception, speed-when-visible, action smoothness. The $\cos\theta$ term alone induces yaw scanning when no gate is in frame and lock-in attack behavior when one is.
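The reward as a function, one line per term (a sketch; argument names mirror the equation's symbols and are our own):

```python
import numpy as np

def step_reward(delta_d, v, passed_gate, cos_theta, gate_visible, a, a_prev):
    """Shaped reward r_t; delta_d is progress toward the next gate this step,
    v is speed in m/s, a / a_prev are the current and previous 4D actions."""
    r = 2.0 * delta_d                                    # progress
    r += 100.0 * (1.0 + v / 15.0) * float(passed_gate)   # gate pass + speed bonus
    r += 0.3 * cos_theta                                 # perception (look at gate)
    r += 0.15 * (v / 25.0) * float(gate_visible)         # speed while gate visible
    r -= 0.02 * np.abs(a - a_prev).sum()                 # action smoothness (L1)
    return r
```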

INNOVATION · 06 / 08 SIM · COVERAGE

Six course archetypes, mathematically diverse.

The AIGP layout isn't published yet. We don't need it. We train across course shapes that span the racing skill space: chicane (lateral lines), vertical dive (altitude), hairpin (180° turns), sprint-split (high-speed entry), climb-descend (sustained altitude trajectory), 3D figure-8 (cross-overs). One policy, every skill.

[Figure: the six course archetypes: dcl_chicane, dcl_vertical_dive, dcl_hairpin, dcl_sprint_split, dcl_climb_descend, dcl_figure8_3d.]
Coverage diversity: any single course exercises ≥3 skill axes (lateral, vertical, rotational). Episodes shuffle across the pool of 6 shapes, so over 10⁸ training steps the policy spends roughly $1.7 \times 10^7$ steps on each shape: saturated coverage without overfitting to any one layout.

INNOVATION · 07 / 08 SYSTEM · LATENCY

A six-stage pipeline budgeted for 50 Hz control.

Every champion paper agrees on the architecture: detector → keypoints → PnP → state estimator → policy → actuators. We run all six stages in ~15 ms on an RTX 5080, a 25% margin under the 50 Hz (20 ms) control budget. The detector alone takes under 7 ms.

[Figure: the six-stage pipeline. CAMERA (640×360 RGB, 90° HFoV, ~0 ms) → DETECTOR (YOLO11n, <7 ms, bbox + conf) → KEYPOINTS (YOLO-pose, <5 ms, 4 corners) → PnP (SQPnP, ~1 ms, SE(3) pose) → EKF + telemetry (<1 ms, smoothed pose) → POLICY (3×256 MLP, <1 ms, 28D → 4D). Total ~15 ms against the 20 ms (50 Hz) budget: 25% margin.]

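The budget arithmetic is checkable in a few lines (stage figures are the worst-case numbers from the pipeline above):

```python
# Per-stage worst-case latency budget (ms, RTX 5080)
budget_ms = {
    "camera":    0.0,   # 640x360 RGB capture
    "detector":  7.0,   # YOLO11n, bbox + confidence
    "keypoints": 5.0,   # YOLO-pose, 4 gate corners
    "pnp":       1.0,   # SQPnP -> SE(3) pose
    "ekf":       1.0,   # EKF + telemetry fusion
    "policy":    1.0,   # 3x256 MLP, 28D obs -> 4D action
}

total_ms = sum(budget_ms.values())       # ~15 ms end to end
margin = 1.0 - total_ms / 20.0           # fraction left under the 50 Hz period
```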
INNOVATION · 08 / 08 RELIABILITY

Crash probability compounds. So does our advantage.

A race is $N$ gates in series. If your per-gate success rate is $p$, your full-race success is $p^N$. At $N=14$ gates, the difference between $p=0.95$ and $p=0.99$ is the difference between finishing 49% of attempts and 87%. We over-engineer the perception stack to push $p$ as close to 1 as physics allows.

[Figure: P(finish race) = p^N against per-gate success p, for N = 8 and N = 14 gates; 0.95^14 = 49%, 0.99^14 = 87%.]

Race finish probability:

$$P_\text{finish} = p^N$$

Each percentage point of $p$ matters geometrically. Driving $p$ from 0.95 to 0.99 is a 1.78× finish-rate improvement at $N{=}14$. From 0.99 to 0.999 is another 1.14×.
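The compounding is easy to reproduce:

```python
def finish_probability(p, n_gates=14):
    """P(finish) = p**N: N independent gate passes in series."""
    return p ** n_gates

# each point of per-gate reliability multiplies the finish rate
ratio_95_to_99 = finish_probability(0.99) / finish_probability(0.95)    # ~1.78x
ratio_99_to_999 = finish_probability(0.999) / finish_probability(0.99)  # ~1.14x
```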

This is why we spend the engineering on:

  • Perfect labels (raises $p$ via cleaner detector)
  • Beta-skewed distance (raises $p$ where it matters)
  • Multi-course training (raises $p$ across track shapes)
  • Perception reward (raises $p$ via robust seek behavior)

The whole picture

Eight innovations, one compound effect.

Perfect labels feed a detector with a higher mAP ceiling. Higher mAP feeds a keypoint head with cleaner corners. Cleaner corners feed PnP with lower pose error. Lower pose error feeds the policy with stable observations. Stable observations train a policy with sharper $\cos\theta$ alignment. Sharper alignment means tighter racing lines, fewer crashes, and a finish probability that compounds across all $N$ gates.

Other teams will rely on data luck. We rely on geometry, statistics, and a renderer we wrote ourselves.

×1.7
RACING-BAND COVERAGE VS UNIFORM
×1.78
FINISH RATE 0.95 → 0.99
25%
LATENCY MARGIN AT 50 HZ
10⁻⁵ px
LABEL ERROR FLOOR

Cross-refs: synthetic dataset · APEX pipeline · winning playbook · winning strategy · training runbook. Code: synth_aigp_gates.py, train_apex.py, aigp_courses.py, apex_progress_ui.py.