Playbook · Master Strategy

How we win the AI Grand Prix.

The opinionated, consolidated strategy. Every other document flows from the claims made here. Written for a team with finite hours, optimizing for a finalist slot at Ohio (November). The win isn't VQ2 lap times. The win is reliable sim-to-real transfer and a data pipeline the other 1,000 teams won't build.

Goal: Finalist slot in Ohio (Nov) · $500K + Anduril offer
Theme: Reliability × sim-to-real > raw speed · not a lap-time race
Moat: Data pipeline + observation shape · nobody else will build this
Horizon: Apr 2026 → Nov 2026 Final · 7 months

§ 01 What actually wins

The AI Grand Prix looks like a four-stage race: VQ1 → VQ2 → Physical → Final. It isn't. It's a filter with a prize at the end:

| Stage | Function | Decides | Effort budget |
|---|---|---|---|
| VQ1 | Filter · completion pass/fail | Who moves to VQ2 | 10% |
| VQ2 | Filter · fastest valid time | Who moves to Physical | 25% |
| Physical (Sep, CA) | Filter · controlled real-world | Who moves to Final | 30% |
| Final (Nov, Ohio) | Race · real drones + audience | Prize pool + job offers | 20% |
| Reserve | Unexpected · rule changes · bugs | Survives the year | 15% |
Unintuitive claim: winning VQ2 is worth less than consistently passing VQ2 with a pipeline that transfers to real hardware. Teams that over-optimize VQ2 laptimes will walk into Physical with a stack they can't re-tune in a week. We don't.

§ 02 Reliability math > speed math

Each VQ2 crash costs roughly 15 minutes: the failed 8-minute run + restart + warm-up + anti-cheat rehandshake. In expected-time terms:

E[attempt_time] = p(clear) · best_time + p(crash) · 15 min

A 45 s pipeline with a 5% crash rate → E = 0.95·45 + 0.05·900 = 87.75 s equivalent
A 50 s pipeline with a 0.5% crash rate → E = 0.995·50 + 0.005·900 = 54.25 s equivalent

The slower pipeline wins by ~33.5 s per attempt.
Target: <1% crash rate at ~80% of peak feasible speed. Chase reliability until the curve plateaus, then chase speed.

To operationalize this: track crash rate per configuration and compare configurations by expected attempt time, never by best lap.
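The expected-time comparison is a one-liner worth keeping in the analysis scripts. A minimal sketch; the function and parameter names are ours, not from the pipeline:

```python
def expected_attempt_seconds(best_time_s: float, crash_rate: float,
                             crash_cost_s: float = 900.0) -> float:
    """E[attempt] = p(clear) * best_time + p(crash) * crash cost (15 min = 900 s)."""
    return (1.0 - crash_rate) * best_time_s + crash_rate * crash_cost_s

# The two pipelines from the example above:
fast_but_fragile = expected_attempt_seconds(45.0, 0.05)    # ~87.75 s
slow_but_solid = expected_attempt_seconds(50.0, 0.005)     # ~54.25 s
```

Run this for every candidate configuration before deciding which one is actually faster.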

§ 03 The data pipeline is the moat

Unlimited attempts × 8-min runs × 10 fps camera = 4,800 frames per attempt. By the VQ2 deadline, a team that captures and auto-labels every attempt will have 500K+ in-distribution training frames. A team that doesn't will still be training on the initial 2,000 synthetic frames.

The plumbing that unlocks this:

  1. Record everything. Frame + telemetry + detection + our controller output + outcome, every run, every attempt. Zstd-compressed, ~500 MB per attempt.
  2. Auto-label with the current detector. Run Phase-1 YOLO on every captured frame overnight. Keep detections with conf > 0.7 as labels.
  3. Human-correct the 5% edge cases. Hard negatives (false positives), low-confidence hits, frames where the pipeline crashed. 10 min a day, someone on the team.
  4. Retrain weekly. Fine-tune detector on the growing dataset. Track mAP@50 on a held-out validation slice every week.
  5. Feed the improved detector back into Phase 3 PPO training. Observation quality improves with the detector; policy gets stronger in lockstep.
This only works if Step 1 is done on day one. Capturing frames from a VQ1 attempt costs nothing. Going back and re-recording later costs you the attempt budget. Instrument the pilot before the first submission, not after.
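Steps 2 and 3 reduce to a triage over raw detector output. A sketch under assumptions: the record layout, field names, and the 0.7 threshold mirror the list above but the code itself is illustrative, not the actual pipeline:

```python
def triage_detections(detections, conf_keep=0.7):
    """Split raw detector hits into auto-labels (Step 2) and a human-review
    queue (Step 3). High-confidence hits become training labels as-is;
    everything else waits for the 10-minute daily correction pass."""
    auto_labels, review_queue = [], []
    for det in detections:
        if det["conf"] > conf_keep:
            auto_labels.append(det)
        else:
            review_queue.append(det)
    return auto_labels, review_queue
```

The overnight job runs this over every captured frame, so the human only ever sees the ambiguous slice.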

§ 04 Sim-to-real transfer is the hidden gate

Between VQ2 close (~late July) and Physical Qualifier (early September) is a ~6-week window. Most teams will relax here — bask in qualifier rankings, maybe tune one or two params. This is when the Physical is actually won.

What happens in the September gap

| Week | Task | Deliverable |
|---|---|---|
| W1–2 | Capture ~30 min of real-flight telemetry on the DIY 5-inch rig | Real-drone dataset · indoor netted space |
| W2–3 | Fit residual-dynamics model (Swift method): delta between sim predictions and real outcomes → small MLP correction | sim2real_residual.pt |
| W3–4 | Validate PPO policy with residual model inserted; re-tune PID for real IMU latency | VQ2 policy on DIY drone passes 90%+ gate clears |
| W4–5 | Adversarial lighting tests (outdoor, dusk, shadow); augment detector training set with real hard cases | Detector mAP >95% on real-world eval slice |
| W5–6 | Crew procedure drills: 10-min-window drone restart, battery swap, crash recovery, re-upload weights | <90 s from crash to re-arm |
Budget: 30 minutes of real flight data — not 30 hours. Swift beat human champions with 50 seconds of real data; MonoRace used 5 minutes. The quantity matters less than the residual-dynamics model that consumes it.

Why modular architecture was the right bet for this

Our stack is detect → keypoints → PnP → controller. Every boundary is swappable. At the sim-to-real step we keep the detector + keypoints (they transfer well because game-engine texture is already in-distribution for real-world-trained backbones), we update PnP intrinsics to the real camera, and we insert the residual-dynamics model in front of the controller. That's three files touched, not a full retrain. End-to-end pixel-to-motor policies don't have these seams — they retrain or they lose.
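The residual-dynamics seam can be sketched as a thin wrapper: the nominal simulator predicts the next state, the learned correction (fit on real-minus-sim deltas) is added on top. The callables and class name here are placeholders, not our actual interfaces:

```python
class ResidualDynamics:
    """Wrap the nominal simulator with a learned correction fit on
    (real outcome - sim prediction) deltas from real-flight telemetry."""

    def __init__(self, sim_step, residual):
        self.sim_step = sim_step    # (state, action) -> predicted next state
        self.residual = residual    # (state, action) -> correction vector

    def step(self, state, action):
        pred = self.sim_step(state, action)
        corr = self.residual(state, action)
        return [p + c for p, c in zip(pred, corr)]
```

The controller never sees the seam: it consumes `step()` exactly as it consumed the raw simulator, which is why transfer touches the dynamics file and not the whole stack.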

§ 05 Using the unlimited-attempt budget

Most teams will treat submissions as hero runs — maybe 5 total, chasing a personal best. That wastes the single biggest experimental asset we have. Our plan:

| Attempt type | Share | Purpose |
|---|---|---|
| Experimental | ~80% | Identical conditions × 10+ repeats per parameter change. Statistical confidence before declaring a win. Log every frame. |
| Ranking | ~15% | Best-known configuration, clean environment, submitted for scoring. Run sparingly; scoreboard entries invite other teams to study us. |
| Chaos | ~5% | Deliberate edge cases: lights off, max-speed cruise, induced detection drop. Catch failure modes before VQ2 does. |
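"Statistical confidence before declaring a win" can be mechanized for the experimental bucket. This is a crude noise-margin check, not a proper significance test, and the names are ours:

```python
import statistics

def is_improvement(baseline_s, candidate_s, min_margin_s=1.0):
    """Declare a parameter change a win only if the mean attempt time drops
    by more than both a fixed margin and the combined run-to-run noise."""
    gap = statistics.mean(baseline_s) - statistics.mean(candidate_s)
    noise = statistics.stdev(baseline_s) + statistics.stdev(candidate_s)
    return gap > max(min_margin_s, noise)
```

With 10+ repeats per configuration, this rejects the one-lucky-run "improvement" that a hero-run strategy would have shipped.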
Scoreboard discipline. Don't submit your best run until it has been validated at least 20 times offline. The leaderboard reveals your strategy to competitors, so the first scored attempt should be conservative; save the killer configuration for the last submission window.

§ 06 Compute envelope (on-drone)

The Physical Qualifier uses the Neros Archer platform. The compute spec will be published closer to the event, but based on the class (100 TOPS @ ~15 W), assume Jetson Orin NX-class. Design for it now:

| Stage | Budget (ms) | Notes |
|---|---|---|
| Detector (YOLO11n INT8 TRT) | 5 | Already ~5 ms in PyTorch; INT8 TRT halves it |
| Keypoints (YOLO11n-pose) | 3 | Same backbone, pose head |
| PnP (SOLVEPNP_IPPE_SQUARE) | 0.5 | CPU, 4 points, closed-form |
| Target-gate tracker | 0.1 | Python dict lookup |
| Controller (PID or PPO ONNX) | 0.2 | Tiny MLP, CPU |
| Transport (sim/MAVLink) | 1 | UDP local or serial |
| Overhead (logging, telemetry) | 0.2 | Async background |
| **Total** | **10** | 100 FPS control loop |

Target 10 ms gives us 2× headroom before the 20 ms per-frame budget (matches 50 FPS camera). Anything more is wasted on this hardware class.
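The budget arithmetic is worth keeping next to the loop code as a sanity check. Stage names mirror the table; the values are targets, not measurements:

```python
STAGE_BUDGET_MS = {
    "detector": 5.0,      # YOLO11n INT8 TRT
    "keypoints": 3.0,     # pose head on the same backbone
    "pnp": 0.5,           # closed-form, 4 points, CPU
    "gate_tracker": 0.1,
    "controller": 0.2,    # tiny MLP or PID
    "transport": 1.0,
    "overhead": 0.2,
}

FRAME_BUDGET_MS = 20.0                      # 50 FPS camera
total_ms = sum(STAGE_BUDGET_MS.values())    # 10.0 ms -> 100 FPS control loop
headroom = FRAME_BUDGET_MS / total_ms       # 2x before we miss a frame
```

Any per-stage regression shows up as `total_ms` creeping toward the frame budget before it costs a dropped frame on hardware.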

§ 07 Explicit anti-patterns (don't build these)

End-to-end pixel→motor

WON'T TRANSFER

SkyDreamer is beautiful research. It does not transfer sim-to-real with 30 min of flight data. We'd need thousands of real flight hours, which we won't have. Stay modular.

Course-specific memorization

GATES CHANGE

Gates change between VQ1 and VQ2. Start / finish gates may differ. No persistent global map. Anything that over-fits the VQ1 layout breaks in VQ2 or at Physical.

Global vision SLAM

RETIRED 2026-04-17

LingBot-Map / VGGT / DUSt3R can't track features at racing velocities (38°/frame rotation). We proved this at 17m ATE. Archived, not revisited.

Imitation learning from humans

RULE VIOLATION

"No human interaction during runs." Reviewer-visible code that replays human-recorded trajectories will likely be flagged. Pure RL or deterministic control only.

Hero-run attempt strategy

WASTED BUDGET

Saving attempts for "the perfect run" throws away the experimental signal. Use unlimited attempts the way they're meant: statistical testing, continuous improvement.

Privileged-obs PPO → submission

LANDMINE

PPO trained against NED gate positions won't run on the real sim (no absolute positioning). Train with --observation-mode=detector_telemetry. See apex-pipeline.

§ 08 Calendar (from now → Ohio)

- NOW (Apr 21): Detector · pilot ready
- VQ1 OPEN (May): submit completion
- VQ2 OPEN (Jun): train on real sim frames
- VQ2 CUTOFF (~late Jul): final submissions
- PHYSICAL (Sep, CA): 2 wk on-site
- FINAL (Nov, Ohio): $500K
| Window | Primary focus | Deliverable |
|---|---|---|
| NOW → VQ1 open | VQ1 completion pilot ready · telemetry adapter stubbed · sim package installed · data-capture harness | Pass VQ1 on first sim-day run |
| VQ1 → VQ2 open | Fine-tune detector on VQ1-sim frames · observation-swap PPO training starts · SubprocVecEnv fan-out | PPO baseline beats PID on VQ1 course |
| VQ2 open → cutoff | PPO policy tuning · adversarial detector · chaos testing · final submission | Top-30 seed into Physical |
| Cutoff → Physical | Sim-to-real: real flight data capture · residual dynamics · DIY rig validation · crew drills | DIY drone clears gates autonomously |
| Physical Qualifier | On-site tuning · hardware-specific PID · crash recovery · 10-min window discipline | Top-10 into Final |
| Physical → Final | Adversarial lighting · audience-noise perception stress · backup policy versions · restart procedures | Win the Final |

§ 09 Team discipline — what we commit to

§ 10 Scorecard (how we know we're winning)

| Signal | Green | Amber | Red |
|---|---|---|---|
| Detector mAP@50 on VQ1-sim frames | >98% | 95–98% | <95% |
| VQ1 completion rate (attempts) | 100% | 95–99% | <95% |
| VQ2 crash rate | <1% | 1–5% | >5% |
| End-to-end latency on Orin NX | <15 ms | 15–30 ms | >30 ms |
| Captured-frame dataset size by VQ2 | >200K | 50K–200K | <50K |
| Real-flight data captured by Aug | >20 min | 5–20 min | <5 min |
| DIY rig autonomous gate-clear rate | >90% | 70–90% | <70% |
Green across the board by mid-August = we are positioned to win the Physical Qualifier and show up to Ohio as one of the final 10. Amber on more than two = reprioritize. Red on any = stop current work and triage.
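The weekly scorecard check can be mechanized. This sketch handles both directions in the table (higher-is-better like mAP, lower-is-better like crash rate); the function name and calling convention are ours:

```python
def signal_status(value, green_at, red_at):
    """Traffic-light one scorecard signal. Pass green_at > red_at for
    higher-is-better signals (mAP, completion rate); green_at < red_at
    for lower-is-better ones (crash rate, latency)."""
    if green_at > red_at:                       # higher is better
        if value > green_at:
            return "green"
        return "red" if value < red_at else "amber"
    if value < green_at:                        # lower is better
        return "green"
    return "red" if value > red_at else "amber"
```

One call per row of the table each week makes the "amber on more than two" trigger impossible to rationalize away.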
WINNING-PLAYBOOK · v1.0 · 2026-04-21