AI Grand Prix — Vision & Detection Guide

Gate detection, RANSAC corner extraction, PnP depth estimation, and model training

1. Why Vision Matters

The single 12MP FPV camera is the ONLY sensor. There is no GPS, no LiDAR, no depth camera. Every piece of spatial awareness the drone has comes from vision.

The entire perception-to-control loop is a three-step chain:

  1. Detect gate in the camera frame (2D pixel coordinates)
  2. Extract 4 corners at sub-pixel accuracy
  3. PnP solve to recover distance and 6DoF pose relative to the gate

Corner accuracy directly determines depth accuracy. A 2px error at 10m range produces roughly 0.5m of depth error — enough to cause a collision or a missed gate. This is why we use U-Net segmentation with RANSAC corner extraction instead of simple bounding boxes.

Key insight: Bounding box detectors (YOLO) return axis-aligned rectangles, not the true gate edge positions. When those bbox corners are fed to PnP, the resulting depth estimate is systematically biased. U-Net + RANSAC recovers the actual gate edges.

2. Detection Mode Comparison

Three detection backends are available. Selection is controlled by VisionSettings.mode in race_config.py:

| Feature | Color (VQ1) | YOLO (VQ2) | U-Net (Primary) |
|---|---|---|---|
| Method | HSV threshold + contours | Neural network bbox | Pixel segmentation + RANSAC |
| Speed | ~0.5ms | ~12ms | ~5ms (GPU) |
| Corner source | approxPolyDP or bbox | bbox corners (inaccurate) | RANSAC line fit + intersection |
| PnP accuracy | Good (if quad fit succeeds) | Poor (bbox != gate edges) | Best (sub-pixel corners) |
| Partial gates | No | Partial | Yes |
| Training needed | No (just HSV range) | Yes (labeled dataset) | Yes (segmentation masks) |
| Best for | Highlighted gates (VQ1) | Complex backgrounds | Racing (accuracy + speed) |
# race_config.py — VisionSettings
mode: str = "unet"    # "color" (VQ1), "yolo" (VQ2), or "unet" (primary)

Default recommendation: Use "unet" for racing. Fall back to "color" for VQ1 qualifiers where gates are highlighted with a known color. Use "yolo" only when U-Net is unavailable and gates are not highlighted.

3. U-Net Architecture (GateSegNet)

Overview

GateSegNet is a lightweight 4-level encoder-decoder with skip connections. It predicts a binary gate mask at full input resolution.

Input / Output

Input: 640x480 BGR frame. Output: 640x480 single-channel sigmoid mask (per-pixel gate probability).

Inference Latency

Roughly 5ms per frame on GPU (see the mode comparison in Section 2).

Architecture Diagram

[Diagram: 640x480 BGR input → encoder (ConvBlock + Pool per level: 3→32, 32→64, 64→128, 128→256) → bottleneck ConvBlock (256→256) → decoder (Up + ConvBlock: 256→128, 128→64, 64→32) → Conv1x1 + Sigmoid (32→1) → 640x480 mask. Skip connections concatenate encoder feature maps into the matching decoder level.]

ConvBlock Detail

Each ConvBlock consists of two sequential Conv3x3 + BatchNorm + ReLU layers:

import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

Encoder Path

| Level | Channels | Resolution | Operation |
|---|---|---|---|
| 1 | 3 → 32 | 640x480 | ConvBlock + MaxPool2d(2) |
| 2 | 32 → 64 | 320x240 | ConvBlock + MaxPool2d(2) |
| 3 | 64 → 128 | 160x120 | ConvBlock + MaxPool2d(2) |
| 4 | 128 → 256 | 80x60 | ConvBlock + MaxPool2d(2) |
| Bottleneck | 256 → 256 | 40x30 | ConvBlock (no pool) |

Decoder Path

| Level | Channels | Resolution | Operation |
|---|---|---|---|
| 4 | 256+256 → 128 | 80x60 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 3 | 128+128 → 64 | 160x120 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 2 | 64+64 → 32 | 320x240 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 1 | 32+32 → 32 | 640x480 | ConvTranspose2d(2) + skip concat + ConvBlock |
| Output | 32 → 1 | 640x480 | Conv2d(1x1) + Sigmoid |
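
The per-level resolutions in these tables can be sanity-checked with a few lines of arithmetic (each MaxPool2d(2) halves both dimensions; the decoder mirrors the sequence in reverse — helper name is illustrative):

```python
def encoder_resolutions(w=640, h=480, levels=4):
    """Resolution entering each encoder level, plus the bottleneck."""
    res = [(w, h)]
    for _ in range(levels):
        w, h = w // 2, h // 2      # MaxPool2d(2) halves each dimension
        res.append((w, h))
    return res

print(encoder_resolutions())
# [(640, 480), (320, 240), (160, 120), (80, 60), (40, 30)]
```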

4. RANSAC Corner Extraction Algorithm

This is the critical innovation. Standard contour simplification (approxPolyDP) finds polygon vertices on the contour boundary. If the mask is noisy or the contour has bumps, the vertices jump around unpredictably. RANSAC fits lines to clusters of edge points, and intersects adjacent lines for mathematically exact corner positions — robust to outliers.

Step-by-Step Pipeline

Step 1 — Contour extraction. Threshold the U-Net sigmoid mask (scaled to 0-255) at 127 to produce a binary image, then run cv2.findContours(binary, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE). Filter by minimum area (200px) and aspect ratio (0.2 to 5.0).

Step 2 — Edge template. For each valid contour, compute the minimum-area rotated rectangle via cv2.minAreaRect. Extract 4 box corners with cv2.boxPoints, then order them as TL, TR, BR, BL. These define 4 template edge segments: TL→TR, TR→BR, BR→BL, BL→TL.

Step 3 — Point assignment. Each contour point is assigned to its nearest edge using point-to-line-segment distance. This partitions the contour into 4 clusters, one per gate edge.

for p in contour_points:
    best_edge = argmin([point_line_dist(p, edge_a, edge_b)
                        for edge_a, edge_b in edges])
    buckets[best_edge].append(p)
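
A minimal version of the point-to-segment distance the loop relies on (function name here is illustrative, not necessarily the implementation's):

```python
import numpy as np

def point_seg_dist(p, a, b):
    """Distance from 2D point p to the segment from a to b."""
    p, a, b = map(np.asarray, (p, a, b))
    ab = b - a
    # Project p onto the segment, clamping to the endpoints
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

print(point_seg_dist([0.5, 1.0], [0.0, 0.0], [1.0, 0.0]))  # 1.0
```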

Step 4 — RANSAC line fitting (per edge cluster):

  1. Sample 2 random points from the cluster → define a candidate line
  2. Compute perpendicular distance from all cluster points to the candidate
  3. Count inliers within threshold (3px)
  4. Repeat for 50 iterations, keep the line with the most inliers
  5. Refit on inliers using SVD: compute mean of inlier points, then principal direction via np.linalg.svd(inliers - mean)

# RANSAC core loop (simplified)
best_count, best_line = 0, None
for _ in range(50):
    p1, p2 = random_sample(edge_pts, 2)
    direction = normalize(p2 - p1)
    normal = np.array([-direction[1], direction[0]])
    dists = np.abs((pts - p1) @ normal)
    inliers = pts[dists < 3.0]
    if len(inliers) > best_count:
        best_count = len(inliers)
        mean = inliers.mean(axis=0)
        _, _, vt = np.linalg.svd(inliers - mean)
        best_line = (mean, vt[0])  # point + direction

Step 5 — Line intersection. Adjacent RANSAC-fitted lines are intersected analytically to produce 4 sub-pixel corner points. The intersection of lines L_i and L_(i+1) gives corner C_((i+1) mod 4).

# Line intersection: p1 + t*d1 = p2 + s*d2
det = d1[0]*(-d2[1]) - d1[1]*(-d2[0])
if abs(det) < 1e-8: return None  # parallel
dp = p2 - p1
t = (-d2[1]*dp[0] + d2[0]*dp[1]) / det
corner = p1 + t * d1
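
Wrapped as a self-contained function (name illustrative), the same math is easy to unit-test:

```python
import numpy as np

def intersect_lines(p1, d1, p2, d2, eps=1e-8):
    """Intersect lines p1 + t*d1 and p2 + s*d2; None if near-parallel."""
    p1, d1, p2, d2 = map(np.asarray, (p1, d1, p2, d2))
    det = d1[0] * (-d2[1]) - d1[1] * (-d2[0])
    if abs(det) < eps:
        return None  # parallel lines never intersect
    dp = p2 - p1
    t = (-d2[1] * dp[0] + d2[0] * dp[1]) / det
    return p1 + t * d1

print(intersect_lines([0, 0], [1, 0], [5, 3], [0, 1]))  # [5. 0.]
```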

Step 6 — cornerSubPix polish. OpenCV sub-pixel refinement on the mask image with a 5x5 search window, 30 iterations, epsilon 0.01:

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
cv2.cornerSubPix(gray_mask, corners, (5, 5), (-1, -1), criteria)

Step 7 — Corner ordering. Final corners are sorted into TL, TR, BR, BL order using coordinate sum and difference.
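
One common sum/difference scheme, sketched here under the assumption that +y points down in image coordinates (not necessarily the exact implementation):

```python
import numpy as np

def order_corners(pts):
    """Order 4 corner points as TL, TR, BR, BL (image coords, +y down)."""
    pts = np.asarray(pts, dtype=np.float64)
    s = pts.sum(axis=1)          # x + y: smallest at TL, largest at BR
    d = pts[:, 1] - pts[:, 0]    # y - x: smallest at TR, largest at BL
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]])

shuffled = [[10, 0], [0, 10], [0, 0], [10, 10]]
print(order_corners(shuffled))  # rows: TL, TR, BR, BL
```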

RANSAC Corner Extraction Diagram

[Diagram: 1. Contour points → 2. Edge clusters → 3. RANSAC lines → 4. Intersection corners (TL, TR, BR, BL)]

Why RANSAC over approxPolyDP? approxPolyDP simplifies the contour polygon — it works when the shape is clean but fails on noisy masks. RANSAC fits lines to edge clusters and is inherently robust to outliers from mask noise or partial occlusion. Line intersection then gives a mathematically exact corner position, not a polygon vertex chosen from noisy contour points. The result is sub-pixel corner accuracy even with imperfect segmentation masks, and significantly better PnP depth estimates.

5. PnP Depth Estimation

The Problem

Given 4 known 3D points (gate corners) and their 2D pixel positions (from RANSAC), recover the gate's 6DoF pose relative to the camera — specifically, the translation vector whose magnitude gives the distance.

Knowns

3D Gate Corners (World Frame)

A 2.0m x 2.0m square, centered at origin, lying in the XY plane:

GATE_CORNERS_3D = [
  [-1.0,  1.0,  0],  # TL
  [ 1.0,  1.0,  0],  # TR
  [ 1.0, -1.0,  0],  # BR
  [-1.0, -1.0,  0],  # BL
]

Camera Intrinsics

Resolution: 640 x 480
FOV: 90deg H, 60deg V

fx = 640 / (2 * tan(45deg))
   = 320 / 1.0
   = 320.0

fy = 480 / (2 * tan(30deg))
   = 240 / 0.5774
   = 415.7

K = [[320.0,   0, 320],
     [  0, 415.7, 240],
     [  0,     0,   1]]

Solver

success, rvec, tvec = cv2.solvePnP(
    GATE_CORNERS_3D,     # (4,3) float64 — known 3D positions
    corners_2d,          # (4,2) float64 — detected pixel corners
    K,                   # 3x3 camera matrix
    dist_coeffs,         # distortion (zeros for now)
    flags=cv2.SOLVEPNP_IPPE_SQUARE  # optimized for coplanar squares
)

The SOLVEPNP_IPPE_SQUARE flag uses the Infinitesimal Plane-based Pose Estimation method, specifically optimized for coplanar square markers. It is both faster and more accurate than the general SOLVEPNP_ITERATIVE for this geometry.

Output

| Variable | Shape | Meaning |
|---|---|---|
| rvec | (3,1) | Rodrigues rotation vector (gate orientation relative to camera) |
| tvec | (3,1) | Translation vector [x, y, z] — gate center in camera frame (meters) |
| distance | scalar | np.linalg.norm(tvec) — Euclidean distance to gate center |

Fallback Distance Estimation

When PnP fails (degenerate corners, colinear points), a simple pinhole-model estimate is used:

distance = (GATE_WIDTH * fx) / pixel_width
         = (2.0 * 320.0) / bbox_w

Fallback is coarse: It assumes the gate is fronto-parallel (facing the camera head-on). For oblique views, this over-estimates the distance. Always prefer PnP when corners are available.
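
A quick numeric check of that bias, using the pinhole numbers from above (the cos θ width foreshortening is an approximation for a gate yawed by θ):

```python
import math

GATE_W, FX, TRUE_Z = 2.0, 320.0, 10.0

def fallback_distance(pixel_width):
    """Pinhole fallback: assumes the gate is fronto-parallel."""
    return GATE_W * FX / pixel_width

frontal_px = GATE_W * FX / TRUE_Z                     # 64 px at 10 m, head-on
oblique_px = frontal_px * math.cos(math.radians(30))  # yawed 30 deg: ~55.4 px

print(fallback_distance(frontal_px))  # 10.0 — exact when fronto-parallel
print(fallback_distance(oblique_px))  # ~11.55 — over-estimates the true 10 m
```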

6. Color Detection (VQ1 Mode)

In VQ1, gates are visually highlighted with a saturated color against a desaturated environment. Color detection is trivially fast (~0.5ms) and needs no training.

Pipeline

  1. Convert BGR → HSV: cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
  2. Threshold: cv2.inRange(hsv, hsv_lower, hsv_upper)
  3. Morphological cleanup: close(5x5) then open(5x5) — fills small gaps, removes speckle
  4. Contour extraction: cv2.findContours(mask, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE)
  5. Filter: area > 500px, aspect ratio 0.3 to 3.0
  6. Corner fitting: approxPolyDP at 4% arc length epsilon; falls back to bounding rect if not 4 vertices

Color Presets

| Preset | HSV Lower | HSV Upper | Notes |
|---|---|---|---|
| green | (40, 100, 100) | (90, 255, 255) | Default — most common gate highlight |
| cyan | (80, 100, 100) | (100, 255, 255) | Blue-green tinted gates |
| magenta | (140, 100, 100) | (170, 255, 255) | Pink/purple highlights |
| red | (0, 120, 100) | (10, 255, 255) | Red highlight (wraps at 180) |
| yellow | (20, 100, 100) | (40, 255, 255) | Yellow/gold highlights |
| white | (0, 0, 200) | (180, 30, 255) | Bright desaturated (white glow) |

# Switch preset at runtime
pipe.detector.set_color_preset("cyan")

Tip: Run the simulator, screenshot a gate, and check HSV values in an image editor to select the correct preset. The default "green" range covers most VQ1 gate highlight colors.
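
For experimenting outside the pipeline, cv2.inRange can be mimicked in plain numpy (preset values copied from the table above; helper name illustrative):

```python
import numpy as np

PRESETS = {
    "green": ((40, 100, 100), (90, 255, 255)),
    "cyan":  ((80, 100, 100), (100, 255, 255)),
}

def in_range(hsv, lower, upper):
    """numpy equivalent of cv2.inRange on an HxWx3 HSV array."""
    lo, hi = np.asarray(lower), np.asarray(upper)
    hit = np.all(hsv >= lo, axis=-1) & np.all(hsv <= hi, axis=-1)
    return hit.astype(np.uint8) * 255

hsv = np.array([[[60, 200, 200], [10, 50, 50]]])  # one green pixel, one dull one
print(in_range(hsv, *PRESETS["green"]))  # [[255   0]]
```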

7. YOLO Detection (VQ2 Mode)

For VQ2 environments where gates are not highlighted, a trained neural network detector is required. We use YOLOv8n or YOLOv11n for single-class gate detection.

Model Format Priority

| Format | Extension | Speed | Notes |
|---|---|---|---|
| TensorRT | .engine | Fastest | GPU-optimized, FP16, hardware-specific |
| ONNX | .onnx | Fast | Cross-platform, good baseline |
| PyTorch | .pt | Moderate | For training/debugging only |

The detector automatically resolves the best available format at load time (engine > onnx > pt).

Configuration

# race_config.py
yolo_model_path: str = "gate_detector.pt"
yolo_conf_threshold: float = 0.5

Corner limitation: YOLO returns axis-aligned bounding boxes. The 4 bbox corners (x1,y1), (x2,y1), (x2,y2), (x1,y2) are NOT the actual gate edge positions. When a gate is rotated or viewed at an angle, the bbox inflates beyond the gate edges, and PnP receives systematically incorrect corner inputs. This is why YOLO mode has worse depth accuracy than U-Net + RANSAC.
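
A quick geometric illustration of that inflation for a gate rolled by θ in the image plane (pure 2D geometry: the axis-aligned box of a square of side s has side s·(cos θ + sin θ)):

```python
import math

def aabb_inflation(theta_deg):
    """Side of the axis-aligned bbox of a unit square rolled by theta,
    relative to the square's true side length."""
    t = math.radians(theta_deg)
    return math.cos(t) + math.sin(t)

print(round(aabb_inflation(0), 3))   # 1.0   — bbox matches the gate exactly
print(round(aabb_inflation(30), 3))  # 1.366 — bbox corners ~37% outside the edges
```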

When to Use YOLO

Use "yolo" only when U-Net weights are unavailable and gates are not color-highlighted (typical VQ2 conditions with complex backgrounds). Expect reduced PnP depth accuracy relative to U-Net + RANSAC because of the bbox corner limitation described above.

8. Training Workflows

U-Net Training

python gate_segmentation.py train --data dataset_gates_seg

Dataset structure:

dataset_gates_seg/
  train/
    images/    # BGR frames (PNG/JPG)
    masks/     # Binary masks (white = gate, black = background)
  val/
    images/
    masks/

| Parameter | Value | Notes |
|---|---|---|
| Loss function | Dice + BCE combined | Dice handles class imbalance; BCE provides gradient everywhere |
| Optimizer | Adam, lr=1e-3 | Default PyTorch Adam |
| Epochs | 100 | Checkpoints every 25 epochs |
| Batch size | 8 | Increase if VRAM allows |
| Input size | 640x480 | Matches camera resolution |
| Best model | Saved by validation loss | dataset_gates_seg/weights/best.pt |
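
The Dice + BCE combination can be sketched in plain numpy (the training code itself uses the torch-based DiceBCELoss in gate_segmentation.py; this is only to show why the two terms complement each other):

```python
import numpy as np

def dice_bce_loss(pred, target, eps=1e-6):
    """Dice + BCE on sigmoid probabilities (numpy sketch of the idea)."""
    pred, target = pred.ravel(), target.ravel()
    bce = -np.mean(target * np.log(pred + eps)
                   + (1 - target) * np.log(1 - pred + eps))
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice

perfect = np.array([0.0, 1.0, 1.0, 0.0])
print(dice_bce_loss(perfect, perfect) < 0.01)     # True — near-zero loss
print(dice_bce_loss(1 - perfect, perfect) > 1.0)  # True — large loss
```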

Export to ONNX:

python gate_segmentation.py export --weights best.pt
# Produces: gate_seg.onnx (opset 17, dynamic batch axis)

YOLO Training

# Step 1: Auto-label using VQ1 color detection
python yolo-auto-label.py

# Step 2: Train
python yolo-train.py train

# Step 3: Export for deployment
python yolo-train.py export
# Produces: gate_detector.engine (TensorRT FP16) or gate_detector.onnx

| Parameter | Value |
|---|---|
| Base model | yolo11n.pt (nano — speed-optimized) |
| Epochs | 150 |
| Batch size | 16 |
| Image size | 640 |
| Patience | 30 (early stopping) |
| Augmentation | HSV jitter, rotation (10deg), translate, scale, mosaic, mixup, random erasing |
| No vertical flip | Gates have orientation — vertical flip creates invalid training examples |

Auto-labeling workflow: The yolo-auto-label.py script runs the VQ1 color detector on simulator frames and automatically generates YOLO-format bounding box labels. This bootstraps training data without manual annotation — you fly through VQ1, record frames, and the color detector produces labels for free.

9. Camera Calibration

Current Approach: Computed from FOV

Camera intrinsics are derived from the declared field-of-view angles:

fx = width  / (2 * tan(fov_h / 2))  # 640 / (2 * tan(45deg)) = 320.0
fy = height / (2 * tan(fov_v / 2))  # 480 / (2 * tan(30deg)) = 415.7
cx = width  / 2                       # 320.0
cy = height / 2                       # 240.0

# Distortion coefficients assumed zero
dist_coeffs = [0, 0, 0, 0, 0]
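
Plugging in the numbers confirms the values used throughout this guide:

```python
import math

fx = 640 / (2 * math.tan(math.radians(90 / 2)))  # horizontal FOV = 90 deg
fy = 480 / (2 * math.tan(math.radians(60 / 2)))  # vertical FOV = 60 deg
print(round(fx, 1), round(fy, 1))  # 320.0 415.7
```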

This is implemented in vision_pipeline.py as the CameraConfig class:

@property
def fx(self) -> float:
    return self.width / (2.0 * np.tan(np.radians(self.fov_h / 2)))

@property
def fy(self) -> float:
    return self.height / (2.0 * np.tan(np.radians(self.fov_v / 2)))

Better Approach: Checkerboard Calibration

For real hardware or high-fidelity simulation, a proper calibration produces a more accurate intrinsic matrix K and distortion coefficients:

  1. Print a checkerboard pattern (e.g. 9x6 inner corners)
  2. Capture 15-20 images from varied angles and distances
  3. Run cv2.calibrateCamera() to solve for K and distortion
  4. Update CameraConfig with the calibrated values

# OpenCV calibration (sketch)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
objp = np.zeros((6*9, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for img in calibration_images:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, (9, 6), None)
    if ret:
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        obj_points.append(objp)
        img_points.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None
)
print(f"Reprojection error: {ret:.4f} px")  # should be < 0.5

Why this matters: Lens distortion shifts pixel positions, especially near image edges. If the drone detects a gate in the periphery, uncorrected distortion will shift the corner positions and bias the PnP depth estimate. Calibrated distortion coefficients allow cv2.undistortPoints() to correct this before PnP solving.
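
To see the scale of the effect, the radial part of the Brown-Conrady model can be applied to a normalized image point (the k1 coefficient here is illustrative, not a calibrated value):

```python
def distort_radial(x, y, k1, k2=0.0):
    """Apply radial distortion to normalized image coords (Brown model)."""
    r2 = x * x + y * y
    f = 1 + k1 * r2 + k2 * r2 * r2
    return x * f, y * f

# Near the image center the shift is negligible...
print(distort_radial(0.1, 0.0, k1=-0.1))  # ≈ (0.0999, 0.0)
# ...but at the periphery (normalized radius ~1) the same k1 moves the point 10%
print(distort_radial(1.0, 0.0, k1=-0.1))  # (0.9, 0.0)
```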

10. Pipeline Integration Summary

The complete vision pipeline is orchestrated by VisionPipeline in vision_pipeline.py:

from vision_pipeline import VisionPipeline, CameraConfig

cam = CameraConfig(width=640, height=480, fov_h=90.0, fov_v=60.0)
pipe = VisionPipeline(cam, mode="unet")  # or "color", "yolo"
pipe.setup()  # loads model weights

# Per frame:
detections = pipe.process(frame)       # detect + PnP in one call
nearest = pipe.get_nearest_gate()      # closest valid gate
latency = pipe.inference_latency_ms    # total ms for detect + PnP

Data Flow

| Stage | Input | Output | Time |
|---|---|---|---|
| 1. Detection | 640x480 BGR frame | List of GateDetection (bbox, confidence, corners_2d) | 0.5-12ms |
| 2. PnP Estimation | 4 corner pixel positions | rvec, tvec, distance (meters) | <0.1ms |
| 3. Gate Selection | All detections | Nearest valid gate (confidence > 0.3, distance < 100m) | negligible |

Key Source Files

| File | Purpose |
|---|---|
| vision_pipeline.py | Core pipeline: ColorGateDetector, YOLOGateDetector, PnPEstimator, VisionPipeline |
| gate_segmentation.py | GateSegNet (U-Net), GateSegDetector (RANSAC corners), DiceBCELoss, GateSegTrainer |
| race_config.py | CameraSettings, VisionSettings, GateSettings — all tunable parameters |
| yolo-train.py | YOLO training configuration and runner |
| yolo-auto-label.py | Auto-labeling: color detection generates YOLO training labels |