The entire perception-to-control loop is a three-step chain: detect the gate in the camera frame, solve PnP for the gate's pose and distance, and select the nearest valid gate for the controller.
Corner accuracy directly determines depth accuracy. A 2px error at 10m range produces roughly 0.5m of depth error — enough to cause a collision or a missed gate. This is why we use U-Net segmentation with RANSAC corner extraction instead of simple bounding boxes.
Three detection backends are available. Selection is controlled by VisionSettings.mode in race_config.py:
| Feature | Color (VQ1) | YOLO (VQ2) | U-Net (Primary) |
|---|---|---|---|
| Method | HSV threshold + contours | Neural network bbox | Pixel segmentation + RANSAC |
| Speed | ~0.5ms | ~12ms | ~5ms (GPU) |
| Corner source | approxPolyDP or bbox | bbox corners (inaccurate) | RANSAC line fit + intersection |
| PnP accuracy | Good (if quad fit succeeds) | Poor (bbox != gate edges) | Best (sub-pixel corners) |
| Partial gates | No | Partial | Yes |
| Training needed | No (just HSV range) | Yes (labeled dataset) | Yes (segmentation masks) |
| Best for | Highlighted gates (VQ1) | Complex backgrounds | Racing (accuracy + speed) |
```python
# race_config.py — VisionSettings
mode: str = "unet"  # "color" (VQ1), "yolo" (VQ2), or "unet" (primary)
```
"unet" for racing. Fall back to "color" for VQ1 qualifiers where gates are highlighted with a known color. Use "yolo" only when U-Net is unavailable and gates are not highlighted.
GateSegNet is a lightweight 4-level encoder-decoder with skip connections. It predicts a binary gate mask at full input resolution.
Each ConvBlock consists of two sequential Conv3x3 + BatchNorm + ReLU layers:

```python
class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```
Encoder:

| Level | Channels | Resolution | Operation |
|---|---|---|---|
| 1 | 3 → 32 | 640x480 | ConvBlock + MaxPool2d(2) |
| 2 | 32 → 64 | 320x240 | ConvBlock + MaxPool2d(2) |
| 3 | 64 → 128 | 160x120 | ConvBlock + MaxPool2d(2) |
| 4 | 128 → 256 | 80x60 | ConvBlock + MaxPool2d(2) |
| Bottleneck | 256 → 256 | 40x30 | ConvBlock (no pool) |
Decoder:

| Level | Channels | Resolution | Operation |
|---|---|---|---|
| 4 | 256+256 → 128 | 80x60 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 3 | 128+128 → 64 | 160x120 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 2 | 64+64 → 32 | 320x240 | ConvTranspose2d(2) + skip concat + ConvBlock |
| 1 | 32+32 → 32 | 640x480 | ConvTranspose2d(2) + skip concat + ConvBlock |
| Output | 32 → 1 | 640x480 | Conv2d(1x1) + Sigmoid |
Contour approximation (cv2.approxPolyDP) finds polygon vertices on the contour boundary. If the mask is noisy or the contour has bumps, the vertices jump around unpredictably. RANSAC fits lines to clusters of edge points and intersects adjacent lines for mathematically exact corner positions, which makes it robust to outliers.
Step 1 — Contour extraction. Threshold the U-Net sigmoid mask (scaled to 0-255) at 127 to produce a binary image, then run cv2.findContours(binary, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE). Filter by minimum area (200px) and aspect ratio (0.2 to 5.0).
Step 2 — Edge template. For each valid contour, compute the minimum-area rotated rectangle via cv2.minAreaRect. Extract 4 box corners with cv2.boxPoints, then order them as TL, TR, BR, BL. These define 4 template edge segments: TL→TR, TR→BR, BR→BL, BL→TL.
Step 3 — Point assignment. Each contour point is assigned to its nearest edge using point-to-line-segment distance. This partitions the contour into 4 clusters, one per gate edge.
```python
# Pseudocode — assign each contour point to its nearest template edge
for p in contour_points:
    best_edge = argmin([point_line_dist(p, edge_a, edge_b)
                        for edge_a, edge_b in edges])
    buckets[best_edge].append(p)
```
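A runnable sketch of the assignment step, with the distance helper expanded (function names here are illustrative, not taken from the codebase):

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Distance from 2D point p to the segment a-b."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def assign_to_edges(contour_pts, box):
    """box: 4 corners ordered TL, TR, BR, BL. Returns one point bucket per edge."""
    edges = [(box[i], box[(i + 1) % 4]) for i in range(4)]
    buckets = [[] for _ in range(4)]
    for p in contour_pts:
        dists = [point_segment_dist(p, a, b) for a, b in edges]
        buckets[int(np.argmin(dists))].append(p)
    return buckets
```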
Step 4 — RANSAC line fitting (per edge cluster):

```python
# RANSAC core loop (simplified)
best_count = 0
for _ in range(50):
    p1, p2 = random_sample(edge_pts, 2)
    direction = normalize(p2 - p1)
    normal = [-direction[1], direction[0]]
    dists = abs((pts - p1) @ normal)
    inliers = pts[dists < 3.0]
    if len(inliers) > best_count:
        best_count = len(inliers)
        mean = inliers.mean(axis=0)
        _, _, vt = np.linalg.svd(inliers - mean)
        best_line = (mean, vt[0])  # point + direction (PCA refit of the inliers)
```
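A self-contained version of the loop, with random_sample and normalize expanded (the seeded RNG is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def ransac_line(pts, iters=50, tol=3.0):
    """Fit a 2D line (mean point, unit direction) to pts, ignoring outliers."""
    best_count, best_line = 0, None
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        d = pts[j] - pts[i]
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue  # degenerate sample
        direction = d / norm
        normal = np.array([-direction[1], direction[0]])
        dists = np.abs((pts - pts[i]) @ normal)
        inliers = pts[dists < tol]
        if len(inliers) > best_count:
            best_count = len(inliers)
            mean = inliers.mean(axis=0)
            _, _, vt = np.linalg.svd(inliers - mean)
            best_line = (mean, vt[0])  # PCA refit of the inlier set
    return best_line
```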
Step 5 — Line intersection. Adjacent RANSAC-fitted lines are intersected analytically to produce 4 sub-pixel corner points: intersecting line L(i) with line L((i+1) mod 4) gives corner C((i+1) mod 4).

```python
# Line intersection: p1 + t*d1 = p2 + s*d2
det = d1[0]*(-d2[1]) - d1[1]*(-d2[0])
if abs(det) < 1e-8:
    return None  # parallel lines
dp = p2 - p1
t = (-d2[1]*dp[0] + d2[0]*dp[1]) / det
corner = p1 + t * d1
```
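The same math wrapped as a function so it can be checked in isolation (the function name is illustrative):

```python
import numpy as np

def intersect_lines(p1, d1, p2, d2):
    """Intersect lines p1 + t*d1 and p2 + s*d2; returns None if (near-)parallel."""
    det = d1[0] * (-d2[1]) - d1[1] * (-d2[0])
    if abs(det) < 1e-8:
        return None
    dp = p2 - p1
    t = (-d2[1] * dp[0] + d2[0] * dp[1]) / det
    return p1 + t * d1
```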
Step 6 — cornerSubPix polish. OpenCV sub-pixel refinement on the mask image with a 5x5 search window, 30 iterations, epsilon 0.01:

```python
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
cv2.cornerSubPix(gray_mask, corners, (5, 5), (-1, -1), criteria)
```
Step 7 — Corner ordering. Final corners are sorted into TL, TR, BR, BL order using coordinate sum and difference.
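In image coordinates (y down), the smallest x + y is TL, the largest is BR; the smallest y - x is TR, the largest is BL. A sketch of that ordering:

```python
import numpy as np

def order_corners(pts):
    """Order 4 (x, y) points as TL, TR, BR, BL via coordinate sum/difference."""
    pts = np.asarray(pts, dtype=np.float32)
    s = pts.sum(axis=1)        # x + y
    d = pts[:, 1] - pts[:, 0]  # y - x
    return np.array([pts[np.argmin(s)],   # TL: smallest x + y
                     pts[np.argmin(d)],   # TR: smallest y - x
                     pts[np.argmax(s)],   # BR: largest x + y
                     pts[np.argmax(d)]])  # BL: largest y - x
```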
Given 4 known 3D points (gate corners) and their 2D pixel positions (from RANSAC), recover the gate's 6DoF pose relative to the camera — specifically, the translation vector whose magnitude gives the distance.
A 2.0m x 2.0m square, centered at the origin, lying in the XY plane:

```python
GATE_CORNERS_3D = [
    [-1.0,  1.0, 0],  # TL
    [ 1.0,  1.0, 0],  # TR
    [ 1.0, -1.0, 0],  # BR
    [-1.0, -1.0, 0],  # BL
]
```
```text
Resolution: 640 x 480
FOV: 90deg H, 60deg V

fx = 640 / (2 * tan(45deg)) = 320 / 1.0    = 320.0
fy = 480 / (2 * tan(30deg)) = 240 / 0.5774 = 415.7

K = [[320.0,   0,   320],
     [  0,   415.7, 240],
     [  0,     0,     1]]
```
```python
success, rvec, tvec = cv2.solvePnP(
    np.asarray(GATE_CORNERS_3D, np.float64),  # (4,3) — known 3D positions
    corners_2d,                     # (4,2) float64 — detected pixel corners
    K,                              # 3x3 camera matrix
    dist_coeffs,                    # distortion (zeros for now)
    flags=cv2.SOLVEPNP_IPPE_SQUARE  # optimized for coplanar squares
)
```
The SOLVEPNP_IPPE_SQUARE flag uses the Infinitesimal Plane-based Pose Estimation method, specifically optimized for coplanar square markers. It is both faster and more accurate than the general SOLVEPNP_ITERATIVE for this geometry.
| Variable | Shape | Meaning |
|---|---|---|
| rvec | (3,1) | Rodrigues rotation vector (gate orientation relative to camera) |
| tvec | (3,1) | Translation vector [x, y, z] — gate center in camera frame (meters) |
| distance | scalar | np.linalg.norm(tvec) — Euclidean distance to gate center |
When PnP fails (degenerate or collinear corners), a simple pinhole-model estimate is used:

```python
distance = (GATE_WIDTH * fx) / pixel_width
         # = (2.0 * 320.0) / bbox_w
```
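A quick consistency check of the fallback, assuming the intrinsics derived above (fx = 320) and a gate facing the camera head-on 10 m away:

```python
import numpy as np

fx, cx = 320.0, 320.0
GATE_WIDTH = 2.0
corners_3d = np.array([[-1.0, 1.0, 0], [1.0, 1.0, 0],
                       [1.0, -1.0, 0], [-1.0, -1.0, 0]])

z = 10.0                                    # gate 10 m straight ahead
cam = corners_3d + [0.0, 0.0, z]            # camera-frame coordinates
u = fx * cam[:, 0] / cam[:, 2] + cx         # projected pixel x-coordinates
pixel_width = u.max() - u.min()             # 2 m gate spans 64 px at 10 m
distance = (GATE_WIDTH * fx) / pixel_width  # recovers 10.0 m
print(pixel_width, distance)
```

This also shows the sensitivity the section opened with: at this range a few pixels of width error shift the estimate by a few tenths of a meter.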
In VQ1, gates are visually highlighted with a saturated color against a desaturated environment. Color detection is trivially fast (~0.5ms) and needs no training.
The pipeline:

1. cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
2. cv2.inRange(hsv, hsv_lower, hsv_upper)
3. cv2.findContours(mask, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE)
4. approxPolyDP at 4% arc-length epsilon; falls back to the bounding rect if the fit is not 4 vertices

| Preset | HSV Lower | HSV Upper | Notes |
|---|---|---|---|
| green | (40, 100, 100) | (90, 255, 255) | Default — most common gate highlight |
| cyan | (80, 100, 100) | (100, 255, 255) | Blue-green tinted gates |
| magenta | (140, 100, 100) | (170, 255, 255) | Pink/purple highlights |
| red | (0, 120, 100) | (10, 255, 255) | Red highlight (wraps at 180) |
| yellow | (20, 100, 100) | (40, 255, 255) | Yellow/gold highlights |
| white | (0, 0, 200) | (180, 30, 255) | Bright desaturated (white glow) |
```python
# Switch preset at runtime
pipe.detector.set_color_preset("cyan")
```
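The per-pixel test behind the table is simple; a numpy equivalent of cv2.inRange with the green preset (illustrative only, not the codebase's implementation):

```python
import numpy as np

GREEN_LOWER = np.array([40, 100, 100])
GREEN_UPPER = np.array([90, 255, 255])

def in_range(hsv, lo, hi):
    """numpy equivalent of cv2.inRange: 255 where lo <= pixel <= hi on all channels."""
    mask = np.all((hsv >= lo) & (hsv <= hi), axis=-1)
    return mask.astype(np.uint8) * 255

hsv = np.array([[[60, 200, 200],   # saturated green -> passes
                 [10, 50, 50]]],   # dull red-ish -> rejected
               dtype=np.uint8)
mask = in_range(hsv, GREEN_LOWER, GREEN_UPPER)  # 255 for the green pixel, else 0
```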
For VQ2 environments where gates are not highlighted, a trained neural network detector is required. We use YOLOv8n or YOLOv11n for single-class gate detection.
| Format | Extension | Speed | Notes |
|---|---|---|---|
| TensorRT | .engine | Fastest | GPU-optimized, FP16, hardware-specific |
| ONNX | .onnx | Fast | Cross-platform, good baseline |
| PyTorch | .pt | Moderate | For training/debugging only |
The detector automatically resolves the best available format at load time (engine > onnx > pt).
```python
# race_config.py
yolo_model_path: str = "gate_detector.pt"
yolo_conf_threshold: float = 0.5
```
```bash
python gate_segmentation.py train --data dataset_gates_seg
```
Dataset structure:
```text
dataset_gates_seg/
    train/
        images/   # BGR frames (PNG/JPG)
        masks/    # Binary masks (white = gate, black = background)
    val/
        images/
        masks/
```
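Images and masks are paired by filename. A small sketch for verifying that every image in a split has a matching mask (a hypothetical helper, not part of gate_segmentation.py):

```python
from pathlib import Path

def unmatched_images(root: str, split: str) -> set:
    """Return stems of images in <root>/<split>/images that lack a mask."""
    base = Path(root) / split
    imgs = {p.stem for p in (base / "images").iterdir()}
    masks = {p.stem for p in (base / "masks").iterdir()}
    return imgs - masks
```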
| Parameter | Value | Notes |
|---|---|---|
| Loss function | Dice + BCE combined | Dice handles class imbalance; BCE provides gradient everywhere |
| Optimizer | Adam, lr=1e-3 | Default PyTorch Adam |
| Epochs | 100 | Checkpoints every 25 epochs |
| Batch size | 8 | Increase if VRAM allows |
| Input size | 640x480 | Matches camera resolution |
| Best model | Saved by validation loss | dataset_gates_seg/weights/best.pt |
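The combined loss from the table can be sketched in numpy, assuming predictions are already sigmoid probabilities (the project's actual DiceBCELoss lives in gate_segmentation.py):

```python
import numpy as np

def dice_bce_loss(pred, target, eps=1e-6):
    """Dice + BCE on flattened probability maps; pred and target in [0, 1]."""
    p, t = pred.ravel().astype(float), target.ravel().astype(float)
    inter = (p * t).sum()
    dice = 1.0 - (2.0 * inter + eps) / (p.sum() + t.sum() + eps)
    bce = -np.mean(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
    return dice + bce
```

Dice pushes overlap directly (useful when gate pixels are a small fraction of the frame), while BCE keeps gradients informative on every pixel.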
Export to ONNX:

```bash
python gate_segmentation.py export --weights best.pt
# Produces: gate_seg.onnx (opset 17, dynamic batch axis)
```
```bash
# Step 1: Auto-label using VQ1 color detection
python yolo-auto-label.py

# Step 2: Train
python yolo-train.py train

# Step 3: Export for deployment
python yolo-train.py export
# Produces: gate_detector.engine (TensorRT FP16) or gate_detector.onnx
```
| Parameter | Value |
|---|---|
| Base model | yolo11n.pt (nano — speed-optimized) |
| Epochs | 150 |
| Batch size | 16 |
| Image size | 640 |
| Patience | 30 (early stopping) |
| Augmentation | HSV jitter, rotation (10deg), translate, scale, mosaic, mixup, random erasing |
| No vertical flip | Gates have orientation — vertical flip creates invalid training examples |
The yolo-auto-label.py script runs the VQ1 color detector on simulator frames and automatically generates YOLO-format bounding box labels. This bootstraps training data without manual annotation: you fly through VQ1, record frames, and the color detector produces labels for free.
Camera intrinsics are derived from the declared field-of-view angles:

```python
fx = width / (2 * tan(fov_h / 2))   # 640 / (2 * tan(45deg)) = 320.0
fy = height / (2 * tan(fov_v / 2))  # 480 / (2 * tan(30deg)) = 415.7
cx = width / 2                      # 320.0
cy = height / 2                     # 240.0

# Distortion coefficients assumed zero
dist_coeffs = [0, 0, 0, 0, 0]
```
This is implemented in vision_pipeline.py as the CameraConfig class:

```python
@property
def fx(self) -> float:
    return self.width / (2.0 * np.tan(np.radians(self.fov_h / 2)))

@property
def fy(self) -> float:
    return self.height / (2.0 * np.tan(np.radians(self.fov_v / 2)))
```
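A self-contained sketch for checking the numbers above; the dataclass fields are an assumption, only fx/fy mirror the documented properties:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class CameraConfig:
    width: int = 640
    height: int = 480
    fov_h: float = 90.0   # horizontal FOV, degrees
    fov_v: float = 60.0   # vertical FOV, degrees

    @property
    def fx(self) -> float:
        return self.width / (2.0 * np.tan(np.radians(self.fov_h / 2)))

    @property
    def fy(self) -> float:
        return self.height / (2.0 * np.tan(np.radians(self.fov_v / 2)))

cam = CameraConfig()
print(round(cam.fx, 1), round(cam.fy, 1))  # 320.0 415.7
```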
For real hardware or high-fidelity simulation, a proper calibration produces a more accurate intrinsic matrix K and distortion coefficients:

- Run cv2.calibrateCamera() to solve for K and distortion
- Update CameraConfig with the calibrated values

```python
# OpenCV calibration (sketch)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
objp = np.zeros((6*9, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for img in calibration_images:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, (9, 6), None)
    if ret:
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        obj_points.append(objp)
        img_points.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None
)
print(f"Reprojection error: {ret:.4f} px")  # should be < 0.5
```
Real lenses introduce distortion; apply cv2.undistortPoints() to the detected corners before PnP solving.
The complete vision pipeline is orchestrated by VisionPipeline in vision_pipeline.py:

```python
from vision_pipeline import VisionPipeline, CameraConfig

cam = CameraConfig(width=640, height=480, fov_h=90.0, fov_v=60.0)
pipe = VisionPipeline(cam, mode="unet")  # or "color", "yolo"
pipe.setup()  # loads model weights

# Per frame:
detections = pipe.process(frame)     # detect + PnP in one call
nearest = pipe.get_nearest_gate()    # closest valid gate
latency = pipe.inference_latency_ms  # total ms for detect + PnP
```
| Stage | Input | Output | Time |
|---|---|---|---|
| 1. Detection | 640x480 BGR frame | List of GateDetection (bbox, confidence, corners_2d) | 0.5-12ms |
| 2. PnP Estimation | 4 corner pixel positions | rvec, tvec, distance (meters) | <0.1ms |
| 3. Gate Selection | All detections | Nearest valid gate (confidence > 0.3, distance < 100m) | negligible |
| File | Purpose |
|---|---|
| vision_pipeline.py | Core pipeline: ColorGateDetector, YOLOGateDetector, PnPEstimator, VisionPipeline |
| gate_segmentation.py | GateSegNet (U-Net), GateSegDetector (RANSAC corners), DiceBCELoss, GateSegTrainer |
| race_config.py | CameraSettings, VisionSettings, GateSettings — all tunable parameters |
| yolo-train.py | YOLO training configuration and runner |
| yolo-auto-label.py | Auto-labeling: color detection generates YOLO training labels |