AI Grand Prix — Local GPU Training Guide

Train all AI models on your MSI RTX 5080 16G • $0 cost • ~2 hours total

Your GPU: MSI Gaming RTX 5080 Shadow 3X OC — 16 GB GDDR7, Blackwell architecture, 10,752 CUDA cores, ~2.6 GHz boost clock. This is a beast for training. Faster than a cloud A100 for single-GPU workloads, and you own it. No cloud account, no hourly billing, no SSH. Just clone, run one script, get weights.

GPU Compatibility

| GPU | VRAM | U-Net | YOLO | RL (2M steps) | Total Time |
|---|---|---|---|---|---|
| RTX 5080 (yours) | 16 GB GDDR7 | ~8 min | ~25 min | ~1.2 hr | ~2 hr |
| RTX 5090 | 32 GB GDDR7 | ~6 min | ~18 min | ~50 min | ~1.3 hr |
| RTX 4090 | 24 GB | ~12 min | ~35 min | ~1.5 hr | ~2.5 hr |
| RTX 4080 | 16 GB | ~18 min | ~50 min | ~2.5 hr | ~3.5 hr |
| RTX 4070 Ti | 12 GB | ~22 min | ~1 hr | ~3 hr | ~4 hr |
| RTX 3080 | 10 GB | ~30 min | ~1.5 hr | ~4 hr | ~6 hr |
| RTX 3060 | 12 GB | ~40 min | ~2 hr | ~5 hr | ~7.5 hr |
Your RTX 5080 is Blackwell architecture with GDDR7 memory — significantly faster than Ada Lovelace (40-series). The 16 GB VRAM handles all model sizes easily. Training all three models takes about 2 hours total and costs $0. This beats a $1.39/hr cloud A100 for single-GPU jobs.

One-Shot Setup & Training ~10 min setup + ~2 hr training

Prerequisites — NVIDIA drivers + CUDA skip if already installed

Check if CUDA is already working:

nvidia-smi

If you see your GPU listed, skip ahead to Install Python + PyTorch. If not:

Windows (RTX 5080)

# RTX 5080 requires driver 570+ and CUDA 12.8+
# Download latest Game Ready driver from nvidia.com/drivers
# (search: RTX 5080, Windows 11)
# Install, restart, then verify:
nvidia-smi
# Should show: NVIDIA GeForce RTX 5080, Driver 570.xx+, CUDA 12.8

Linux (Ubuntu)

sudo apt update
sudo apt install -y nvidia-driver-570
sudo reboot

# After reboot:
nvidia-smi
# Should show: RTX 5080, Driver 570+, CUDA 12.8

Install Python + PyTorch 5 min

Windows (PowerShell or CMD)

# Install Python 3.11+ from python.org if needed

# Create venv
python -m venv aigp
aigp\Scripts\activate

# PyTorch with CUDA 12.8 (required for RTX 5080 Blackwell)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Everything else
pip install opencv-python numpy scipy pyyaml ultralytics stable-baselines3 gymnasium onnxruntime mavsdk

Linux

# Create venv
python3 -m venv aigp
source aigp/bin/activate

# PyTorch with CUDA 12.8 (required for RTX 5080 Blackwell)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Everything else
pip install opencv-python numpy scipy pyyaml ultralytics stable-baselines3 gymnasium onnxruntime mavsdk

Verify GPU is visible to PyTorch:

python -c "import torch; print(torch.cuda.get_device_name(0))"
# Should print: NVIDIA GeForce RTX 5080
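If that one-liner fails or prints the wrong thing, the most common causes are a missing PyTorch install, a CPU-only wheel, or a driver problem. This hypothetical helper (not part of the repo) distinguishes them:

```python
def cuda_diagnosis():
    """Report why CUDA might be unavailable to PyTorch. Hypothetical helper."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this venv"
    if not torch.cuda.is_available():
        return ("PyTorch found, but no CUDA device visible; likely a CPU-only "
                "wheel (reinstall with the cu128 index URL above) or a driver "
                "problem (check nvidia-smi)")
    return "OK: " + torch.cuda.get_device_name(0)

print(cuda_diagnosis())
```

On a correctly set-up 5080 box it should print `OK: NVIDIA GeForce RTX 5080`; otherwise the message points at the fix.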

Clone the repo 1 min

git clone https://github.com/blakefarabi/grandprix.git
cd grandprix

Run the training script ~2 hours on RTX 5080

Windows

# Git Bash or WSL recommended:
bash train_all.sh

# Or run each model manually in CMD:
python gate_segmentation.py train --data dataset_gates_seg
python -c "from ultralytics import YOLO; m=YOLO('yolov8n.pt'); m.train(data='dataset_gates_yolo/data.yaml', epochs=50, imgsz=640, device=0)"
python rl_train.py train --steps 2000000

Linux

bash train_all.sh

That's it. The script handles everything: creates synthetic data, trains U-Net, YOLO, and RL policy, exports to ONNX.

Windows users without Git Bash: The train_all.sh script is a bash script. Install Git for Windows (includes Git Bash) or use WSL. Alternatively, run the three Python commands individually in CMD/PowerShell.
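If you'd rather not install Git Bash or WSL, the three commands above can also be chained from a single Python file. This is a sketch of a hypothetical `train_all.py` — it is not in the repo, it just wraps the same commands with subprocess so a failure in one step stops the run:

```python
"""Hypothetical cross-platform stand-in for train_all.sh (not in the repo)."""
import subprocess
import sys

STEPS = [
    ("U-Net segmentation",
     [sys.executable, "gate_segmentation.py", "train", "--data", "dataset_gates_seg"]),
    ("YOLO detector",
     [sys.executable, "-c",
      "from ultralytics import YOLO; m=YOLO('yolov8n.pt'); "
      "m.train(data='dataset_gates_yolo/data.yaml', epochs=50, imgsz=640, device=0)"]),
    ("RL policy (PPO)",
     [sys.executable, "rl_train.py", "train", "--steps", "2000000"]),
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; stop at the first non-zero exit code."""
    done = []
    for name, cmd in steps:
        print(f"=== {name}")
        if runner(cmd).returncode != 0:
            raise RuntimeError(f"{name} failed")
        done.append(name)
    return done

# To actually train: run_steps(STEPS)
```

The `runner` parameter exists only to make the sequencing logic testable without launching real training jobs.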

Monitor training progress check occasionally

The script prints progress for each model. You can also monitor GPU usage:

# In a separate terminal, watch GPU utilization:
nvidia-smi -l 5

# Expected during training:
#   GPU Util: 80-100%
#   Memory:   6-12 GB (varies by model)
#   Power:    200-360W (RTX 5080)
#   Temp:     60-80°C (normal)
If GPU temp exceeds 85°C: Check case airflow. Training is compute-heavy and will push the GPU hard for hours. Ensure fans are running and case is ventilated.
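The same numbers can be captured programmatically: `nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv,noheader,nounits` prints one CSV line per GPU, e.g. `97, 8123, 72`. A small parser for that output (hypothetical helper, not in the repo), with the 85 °C threshold from the note above:

```python
def parse_gpu_line(line):
    """Parse one nvidia-smi CSV line: 'util, memory MiB, temp C'."""
    util, mem_mib, temp_c = (int(x) for x in line.split(","))
    return {"util_pct": util, "mem_mib": mem_mib, "temp_c": temp_c}

def is_healthy(stats, max_temp_c=85):
    """True while the GPU stays at or below the thermal threshold."""
    return stats["temp_c"] <= max_temp_c

sample = parse_gpu_line("97, 8123, 72")
print(sample, is_healthy(sample))
```

Pipe the nvidia-smi query into this in a loop and you have a minimal thermal watchdog for long RL runs.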

Collect trained weights 1 min

After training completes, you'll have these files:

| File | Model | Size | Purpose |
|---|---|---|---|
| gate_seg.onnx | U-Net | ~15 MB | Gate segmentation (primary detector) |
| dataset_gates_seg/weights/best.pt | U-Net | ~15 MB | PyTorch weights (for further training) |
| yolo_runs/gates/weights/best.onnx | YOLO | ~6 MB | Gate detection (backup detector) |
| yolo_runs/gates/weights/best.pt | YOLO | ~6 MB | PyTorch weights |
| policy.onnx | RL (PPO) | ~0.4 MB | Learned racing policy |

Copy these to your Mac for deployment:

# From your Mac (replace PC_IP with your Windows machine's IP):
scp user@PC_IP:~/grandprix/gate_seg.onnx .
scp user@PC_IP:~/grandprix/yolo_runs/gates/weights/best.onnx .

# Or simpler — just push to GitHub from the PC:
cd grandprix
git add gate_seg.onnx yolo_runs/gates/weights/best.onnx
git commit -m "Add trained model weights (RTX 5080)"
git push

# Then pull on Mac:
git pull
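Whichever transfer route you use, it's cheap to confirm the weights arrived intact by comparing SHA-256 digests on both machines. A stdlib-only sketch (the file names are the ones listed above; any that are missing are skipped):

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large .onnx/.pt weights
    never need to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Run on both the PC and the Mac; the hex digests must match exactly.
for name in ["gate_seg.onnx", "policy.onnx"]:
    if Path(name).exists():
        print(name, sha256_of(name))
```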

Training Each Model Individually

If you want to train just one model instead of all three:

U-Net Only (~8 min)

python gate_segmentation.py train \
  --data dataset_gates_seg

# Export to ONNX
python gate_segmentation.py export \
  --weights dataset_gates_seg/weights/best.pt

Trains the pixel-level gate segmentation model. Produces the most accurate gate corners, which feed PnP depth estimation via RANSAC.
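The "corners → depth" idea reduces to the pinhole camera model: once segmentation gives the gate's apparent pixel width, depth ≈ focal_length_px × real_width_m / pixel_width. A minimal sketch — the 800 px focal length and 1.4 m gate width are illustrative values, not repo constants, and the real pipeline uses full PnP on all four corners with RANSAC:

```python
def gate_depth_m(pixel_width, focal_px=800.0, gate_width_m=1.4):
    """Pinhole-model estimate of distance to a gate from its apparent width."""
    if pixel_width <= 0:
        raise ValueError("gate not visible")
    return focal_px * gate_width_m / pixel_width

print(gate_depth_m(160))  # 800 * 1.4 / 160 = 7.0 m
```

As the drone closes in, `pixel_width` grows and the estimated depth shrinks, which is what the racing controller steers on.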

YOLO Only (~25 min)

# Auto-label from VQ1 footage (if available)
python yolo-auto-label.py

# Or train on synthetic data
python yolo-train.py train

# Export to ONNX / TensorRT
python yolo-train.py export

Trains the YOLOv8n bounding-box detector. Faster inference but less accurate corners.

RL Policy Only (~1.2 hr)

# Train PPO (2M steps)
python rl_train.py train --steps 2000000

# Export to ONNX
python rl_train.py export

Trains a neural racing controller that replaces the classical proportional pursuit. Potentially faster laps but needs more tuning.
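For reference, the classical proportional pursuit baseline the RL policy replaces fits in a few lines: steer toward the gate center with commands proportional to the offset. The gains and the normalized-offset convention here are illustrative assumptions, not the repo's actual controller:

```python
def pursuit_command(offset_x, offset_y, k_yaw=1.2, k_climb=0.8):
    """Map normalized gate-center offsets (-1..1) to (yaw_rate, climb_rate).

    offset_x > 0 means the gate center is right of the image center, so
    yaw right; offset_y > 0 means it is below center, so descend.
    Both outputs are clamped to [-1, 1].
    """
    clamp = lambda v: max(-1.0, min(1.0, v))
    return clamp(k_yaw * offset_x), clamp(-k_climb * offset_y)

print(pursuit_command(0.5, -0.25))  # (0.6, 0.2)
```

A learned policy can beat this by cutting corners and carrying speed between gates, which is why the RL model is worth the extra training time.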

Custom Training

# Fewer steps (faster, less converged)
python rl_train.py train --steps 500000

# More steps (slower, better policy)
python rl_train.py train --steps 10000000

# Resume from checkpoint
python rl_train.py train --resume \
  --checkpoint rl_checkpoints/best_model

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| torch.cuda.is_available() returns False | CUDA not installed or wrong PyTorch build | Reinstall PyTorch with CUDA 12.8: pip install torch --index-url https://download.pytorch.org/whl/cu128 |
| CUDA out of memory | Batch size too large for VRAM | Reduce batch size: edit gate_segmentation.py batch_size=4 (default 8) |
| bash: train_all.sh: not found | Not in repo directory | cd grandprix first |
| Windows: bash not recognized | No Git Bash / WSL | Install Git for Windows or run the Python commands individually |
| Training is very slow | Running on CPU instead of GPU | Verify: python -c "import torch; print(torch.cuda.get_device_name(0))" |
| GPU temp >85°C | Sustained compute load | Improve case airflow, or add --batch 4 to reduce load |
| ModuleNotFoundError | Missing dependency | pip install <module-name> |

SSH from Mac to PC (optional, train remotely)

If you want to kick off training from your Mac without touching the PC:

On the PC (one-time setup)

# Windows: Enable OpenSSH Server
# Settings → Apps → Optional Features
# → Add: OpenSSH Server → Install
# Then in PowerShell (admin):
Start-Service sshd
Set-Service -Name sshd -StartupType Automatic

# Find your PC's IP:
ipconfig
# Look for IPv4 Address (e.g., 192.168.1.42)

From your Mac

# SSH into the PC
ssh your-username@192.168.1.42

# Clone and train
git clone https://github.com/blakefarabi/grandprix.git
cd grandprix
bash train_all.sh

# Or run in background (disconnect-safe):
nohup bash train_all.sh > training.log 2>&1 &
# Check progress later:
tail -f training.log
Pro tip: Use nohup so training continues even if you close the SSH session. Check back with tail -f training.log anytime.

Local GPU vs Cloud — Cost Comparison

| | RTX 5080 (Yours) | RunPod A4000 | RunPod A100 |
|---|---|---|---|
| Cost per session | $0 (free) | $1.15 | $4.18 |
| U-Net time | ~8 min | ~30 min | ~15 min |
| YOLO time | ~25 min | ~1.5 hr | ~45 min |
| RL time | ~1.2 hr | ~4 hr | ~2 hr |
| Total | ~2 hr | ~6 hr | ~3 hr |
| Setup time | 10 min (first time) | 5 min | 5 min |
| Data transfer | Instant (local) | Download weights | Download weights |
| Best for | Everything (fastest + free) | No local GPU | Multi-GPU only |
Bottom line: Your RTX 5080 is faster than a cloud A100 for single-GPU training and costs $0. It's your best option for everything. Use cloud GPUs only when you need multi-GPU distributed training or can't access your PC. The 5080's GDDR7 bandwidth and Blackwell CUDA cores crush all cloud options at this price point (free).
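The cost column is just hourly rate × wall-clock hours. A quick sanity check using the $1.39/hr A100 figure quoted earlier in this guide (treat the rates as assumptions — cloud pricing changes):

```python
def session_cost(rate_per_hr, hours):
    """Cloud cost of one training session, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# A100 at $1.39/hr for the ~3 hr total above:
print(session_cost(1.39, 3.0))  # 4.17, in line with the ~$4.18 table figure
# Local RTX 5080: electricity aside, the marginal cost is zero.
print(session_cost(0.0, 2.0))
```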

Back to Documentation Hub