Computer Vision Best Practices
Practical guidance for building robust CV systems — from dataset creation to production deployment and ethical considerations.
Dataset Creation and Annotation
- Quality over quantity: A well-annotated dataset of 1,000 images often outperforms a poorly labeled dataset of 10,000.
- Annotation tools: Use tools like CVAT, Label Studio, Roboflow, or VGG Image Annotator for efficient labeling.
- Consistency: Define clear annotation guidelines. What counts as a "partial occlusion"? How tight should bounding boxes be?
- Balance: Ensure balanced representation across classes. Use oversampling or weighted loss functions for imbalanced datasets.
- Validation: Have multiple annotators label the same images and measure inter-annotator agreement.
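Inter-annotator agreement is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal pure-Python sketch (the function name `cohens_kappa` is just illustrative; libraries like scikit-learn provide an equivalent):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels for the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa above roughly 0.8 is usually taken as strong agreement; much lower values signal that the annotation guidelines need tightening.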
Data Augmentation Strategies
| Category | Techniques | When to Use |
|---|---|---|
| Geometric | Flip, rotate, crop, scale, translate | Almost always; fundamental augmentations |
| Color | Brightness, contrast, saturation, hue jitter | When lighting varies in real-world conditions |
| Noise | Gaussian noise, blur, JPEG compression | When input quality varies |
| Advanced | Cutout, MixUp, CutMix, Mosaic | When you need stronger regularization |
| Generative | Synthetic data generation with diffusion models | When real data is scarce or expensive |
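In practice you would build pipelines like these with torchvision transforms or Albumentations; the pure-Python sketch below (on a grayscale pixel grid, with hypothetical helper names) just illustrates the mechanics of composing geometric and color augmentations at training time:

```python
import random

def hflip(img):
    """Geometric augmentation: horizontally flip a 2-D grid of pixel values."""
    return [row[::-1] for row in img]

def brightness_jitter(img, max_delta=30, rng=None):
    """Color augmentation: add a random brightness offset, clamped to 0-255."""
    rng = rng or random.Random()
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

def augment(img, rng=None):
    """Randomly compose flip + brightness, as a training pipeline would."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        img = hflip(img)
    return brightness_jitter(img, rng=rng)
```

Each call produces a different variant of the same image, which is what gives augmentation its regularizing effect.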
Model Selection Guide
| Scenario | Recommended Model | Reasoning |
|---|---|---|
| Small dataset (<1K images) | Pretrained ResNet-18 or EfficientNet-B0 | Smaller models overfit less on small datasets |
| Large dataset (>10K images) | ResNet-50, EfficientNet-B4, or ViT | Larger models can leverage more data |
| Real-time inference | MobileNet, YOLOv8-nano | Optimized for speed on edge devices |
| Maximum accuracy | ViT-Large, ConvNeXt-XL, EfficientNet-B7 | Larger models with more compute budget |
| Object detection | YOLOv8 (start with nano/small) | Best speed-accuracy tradeoff, easy to use |
| Segmentation | U-Net with ResNet encoder | Strong baseline, well-understood architecture |
Training Tips
- Start with transfer learning: Always start with pretrained weights. Training from scratch is rarely justified.
- Learning rate: Use a learning rate finder. Typical values: 1e-3 for new heads, 1e-5 for fine-tuning backbones.
- Batch size: Larger batches are faster but may generalize worse. Use gradient accumulation if your GPU cannot fit large batches.
- Mixed precision: Use FP16 training (PyTorch AMP) to nearly double throughput and halve memory usage.
- Early stopping: Monitor validation loss and stop when it plateaus to prevent overfitting.
- GPU/TPU selection: An NVIDIA RTX 3090 or A100 is ideal. Google Colab provides free T4 GPUs for prototyping.
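The early-stopping rule above can be sketched as a small stateful helper; this is a minimal framework-agnostic version (the class name and `min_delta` parameter are illustrative — PyTorch Lightning and Keras ship equivalent callbacks):

```python
class EarlyStopping:
    """Stop training once validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it, reset counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no meaningful improvement this epoch
        return self.bad_epochs >= self.patience
```

In the training loop you would also checkpoint the model whenever `best` improves, so you can restore the weights from the best epoch rather than the last one.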
Evaluation Metrics
| Task | Primary Metric | Description |
|---|---|---|
| Classification | Accuracy, Top-5 Accuracy | Percentage of correctly classified images |
| Detection | mAP (mean Average Precision) | Average precision across all classes at various IoU thresholds |
| Segmentation | mIoU (mean IoU) | Average IoU between predicted and ground truth masks across classes |
| All tasks | Precision, Recall, F1 | Trade-off between false positives and false negatives |
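Both mAP and mIoU are built on the IoU (intersection over union) of predicted and ground-truth regions. For axis-aligned boxes the computation is short; a minimal sketch (the `box_iou` name is illustrative):

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (may be empty, hence the max(0, ...) clamps).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Detection benchmarks typically count a prediction as correct when IoU with a ground-truth box exceeds a threshold (0.5 is the classic choice; COCO averages over 0.5 to 0.95).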
Deployment
Model Optimization
Quantize (INT8), prune, or distill your model for faster inference. Use ONNX Runtime or TensorRT for production.
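To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor post-training quantization on a plain list of weights (function names are illustrative; in production you would rely on the calibrated quantizers in ONNX Runtime or TensorRT rather than hand-rolling this):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid scale 0 for all-zero weights
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]
```

The round-trip error per weight is at most half a quantization step (`scale / 2`), which is why INT8 usually costs little accuracy while shrinking the model roughly 4x versus FP32.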
Edge Deployment
For mobile/IoT: use TensorFlow Lite, CoreML (Apple), or ONNX Runtime Mobile. Consider model size and latency constraints.
Cloud Deployment
Serve with TorchServe, TensorFlow Serving, or Triton Inference Server. Use batch inference for throughput-intensive workloads.
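The core of batch inference is simply grouping requests before each forward pass, so the accelerator processes many inputs at once. A minimal sketch of the grouping step (serving frameworks like Triton do this dynamically, adding a maximum-wait timeout so latency stays bounded):

```python
def batched(requests, batch_size=8):
    """Yield fixed-size groups of requests for a single batched forward pass."""
    for i in range(0, len(requests), batch_size):
        yield requests[i:i + batch_size]  # final batch may be smaller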
Monitoring
Track inference latency, accuracy drift, and data distribution shifts in production.
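One common way to quantify data distribution shift is the Population Stability Index (PSI) between a reference sample (e.g., training-set confidence scores) and a production sample. A minimal pure-Python sketch, assuming scores in a known range (the `psi` helper and its thresholds follow the common rule of thumb, not any one library's API):

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of values in [lo, hi].
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    width = (hi - lo) / bins

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(bins - 1, max(0, int((x - lo) / width)))
            counts[idx] += 1
        # Tiny epsilon keeps empty bins from producing log(0).
        return [(c + 1e-6) / (len(xs) + bins * 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computing this periodically over model inputs or output scores gives an early warning before accuracy drift becomes visible in labeled data.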
Ethical Considerations
- Surveillance: Facial recognition and tracking technologies raise significant privacy and civil liberties concerns. Consider whether your application could enable mass surveillance.
- Bias: CV models can exhibit demographic biases (e.g., lower accuracy on darker skin tones). Test across diverse populations and lighting conditions.
- Consent: Ensure proper consent for collecting and using images of people, especially for training data.
- Deepfakes: Generative CV models can create convincing fake images and videos. Consider misuse potential when deploying generative capabilities.
- Transparency: Be clear about what your CV system can and cannot do. Avoid overstating capabilities.
Frequently Asked Questions
How many images do I need to train a model?
With transfer learning and good augmentation, you can achieve reasonable results with as few as 100-500 images per class. For production quality, aim for 1,000-5,000 images per class. More complex tasks like segmentation may need even more annotated data.
Should I use PyTorch or TensorFlow?
Both are excellent choices. PyTorch is more popular in research and has a more Pythonic API. TensorFlow has stronger production/deployment tooling (TFLite, TF Serving, TFX). Most modern CV libraries and pretrained models support both. Choose whichever your team is more comfortable with.
How should I handle images of different sizes?
Most models require fixed input sizes. Resize images to the model's expected input (e.g., 224x224 for ResNet, 640x640 for YOLO). Use letterboxing (padding with gray) to preserve aspect ratio. For segmentation, you can use sliding window inference or process at the original resolution with fully convolutional networks.
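The letterboxing arithmetic mentioned above is easy to get wrong; a minimal sketch of the geometry (the `letterbox_dims` name is illustrative — YOLO implementations bundle an equivalent step into their preprocessing):

```python
def letterbox_dims(w, h, target=640):
    """Compute the resize and padding that fit a (w, h) image into a
    target x target square while preserving aspect ratio."""
    scale = target / max(w, h)                  # shrink the longer side to fit
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (target - new_w) // 2               # left/right gray padding
    pad_y = (target - new_h) // 2               # top/bottom gray padding
    return new_w, new_h, pad_x, pad_y
```

Remember to invert the same scale and offsets when mapping predicted boxes back to the original image coordinates.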
Can I run computer vision models without a GPU?
Yes, for inference. Smaller models like MobileNet or YOLOv8-nano run at reasonable speeds on modern CPUs. For training, a GPU is strongly recommended. Use model optimization (quantization, ONNX Runtime) to speed up CPU inference.
How do I get started learning computer vision?
Start with this course and OpenCV tutorials. Then work through a practical project (build a classifier, train a YOLO model on custom data). Study Stanford CS231n for deeper theory. Join Kaggle competitions for practice with real datasets. Most importantly, build projects that solve problems you care about.