Intermediate

Deploying Models to Edge Devices

Package, optimize, and deploy ML models to edge hardware using TensorRT, TensorFlow Lite, ONNX Runtime, and containerized deployment strategies.

Model Optimization Pipeline

  1. Export from Training Framework

    Export your trained model to an intermediate format: ONNX from PyTorch, SavedModel from TensorFlow, or Core ML from Apple frameworks.

  2. Quantization

    Convert FP32 weights to FP16 or INT8. FP16 halves model size and INT8 quarters it, typically cutting inference time 2-3x with minimal accuracy loss on most tasks. INT8 usually requires a small calibration dataset to choose quantization ranges.

  3. Hardware-Specific Compilation

    Use TensorRT for NVIDIA Jetson, Edge TPU Compiler for Coral, or OpenVINO for Intel. These tools fuse layers, optimize memory layout, and generate hardware-specific instructions.

  4. Benchmarking

    Measure inference latency, throughput, memory usage, and accuracy on the target device. Compare against your requirements before deploying.
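The quantization step above can be illustrated in miniature. The sketch below shows symmetric per-tensor INT8 quantization in pure Python; real toolchains such as TensorRT and TFLite do this per layer with calibration data, so treat the function names and scheme here as illustrative assumptions, not any library's actual API.

```python
import struct

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map FP32 weights to INT8.

    Returns (int8_values, scale); recover values as q * scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.07, -1.27, 0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# FP32 stores 4 bytes per weight, INT8 stores 1: the 4x size reduction
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(bytes(v & 0xFF for v in q))
print(fp32_bytes, int8_bytes)  # 20 1
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

The worst-case rounding error per weight is scale/2, which is why well-conditioned weight distributions quantize with little accuracy loss while outlier-heavy layers may need per-channel scales.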

TensorRT Optimization for Jetson

Python - TensorRT Conversion
import tensorrt as trt
import torch

# Export PyTorch model to ONNX
model = torch.load("detector.pt").eval().cuda()
dummy = torch.randn(1, 3, 640, 640).cuda()
torch.onnx.export(model, dummy, "detector.onnx",
    opset_version=17,
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}})

# Build TensorRT engine with INT8 quantization
# trtexec --onnx=detector.onnx --saveEngine=detector.engine \
#   --int8 --workspace=4096 --best

Containerized Edge Deployment

Containers provide consistent deployment across diverse edge hardware. Use NVIDIA's L4T-based containers for Jetson or lightweight Alpine-based containers for CPU-only edge devices.

Dockerfile - Jetson Edge Container
FROM nvcr.io/nvidia/l4t-tensorrt:r8.6.2-runtime

COPY detector.engine /models/
COPY inference_server.py /app/
COPY requirements.txt /app/

RUN pip3 install -r /app/requirements.txt
WORKDIR /app

CMD ["python3", "inference_server.py"]
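The inference_server.py copied into the container above is not shown; the following is a minimal stand-library sketch of what such a server might look like, with the model call stubbed out. On a real Jetson the run_inference body would deserialize detector.engine and execute it with TensorRT; everything here (handler name, port, response shape) is an illustrative assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_inference(payload):
    """Stand-in for the real TensorRT engine call; replace with engine execution."""
    return {"detections": [], "input_bytes": len(payload)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw request body (e.g. an encoded image frame)
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        body = json.dumps(run_inference(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep container stdout quiet; route metrics elsewhere
        pass

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```

A single-process server like this is only a starting point; production edge deployments typically add batching, health-check endpoints, and a restart policy in the container orchestrator.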

Runtime Comparison

Runtime      | Hardware                      | Quantization | Deployment
TensorRT     | NVIDIA GPU                    | FP16, INT8   | Engine file
TFLite       | CPU, Coral TPU, GPU delegate  | INT8, FP16   | .tflite file
ONNX Runtime | CPU, CUDA, DirectML, TensorRT | INT8, FP16   | .onnx file
OpenVINO     | Intel CPU, GPU, VPU           | INT8, FP16   | IR model

Best practice: Always benchmark on the actual target hardware. Inference performance varies significantly between development machines and edge devices. Build a CI/CD pipeline that includes edge device benchmarking as a gate before deployment approval.
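The benchmarking gate described above can be sketched as a small check script. This is a hedged sketch, not a prescribed tool: the timed callable, warmup/iteration counts, and latency budget are illustrative assumptions, and on a real pipeline the lambda would be replaced by a call into the deployed engine on the target device.

```python
import statistics
import time

def benchmark(infer, warmup=10, iters=100):
    """Time a callable and return p50/p95 latency in milliseconds."""
    for _ in range(warmup):
        infer()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
    }

def gate(results, p95_budget_ms):
    """Deployment gate: pass only if p95 latency fits the budget."""
    return results["p95_ms"] <= p95_budget_ms

if __name__ == "__main__":
    # Stand-in workload; replace with engine execution on the edge device
    stats = benchmark(lambda: sum(range(10_000)))
    print(stats)
    print("PASS" if gate(stats, p95_budget_ms=50.0) else "FAIL")
```

Gating on p95 rather than the mean catches the tail-latency regressions that matter most for real-time edge workloads.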