Intermediate

Edge Inference Runtimes

The inference runtime is the engine that executes your optimized model on edge hardware. Choosing the wrong runtime can leave 2-10x performance on the table. This lesson covers the major runtimes with production deployment code and head-to-head benchmarks so you can pick the right one for your hardware.

Runtime Comparison

Each runtime is optimized for specific hardware. Use this table to choose:

| Runtime | Target Hardware | Input Format | Quantization | Language | License |
|---|---|---|---|---|---|
| TFLite | Android, RPi, Coral TPU, MCUs | .tflite | INT8, FP16 | C++, Python, Java, Swift | Apache 2.0 |
| CoreML | iOS, macOS (Neural Engine, GPU) | .mlmodel / .mlpackage | INT8, FP16 | Swift, Objective-C | Apple proprietary |
| ONNX Runtime Mobile | Android, iOS, Linux ARM, Windows | .onnx / .ort | INT8, FP16 | C++, Python, Java, C# | MIT |
| TensorRT | NVIDIA GPUs (Jetson, dGPU) | .engine / .plan | INT8, FP16, INT4 | C++, Python | NVIDIA proprietary |
| OpenVINO | Intel CPUs, iGPUs, VPUs | .xml + .bin | INT8, FP16 | C++, Python | Apache 2.0 |
| NNAPI | Android (delegates to NPU/GPU/DSP) | via TFLite delegate | INT8, FP16 | C (via NDK) | Android AOSP |
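The decision logic in the table can be sketched as a small helper. The `pick_runtime` function and its platform strings are hypothetical, purely for illustration:

```python
def pick_runtime(platform: str, cross_platform: bool = False) -> str:
    """Map a target platform to the runtime suggested by the table above."""
    if cross_platform:
        return "ONNX Runtime Mobile"  # one model format, one API everywhere
    native = {
        "android": "TFLite",   # optionally with the NNAPI delegate
        "ios": "CoreML",
        "jetson": "TensorRT",
        "intel": "OpenVINO",
        "rpi": "TFLite",
        "coral": "TFLite",     # with the EdgeTPU delegate
    }
    return native.get(platform.lower(), "ONNX Runtime Mobile")

print(pick_runtime("jetson"))                        # TensorRT
print(pick_runtime("android", cross_platform=True))  # ONNX Runtime Mobile
```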

TFLite: Android, Raspberry Pi, Coral

TFLite is the most widely deployed edge runtime. It runs on everything from $3 microcontrollers to high-end Android phones:

import numpy as np
import tflite_runtime.interpreter as tflite
import time

class TFLiteInference:
    """Production TFLite inference wrapper with benchmarking."""

    def __init__(self, model_path: str, num_threads: int = 4):
        # Use EdgeTPU delegate for Coral, CPU otherwise
        try:
            delegate = tflite.load_delegate("libedgetpu.so.1")
            self.interpreter = tflite.Interpreter(
                model_path=model_path,
                experimental_delegates=[delegate]
            )
            self.device = "coral_tpu"
        except (ValueError, OSError):
            self.interpreter = tflite.Interpreter(
                model_path=model_path,
                num_threads=num_threads
            )
            self.device = "cpu"

        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

        # Get input shape and type for preprocessing
        self.input_shape = self.input_details[0]["shape"]  # e.g. [1,224,224,3]
        self.input_dtype = self.input_details[0]["dtype"]   # uint8 or float32

        # Get quantization params for INT8 models
        if self.input_dtype == np.uint8:
            quant = self.input_details[0]["quantization_parameters"]
            self.input_scale = quant["scales"][0]
            self.input_zero_point = quant["zero_points"][0]

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """Resize and quantize input image."""
        from PIL import Image
        img = Image.fromarray(image).resize(
            (self.input_shape[2], self.input_shape[1])
        )
        img_array = np.array(img)

        if self.input_dtype == np.uint8:
            return np.expand_dims(img_array, axis=0).astype(np.uint8)
        else:
            return np.expand_dims(img_array / 255.0, axis=0).astype(np.float32)

    def predict(self, image: np.ndarray) -> dict:
        """Run inference and return predictions with timing."""
        input_data = self.preprocess(image)

        start = time.perf_counter()
        self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
        self.interpreter.invoke()
        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        latency_ms = (time.perf_counter() - start) * 1000

        # Dequantize output if INT8
        if self.output_details[0]["dtype"] == np.uint8:
            quant = self.output_details[0]["quantization_parameters"]
            output = (output.astype(np.float32) - quant["zero_points"][0]) * quant["scales"][0]

        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": self.device
        }

# Usage on Raspberry Pi with Coral USB Accelerator
engine = TFLiteInference("mobilenet_v3_int8.tflite", num_threads=4)
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.92, "latency_ms": 3.5, "device": "coral_tpu"}
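The quantize/dequantize steps in the wrapper are plain affine INT8 arithmetic: real = (q - zero_point) * scale. A standalone illustration, where the scale and zero point are made-up example values rather than parameters from a real model:

```python
import numpy as np

# Affine (asymmetric) uint8 quantization: real = (q - zero_point) * scale
scale, zero_point = 0.00390625, 128  # example params only

def quantize(x: np.ndarray) -> np.ndarray:
    """Map float values to uint8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    """Recover approximate float values from uint8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.0, 0.25, -0.25], dtype=np.float32)
q = quantize(x)
print(q)               # uint8 codes: 128, 192, 64
print(dequantize(q))   # recovers the original values exactly here
```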

CoreML: iOS and macOS

CoreML is the native runtime for Apple devices. It automatically uses the Neural Engine, GPU, or CPU based on model characteristics:

import coremltools as ct
import numpy as np

# Step 1: Convert ONNX or PyTorch model to CoreML
import torch
model = torch.load("mobilenet_v3_trained.pt", weights_only=False)  # a full pickled model, not a state_dict
model.eval()

# Trace the model
example_input = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Convert to CoreML with INT8 quantization
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224),
                         scale=1/255.0, bias=[0, 0, 0])],
    compute_units=ct.ComputeUnit.ALL,  # Neural Engine + GPU + CPU
    minimum_deployment_target=ct.target.iOS16
)

# Quantize to INT8 for smaller size and faster inference
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights
)
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
)
mlmodel_quantized = linear_quantize_weights(mlmodel, config=config)
mlmodel_quantized.save("MobileNetV3.mlpackage")

# Step 2: Swift code for iOS deployment
# -----------------------------------------------
# import CoreML
# import Vision
#
# class EdgeClassifier {
#     private let model: VNCoreMLModel
#
#     init() throws {
#         let config = MLModelConfiguration()
#         config.computeUnits = .all  // Neural Engine preferred
#         let coremlModel = try MobileNetV3(configuration: config)
#         self.model = try VNCoreMLModel(for: coremlModel.model)
#     }
#
#     func classify(image: CGImage) async -> (String, Float, Double) {
#         let request = VNCoreMLRequest(model: model)
#         request.imageCropAndScaleOption = .centerCrop
#
#         let handler = VNImageRequestHandler(cgImage: image)
#         let start = CFAbsoluteTimeGetCurrent()
#         try? handler.perform([request])
#         let latency = (CFAbsoluteTimeGetCurrent() - start) * 1000
#
#         guard let results = request.results as? [VNClassificationObservation],
#               let top = results.first else {
#             return ("unknown", 0.0, latency)
#         }
#         return (top.identifier, top.confidence, latency)
#         // -> ("golden_retriever", 0.94, 2.1)  // 2.1ms on Neural Engine
#     }
# }
# -----------------------------------------------

print("CoreML model saved. Deploy via Xcode to iOS/macOS.")
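The `linear_symmetric` mode used above maps each weight tensor to INT8 with a single scale and no zero point. A minimal numpy sketch of the idea — not CoreML's actual implementation:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray):
    """Symmetric INT8 quantization: scale = max|w| / 127, zero point fixed at 0."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.01], dtype=np.float32)
q, scale = quantize_symmetric(w)
print(q)           # int8 codes: 50, -127, 1
print(q * scale)   # reconstructed weights, close to the originals
```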

TensorRT: NVIDIA Jetson

TensorRT squeezes maximum performance from NVIDIA GPUs. It fuses layers, selects optimal CUDA kernels, and exploits hardware-specific features:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time

class TensorRTInference:
    """Production TensorRT inference for Jetson devices."""

    def __init__(self, onnx_path: str, precision: str = "fp16"):
        self.engine = self._build_engine(onnx_path, precision)
        self.context = self.engine.create_execution_context()
        self._allocate_buffers()

    def _build_engine(self, onnx_path: str, precision: str):
        """Build TensorRT engine from ONNX model."""
        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, logger)

        # Parse ONNX model
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                raise RuntimeError("ONNX parsing failed")

        # Configure builder
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

        if precision == "fp16":
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            config.set_flag(trt.BuilderFlag.INT8)
            # INT8 requires a calibration dataset; _create_calibrator (not shown)
            # must supply a trt.IInt8Calibrator implementation
            config.int8_calibrator = self._create_calibrator()

        # Build engine (takes 1-5 minutes, cache for reuse)
        engine = builder.build_serialized_network(network, config)
        runtime = trt.Runtime(logger)
        return runtime.deserialize_cuda_engine(engine)

    def _allocate_buffers(self):
        """Allocate GPU memory for input and output tensors."""
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self.stream = cuda.Stream()

        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = np.prod(shape)

            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)

            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append({"host": host_mem, "device": device_mem, "shape": shape})
            else:
                self.outputs.append({"host": host_mem, "device": device_mem, "shape": shape})

    def predict(self, image: np.ndarray) -> dict:
        """Run inference on Jetson GPU."""
        # Preprocess: resize, normalize, NCHW format
        input_data = self._preprocess(image)
        np.copyto(self.inputs[0]["host"], input_data.ravel())

        start = time.perf_counter()

        # Transfer input to GPU
        cuda.memcpy_htod_async(
            self.inputs[0]["device"], self.inputs[0]["host"], self.stream
        )

        # Run inference
        self.context.set_tensor_address(
            self.engine.get_tensor_name(0), int(self.inputs[0]["device"])
        )
        self.context.set_tensor_address(
            self.engine.get_tensor_name(1), int(self.outputs[0]["device"])
        )
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Transfer output from GPU
        cuda.memcpy_dtoh_async(
            self.outputs[0]["host"], self.outputs[0]["device"], self.stream
        )
        self.stream.synchronize()

        latency_ms = (time.perf_counter() - start) * 1000
        output = self.outputs[0]["host"].reshape(self.outputs[0]["shape"])

        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": "jetson_gpu"
        }

# Usage on Jetson Orin Nano
engine = TensorRTInference("mobilenet_v3.onnx", precision="fp16")
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.95, "latency_ms": 1.8, "device": "jetson_gpu"}

ONNX Runtime Mobile

ONNX Runtime is the cross-platform option. One model format, one API, multiple hardware backends:

import onnxruntime as ort
import numpy as np
import time

class ONNXEdgeInference:
    """Cross-platform ONNX Runtime inference for edge devices."""

    def __init__(self, model_path: str):
        # Select best available execution provider
        providers = self._get_providers()
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4
        sess_options.inter_op_num_threads = 1

        self.session = ort.InferenceSession(
            model_path, sess_options, providers=providers
        )
        self.input_name = self.session.get_inputs()[0].name
        self.input_shape = self.session.get_inputs()[0].shape
        self.provider = self.session.get_providers()[0]

    def _get_providers(self):
        """Auto-detect best hardware backend."""
        available = ort.get_available_providers()

        # Priority order: CUDA > CoreML > NNAPI > OpenVINO > CPU
        priority = [
            "CUDAExecutionProvider",      # NVIDIA GPUs
            "CoreMLExecutionProvider",     # Apple Neural Engine
            "NnapiExecutionProvider",      # Android NPU/GPU
            "OpenVINOExecutionProvider",   # Intel hardware
            "CPUExecutionProvider",        # Fallback
        ]
        return [p for p in priority if p in available]

    def predict(self, image: np.ndarray) -> dict:
        """Run inference with automatic hardware acceleration."""
        input_data = self._preprocess(image)

        start = time.perf_counter()
        outputs = self.session.run(None, {self.input_name: input_data})
        latency_ms = (time.perf_counter() - start) * 1000

        output = outputs[0]
        top_idx = np.argmax(output[0])

        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "provider": self.provider
        }

    def benchmark(self, image: np.ndarray, num_runs: int = 100) -> dict:
        """Benchmark inference performance."""
        input_data = self._preprocess(image)

        # Warm up
        for _ in range(10):
            self.session.run(None, {self.input_name: input_data})

        times = []
        for _ in range(num_runs):
            start = time.perf_counter()
            self.session.run(None, {self.input_name: input_data})
            times.append((time.perf_counter() - start) * 1000)

        return {
            "provider": self.provider,
            "avg_ms": round(np.mean(times), 2),
            "p50_ms": round(np.percentile(times, 50), 2),
            "p95_ms": round(np.percentile(times, 95), 2),
            "p99_ms": round(np.percentile(times, 99), 2),
            "throughput_fps": round(1000 / np.mean(times), 1)
        }

# Same code works on Android, iOS, Linux ARM, Windows
engine = ONNXEdgeInference("mobilenet_v3.onnx")
result = engine.predict(camera_frame)
benchmark = engine.benchmark(camera_frame)
print(benchmark)
# RPi 5:   {"provider": "CPUExecutionProvider", "avg_ms": 15.2, "throughput_fps": 65.8}
# Jetson:  {"provider": "CUDAExecutionProvider", "avg_ms": 2.1, "throughput_fps": 476.2}
# iPhone:  {"provider": "CoreMLExecutionProvider", "avg_ms": 2.8, "throughput_fps": 357.1}

OpenVINO: Intel Hardware

OpenVINO optimizes models for Intel CPUs, integrated GPUs, and VPUs (Myriad). Essential for x86 edge deployments:

from openvino.runtime import Core
import numpy as np
import time

class OpenVINOInference:
    """Intel-optimized inference using OpenVINO."""

    def __init__(self, model_path: str, device: str = "AUTO"):
        """
        device options:
          "CPU"  - Intel CPUs with AVX/VNNI
          "GPU"  - Intel integrated/discrete GPUs
          "AUTO" - Automatically pick best device
        """
        self.core = Core()

        # Read and compile model
        model = self.core.read_model(model_path)
        self.compiled = self.core.compile_model(model, device)
        self.infer_request = self.compiled.create_infer_request()

        self.input_layer = self.compiled.input(0)
        self.output_layer = self.compiled.output(0)
        self.device = device

    def predict(self, image: np.ndarray) -> dict:
        input_data = self._preprocess(image)

        start = time.perf_counter()
        self.infer_request.infer({self.input_layer: input_data})
        output = self.infer_request.get_output_tensor(0).data
        latency_ms = (time.perf_counter() - start) * 1000

        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": self.device
        }

# Usage on Intel NUC or industrial PC
engine = OpenVINOInference("mobilenet_v3.onnx", device="AUTO")
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.93, "latency_ms": 4.2, "device": "AUTO"}
💡
Apply at work: If you target only one platform, use the native runtime (TFLite for Android, CoreML for iOS, TensorRT for Jetson). If you target multiple platforms, use ONNX Runtime — one model file and one API across all hardware. The performance gap between native and ONNX Runtime is typically only 10-20%.

Runtime Benchmark Comparison

MobileNet-v3 Large INT8 benchmarks across runtimes and hardware (batch size 1, single image classification):

| Hardware | Runtime | Latency (ms) | Throughput (FPS) | Power (W) |
|---|---|---|---|---|
| Google Coral USB | TFLite + EdgeTPU | 3.5 | 286 | 2.5 |
| Jetson Orin Nano | TensorRT FP16 | 1.8 | 556 | 10 |
| Jetson Orin Nano | ONNX Runtime CUDA | 2.4 | 417 | 10 |
| Raspberry Pi 5 | TFLite CPU (4 threads) | 18 | 56 | 5 |
| RPi 5 + Hailo-8L | HailoRT | 4.2 | 238 | 7 |
| iPhone 15 Pro | CoreML (Neural Engine) | 2.1 | 476 | 1.5 |
| Intel N100 Mini PC | OpenVINO CPU | 8.5 | 118 | 6 |
| Samsung S24 (Snapdragon 8 Gen 3) | TFLite + NNAPI | 2.8 | 357 | 2 |
📝
Production reality: Benchmarks vary significantly with model architecture, input resolution, batch size, and thermal conditions. Always benchmark on your target hardware under realistic conditions (sustained load, not cold-start single inference). Jetson devices throttle under sustained load if cooling is inadequate — add a fan.
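Latency and power together determine energy per inference, which is often the deciding metric on battery-powered devices. Using the table's own numbers:

```python
# Energy per inference (millijoules) = power (W) x latency (ms)
benchmarks = {
    "Coral USB (TFLite + EdgeTPU)":     {"latency_ms": 3.5, "power_w": 2.5},
    "Jetson Orin Nano (TensorRT FP16)": {"latency_ms": 1.8, "power_w": 10},
    "iPhone 15 Pro (CoreML)":           {"latency_ms": 2.1, "power_w": 1.5},
}

for name, b in benchmarks.items():
    energy_mj = b["power_w"] * b["latency_ms"]
    print(f"{name}: {energy_mj:.2f} mJ/inference")
# Coral: 8.75 mJ, Jetson: 18.00 mJ, iPhone: 3.15 mJ
```

Note how the Jetson wins on raw latency but costs roughly 2x the energy per inference of the Coral and nearly 6x that of the iPhone's Neural Engine.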

Key Takeaways

  • TFLite is the most versatile edge runtime: runs on Android, RPi, Coral, and microcontrollers. Start here for cross-platform IoT deployments.
  • CoreML delivers the best performance per watt on Apple devices via the Neural Engine. Use it for any iOS/macOS deployment.
  • TensorRT extracts maximum performance from NVIDIA Jetson GPUs. The build step is slow (1-5 min), but inference is meaningfully faster than ONNX Runtime on the same hardware (about 25% lower latency in the benchmark above).
  • ONNX Runtime is the best cross-platform option: one API, one model format, automatic hardware detection. Accept a 10-20% performance gap for multi-platform convenience.
  • Always benchmark under sustained load on target hardware. Cold-start benchmarks are misleading — thermal throttling can increase latency 2-3x under continuous operation.

What's Next

In the next lesson, we will cover edge-cloud synchronization — how to distribute model updates to edge devices via OTA, collect inference data for retraining, and manage bandwidth between edge and cloud.