Edge Inference Runtimes
The inference runtime is the engine that executes your optimized model on edge hardware. Choosing the wrong runtime means leaving 2-10x performance on the table. This lesson covers every major runtime with production deployment code and head-to-head benchmarks so you can pick the right one for your hardware.
Runtime Comparison
Each runtime is optimized for specific hardware. Use this table to choose:
| Runtime | Target Hardware | Input Format | Quantization | Language | License |
|---|---|---|---|---|---|
| TFLite | Android, RPi, Coral TPU, MCUs | .tflite | INT8, FP16 | C++, Python, Java, Swift | Apache 2.0 |
| CoreML | iOS, macOS (Neural Engine, GPU) | .mlmodel / .mlpackage | INT8, FP16 | Swift, Objective-C | Apple proprietary |
| ONNX Runtime Mobile | Android, iOS, Linux ARM, Windows | .onnx / .ort | INT8, FP16 | C++, Python, Java, C# | MIT |
| TensorRT | NVIDIA GPUs (Jetson, dGPU) | .engine / .plan | INT8, FP16, INT4 | C++, Python | NVIDIA proprietary |
| OpenVINO | Intel CPUs, iGPUs, VPUs | .xml + .bin | INT8, FP16 | C++, Python | Apache 2.0 |
| NNAPI | Android (delegates to NPU/GPU/DSP) | Via TFLite delegate | INT8, FP16 | C (via NDK) | Android AOSP |
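The table's default choices can be captured in a few lines of plain Python. This is a hypothetical helper for illustration only (the `RUNTIME_BY_PLATFORM` mapping and `pick_runtime` name are not from any library); it encodes the per-platform recommendation, with ONNX Runtime as the cross-platform fallback:

```python
# Hypothetical mapping of platform -> default runtime, per the table above
RUNTIME_BY_PLATFORM = {
    "android": "TFLite",
    "ios": "CoreML",
    "macos": "CoreML",
    "jetson": "TensorRT",
    "raspberry_pi": "TFLite",
    "coral": "TFLite",
    "intel_x86": "OpenVINO",
}

def pick_runtime(platform: str) -> str:
    """Return the table's default runtime for a platform, falling back
    to the cross-platform option (ONNX Runtime) when nothing matches."""
    return RUNTIME_BY_PLATFORM.get(platform.lower(), "ONNX Runtime")

print(pick_runtime("jetson"))       # -> TensorRT
print(pick_runtime("windows_arm"))  # -> ONNX Runtime
```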
TFLite: Android, Raspberry Pi, Coral
TFLite is the most widely deployed edge runtime. It runs on everything from $3 microcontrollers to high-end Android phones:
```python
import time

import numpy as np
import tflite_runtime.interpreter as tflite
from PIL import Image


class TFLiteInference:
    """Production TFLite inference wrapper with benchmarking."""

    def __init__(self, model_path: str, num_threads: int = 4):
        # Use the EdgeTPU delegate for Coral, CPU otherwise
        try:
            delegate = tflite.load_delegate("libedgetpu.so.1")
            self.interpreter = tflite.Interpreter(
                model_path=model_path,
                experimental_delegates=[delegate],
            )
            self.device = "coral_tpu"
        except (ValueError, OSError):
            self.interpreter = tflite.Interpreter(
                model_path=model_path,
                num_threads=num_threads,
            )
            self.device = "cpu"
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()
        # Input shape and type drive preprocessing
        self.input_shape = self.input_details[0]["shape"]  # e.g. [1, 224, 224, 3]
        self.input_dtype = self.input_details[0]["dtype"]  # uint8 or float32
        # Quantization params for INT8 models
        if self.input_dtype == np.uint8:
            quant = self.input_details[0]["quantization_parameters"]
            self.input_scale = quant["scales"][0]
            self.input_zero_point = quant["zero_points"][0]

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """Resize and quantize the input image."""
        img = Image.fromarray(image).resize(
            (self.input_shape[2], self.input_shape[1])
        )
        img_array = np.array(img)
        if self.input_dtype == np.uint8:
            return np.expand_dims(img_array, axis=0).astype(np.uint8)
        return np.expand_dims(img_array / 255.0, axis=0).astype(np.float32)

    def predict(self, image: np.ndarray) -> dict:
        """Run inference and return predictions with timing."""
        input_data = self.preprocess(image)
        start = time.perf_counter()
        self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
        self.interpreter.invoke()
        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        latency_ms = (time.perf_counter() - start) * 1000
        # Dequantize the output if the model is INT8
        if self.output_details[0]["dtype"] == np.uint8:
            quant = self.output_details[0]["quantization_parameters"]
            output = (output.astype(np.float32) - quant["zero_points"][0]) * quant["scales"][0]
        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": self.device,
        }


# Usage on a Raspberry Pi with a Coral USB Accelerator
engine = TFLiteInference("mobilenet_v3_int8.tflite", num_threads=4)
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.92, "latency_ms": 3.5, "device": "coral_tpu"}
```
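The dequantization step in `predict` follows the standard affine quantization scheme: real = (q - zero_point) * scale, with quantization as its clamped inverse. A minimal stdlib sketch (the scale and zero point here are illustrative, not from a real model):

```python
def quantize(real: float, scale: float, zero_point: int) -> int:
    """Map a real value into uint8: q = round(real / scale) + zero_point."""
    q = round(real / scale) + zero_point
    return max(0, min(255, q))  # clamp to the uint8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Inverse mapping, as applied to INT8 model outputs above."""
    return (q - zero_point) * scale

# Round trip with illustrative params (scale = 1/256, zero_point = 0)
scale, zp = 1 / 256, 0
q = quantize(0.92, scale, zp)
print(q, dequantize(q, scale, zp))  # -> 236 0.921875
```

Note the round trip is lossy: the quantization error (here 0.001875) is the price of storing activations in 8 bits.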
CoreML: iOS and macOS
CoreML is the native runtime for Apple devices. It automatically uses the Neural Engine, GPU, or CPU based on model characteristics:
```python
import coremltools as ct
import torch

# Step 1: Convert a trained PyTorch model to CoreML
model = torch.load("mobilenet_v3_trained.pt")
model.eval()

# Trace the model with a representative input
example_input = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Convert to CoreML, targeting the Neural Engine, GPU, and CPU
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224),
                         scale=1 / 255.0, bias=[0, 0, 0])],
    compute_units=ct.ComputeUnit.ALL,  # Neural Engine + GPU + CPU
    minimum_deployment_target=ct.target.iOS16,
)

# Quantize weights to INT8 for smaller size and faster inference
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
)
mlmodel_quantized = linear_quantize_weights(mlmodel, config=config)
mlmodel_quantized.save("MobileNetV3.mlpackage")

# Step 2: Swift code for iOS deployment
# -----------------------------------------------
# import CoreML
# import Vision
#
# class EdgeClassifier {
#     private let model: VNCoreMLModel
#
#     init() throws {
#         let config = MLModelConfiguration()
#         config.computeUnits = .all  // Neural Engine preferred
#         let coremlModel = try MobileNetV3(configuration: config)
#         self.model = try VNCoreMLModel(for: coremlModel.model)
#     }
#
#     func classify(image: CGImage) async -> (String, Float, Double) {
#         let request = VNCoreMLRequest(model: model)
#         request.imageCropAndScaleOption = .centerCrop
#
#         let handler = VNImageRequestHandler(cgImage: image)
#         let start = CFAbsoluteTimeGetCurrent()
#         try? handler.perform([request])
#         let latency = (CFAbsoluteTimeGetCurrent() - start) * 1000
#
#         guard let results = request.results as? [VNClassificationObservation],
#               let top = results.first else {
#             return ("unknown", 0.0, latency)
#         }
#         return (top.identifier, top.confidence, latency)
#         // -> ("golden_retriever", 0.94, 2.1)  // 2.1 ms on the Neural Engine
#     }
# }
# -----------------------------------------------

print("CoreML model saved. Deploy via Xcode to iOS/macOS.")
```
TensorRT: NVIDIA Jetson
TensorRT squeezes maximum performance from NVIDIA GPUs. It fuses layers, selects optimal CUDA kernels, and exploits hardware-specific features:
```python
import time

import numpy as np
import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class TensorRTInference:
    """Production TensorRT inference for Jetson devices."""

    def __init__(self, onnx_path: str, precision: str = "fp16"):
        self.engine = self._build_engine(onnx_path, precision)
        self.context = self.engine.create_execution_context()
        self._allocate_buffers()

    def _build_engine(self, onnx_path: str, precision: str):
        """Build a TensorRT engine from an ONNX model."""
        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, logger)
        # Parse the ONNX model
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                raise RuntimeError("ONNX parsing failed")
        # Configure the builder
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
        if precision == "fp16":
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            config.set_flag(trt.BuilderFlag.INT8)
            # INT8 requires a calibrator fed with a representative dataset
            # (an IInt8Calibrator implementation, not shown here)
            config.int8_calibrator = self._create_calibrator()
        # Build the engine (takes 1-5 minutes; cache the result for reuse)
        engine = builder.build_serialized_network(network, config)
        runtime = trt.Runtime(logger)
        return runtime.deserialize_cuda_engine(engine)

    def _allocate_buffers(self):
        """Allocate page-locked host and GPU memory for I/O tensors."""
        self.inputs = []
        self.outputs = []
        self.stream = cuda.Stream()
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = int(np.prod(shape))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append({"host": host_mem, "device": device_mem, "shape": shape})
            else:
                self.outputs.append({"host": host_mem, "device": device_mem, "shape": shape})

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Resize, normalize, and convert HWC uint8 to NCHW float32."""
        from PIL import Image
        img = Image.fromarray(image).resize((224, 224))
        arr = np.array(img).astype(np.float32) / 255.0
        return np.expand_dims(arr.transpose(2, 0, 1), axis=0)

    def predict(self, image: np.ndarray) -> dict:
        """Run inference on the Jetson GPU."""
        input_data = self._preprocess(image)
        np.copyto(self.inputs[0]["host"], input_data.ravel())
        start = time.perf_counter()
        # Transfer the input to the GPU
        cuda.memcpy_htod_async(
            self.inputs[0]["device"], self.inputs[0]["host"], self.stream
        )
        # Bind tensor addresses and run inference
        self.context.set_tensor_address(
            self.engine.get_tensor_name(0), int(self.inputs[0]["device"])
        )
        self.context.set_tensor_address(
            self.engine.get_tensor_name(1), int(self.outputs[0]["device"])
        )
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        # Transfer the output back from the GPU
        cuda.memcpy_dtoh_async(
            self.outputs[0]["host"], self.outputs[0]["device"], self.stream
        )
        self.stream.synchronize()
        latency_ms = (time.perf_counter() - start) * 1000
        output = self.outputs[0]["host"].reshape(self.outputs[0]["shape"])
        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": "jetson_gpu",
        }


# Usage on a Jetson Orin Nano
engine = TensorRTInference("mobilenet_v3.onnx", precision="fp16")
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.95, "latency_ms": 1.8, "device": "jetson_gpu"}
```
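Because the engine build takes minutes, production code caches the serialized engine on disk and rebuilds only when the cache is missing. Here is a runtime-agnostic sketch of that pattern (the `load_or_build_engine` name and stub builder are assumptions for illustration); in real TensorRT code, `build_fn` would wrap `builder.build_serialized_network()` and the returned bytes would go to `runtime.deserialize_cuda_engine()`:

```python
import os

def load_or_build_engine(cache_path: str, build_fn) -> bytes:
    """Return serialized engine bytes from cache, or build once and cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()  # cache hit: skip the 1-5 minute build
    engine_bytes = build_fn()  # slow path: build exactly once
    with open(cache_path, "wb") as f:
        f.write(engine_bytes)
    return engine_bytes

# Usage with a stub builder standing in for the real TensorRT build
engine_bytes = load_or_build_engine(
    "/tmp/mobilenet_v3_fp16.engine", lambda: b"serialized-engine"
)
```

Remember that TensorRT engines are tied to the GPU and TensorRT version they were built on, so the cache should be invalidated when either changes.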
ONNX Runtime Mobile
ONNX Runtime is the cross-platform option. One model format, one API, multiple hardware backends:
```python
import time

import numpy as np
import onnxruntime as ort


class ONNXEdgeInference:
    """Cross-platform ONNX Runtime inference for edge devices."""

    def __init__(self, model_path: str):
        # Select the best available execution provider
        providers = self._get_providers()
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4
        sess_options.inter_op_num_threads = 1
        self.session = ort.InferenceSession(
            model_path, sess_options, providers=providers
        )
        self.input_name = self.session.get_inputs()[0].name
        self.input_shape = self.session.get_inputs()[0].shape
        self.provider = self.session.get_providers()[0]

    def _get_providers(self):
        """Auto-detect the best hardware backend."""
        available = ort.get_available_providers()
        # Priority order: CUDA > CoreML > NNAPI > OpenVINO > CPU
        priority = [
            "CUDAExecutionProvider",      # NVIDIA GPUs
            "CoreMLExecutionProvider",    # Apple Neural Engine
            "NnapiExecutionProvider",     # Android NPU/GPU
            "OpenVINOExecutionProvider",  # Intel hardware
            "CPUExecutionProvider",       # Fallback
        ]
        return [p for p in priority if p in available]

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Resize, normalize, and convert HWC uint8 to NCHW float32."""
        from PIL import Image
        img = Image.fromarray(image).resize((224, 224))
        arr = np.array(img).astype(np.float32) / 255.0
        return np.expand_dims(arr.transpose(2, 0, 1), axis=0)

    def predict(self, image: np.ndarray) -> dict:
        """Run inference with automatic hardware acceleration."""
        input_data = self._preprocess(image)
        start = time.perf_counter()
        outputs = self.session.run(None, {self.input_name: input_data})
        latency_ms = (time.perf_counter() - start) * 1000
        output = outputs[0]
        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "provider": self.provider,
        }

    def benchmark(self, image: np.ndarray, num_runs: int = 100) -> dict:
        """Benchmark inference performance."""
        input_data = self._preprocess(image)
        # Warm up
        for _ in range(10):
            self.session.run(None, {self.input_name: input_data})
        times = []
        for _ in range(num_runs):
            start = time.perf_counter()
            self.session.run(None, {self.input_name: input_data})
            times.append((time.perf_counter() - start) * 1000)
        return {
            "provider": self.provider,
            "avg_ms": round(np.mean(times), 2),
            "p50_ms": round(np.percentile(times, 50), 2),
            "p95_ms": round(np.percentile(times, 95), 2),
            "p99_ms": round(np.percentile(times, 99), 2),
            "throughput_fps": round(1000 / np.mean(times), 1),
        }


# The same code works on Android, iOS, Linux ARM, and Windows
engine = ONNXEdgeInference("mobilenet_v3.onnx")
result = engine.predict(camera_frame)
benchmark = engine.benchmark(camera_frame)
print(benchmark)
# RPi 5:  {"provider": "CPUExecutionProvider",    "avg_ms": 15.2, "throughput_fps": 65.8}
# Jetson: {"provider": "CUDAExecutionProvider",   "avg_ms": 2.1,  "throughput_fps": 476.2}
# iPhone: {"provider": "CoreMLExecutionProvider", "avg_ms": 2.8,  "throughput_fps": 357.1}
```
OpenVINO: Intel Hardware
OpenVINO optimizes models for Intel CPUs, integrated GPUs, and VPUs (Myriad). Essential for x86 edge deployments:
```python
import time

import numpy as np
from openvino.runtime import Core


class OpenVINOInference:
    """Intel-optimized inference using OpenVINO."""

    def __init__(self, model_path: str, device: str = "AUTO"):
        """
        device options:
            "CPU"  - Intel CPUs with AVX/VNNI
            "GPU"  - Intel integrated/discrete GPUs
            "AUTO" - Automatically pick the best device
        """
        self.core = Core()
        # Read and compile the model
        model = self.core.read_model(model_path)
        self.compiled = self.core.compile_model(model, device)
        self.infer_request = self.compiled.create_infer_request()
        self.input_layer = self.compiled.input(0)
        self.output_layer = self.compiled.output(0)
        self.device = device

    def _preprocess(self, image: np.ndarray) -> np.ndarray:
        """Resize, normalize, and convert HWC uint8 to NCHW float32."""
        from PIL import Image
        img = Image.fromarray(image).resize((224, 224))
        arr = np.array(img).astype(np.float32) / 255.0
        return np.expand_dims(arr.transpose(2, 0, 1), axis=0)

    def predict(self, image: np.ndarray) -> dict:
        input_data = self._preprocess(image)
        start = time.perf_counter()
        self.infer_request.infer({self.input_layer: input_data})
        output = self.infer_request.get_output_tensor(0).data
        latency_ms = (time.perf_counter() - start) * 1000
        top_idx = np.argmax(output[0])
        return {
            "class_id": int(top_idx),
            "confidence": float(output[0][top_idx]),
            "latency_ms": round(latency_ms, 2),
            "device": self.device,
        }


# Usage on an Intel NUC or industrial PC
engine = OpenVINOInference("mobilenet_v3.onnx", device="AUTO")
result = engine.predict(camera_frame)
# -> {"class_id": 281, "confidence": 0.93, "latency_ms": 4.2, "device": "AUTO"}
```
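One caveat that applies to every `predict` method in this lesson: the `confidence` field is only a probability if the model's final layer is a softmax. If the exported model emits raw logits, apply a softmax yourself first. A minimal stdlib sketch:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities.

    Subtracting the max before exponentiating avoids overflow
    without changing the result."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(max(probs))  # probability of the top class
```

The argmax is unchanged by softmax (it is monotonic), so only the reported confidence value depends on this step, not the predicted class.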
Runtime Benchmark Comparison
MobileNet-v3 Large INT8 benchmarks across runtimes and hardware (batch size 1, single image classification):
| Hardware | Runtime | Latency (ms) | Throughput (FPS) | Power (W) |
|---|---|---|---|---|
| Google Coral USB | TFLite + EdgeTPU | 3.5 | 286 | 2.5 |
| Jetson Orin Nano | TensorRT FP16 | 1.8 | 556 | 10 |
| Jetson Orin Nano | ONNX Runtime CUDA | 2.4 | 417 | 10 |
| Raspberry Pi 5 | TFLite CPU (4 threads) | 18 | 56 | 5 |
| RPi 5 + Hailo-8L | HailoRT | 4.2 | 238 | 7 |
| iPhone 15 Pro | CoreML (Neural Engine) | 2.1 | 476 | 1.5 |
| Intel N100 Mini PC | OpenVINO CPU | 8.5 | 118 | 6 |
| Samsung S24 (Snapdragon 8 Gen 3) | TFLite + NNAPI | 2.8 | 357 | 2 |
Key Takeaways
- TFLite is the most versatile edge runtime: runs on Android, RPi, Coral, and microcontrollers. Start here for cross-platform IoT deployments.
- CoreML delivers the best performance per watt on Apple devices via the Neural Engine. Use it for any iOS/macOS deployment.
- TensorRT extracts maximum performance from NVIDIA Jetson GPUs. The build step is slow (1-5 min) but inference is 30-50% faster than ONNX Runtime on the same hardware.
- ONNX Runtime is the best cross-platform option: one API, one model format, automatic hardware detection. Accept a 10-20% performance gap for multi-platform convenience.
- Always benchmark under sustained load on target hardware. Cold-start benchmarks are misleading — thermal throttling can increase latency 2-3x under continuous operation.
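The sustained-load point in the last takeaway can be checked with a long-running benchmark that compares early and late latency windows. This is a runtime-agnostic sketch (the `sustained_benchmark` name and the sleep-based stand-in workload are illustrative); on a device you would pass your engine's real `predict` call:

```python
import time

def sustained_benchmark(infer, duration_s: float = 60.0, window: int = 100) -> dict:
    """Run infer() continuously and compare the first and last latency windows.

    A late/early ratio well above 1.0 suggests thermal throttling under load."""
    times = []
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        start = time.perf_counter()
        infer()
        times.append((time.perf_counter() - start) * 1000)
    early = times[:window]
    late = times[-window:]
    return {
        "runs": len(times),
        "early_avg_ms": round(sum(early) / len(early), 2),
        "late_avg_ms": round(sum(late) / len(late), 2),
        "throttle_ratio": round((sum(late) / len(late)) / (sum(early) / len(early)), 2),
    }

# Stand-in workload; on device, replace with lambda: engine.predict(camera_frame)
report = sustained_benchmark(lambda: time.sleep(0.001), duration_s=1.0, window=10)
print(report)
```

Run this for several minutes on the target hardware, in its real enclosure: passively cooled devices often show a throttle ratio of 2-3x that a 100-iteration cold-start benchmark never reveals.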
What's Next
In the next lesson, we will cover edge-cloud synchronization — how to distribute model updates to edge devices via OTA, collect inference data for retraining, and manage bandwidth between edge and cloud.
Lilly Tech Systems