Intermediate

AWS Lambda for AI Inference

Deploy machine learning models on AWS Lambda using container images, EFS for model storage, and optimized inference runtimes for production workloads.

Lambda Limits for ML

| Resource | Limit | ML Impact |
|---|---|---|
| Memory | Up to 10 GB | Fits most sklearn, XGBoost, and small PyTorch models |
| Container Image | 10 GB | Enough for PyTorch/TF runtime + model weights |
| Ephemeral Storage | Up to 10 GB | Cache models downloaded from S3 or EFS |
| Timeout | 15 minutes | Sufficient for most single-request inference |
| vCPUs | Up to 6 (at 10 GB memory) | Enables multi-threaded inference |

Container-Based ML Deployment

Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install ML dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

# Copy model and inference code
COPY model/ ./model/
COPY app.py .

# Set the handler
CMD ["app.handler"]
Python - Lambda Handler
import json
import onnxruntime as ort

# Load model OUTSIDE handler for reuse across invocations
session = ort.InferenceSession("model/classifier.onnx")

def handler(event, context):
    body = json.loads(event["body"])
    inputs = {session.get_inputs()[0].name: [body["features"]]}

    result = session.run(None, inputs)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "prediction": result[0].tolist(),
            "model_version": "v2.1"
        })
    }
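Lambda container images must live in Amazon ECR. The image above could be built and pushed with something like the following (the repository name, account ID, and region are placeholders for your own values):

```shell
# Create an ECR repository for the inference image (name is illustrative)
aws ecr create-repository --repository-name ml-inference

# Build locally, then tag for ECR
docker build -t ml-inference .
docker tag ml-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest

# Authenticate Docker against ECR and push
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest
```

The pushed image URI is what you reference when creating the Lambda function with a container image package type.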

Using EFS for Large Models

When model weights exceed the 10 GB container image limit, or when several Lambda functions need to share the same models, mount an Amazon EFS file system. Models are read from EFS on cold start and cached in memory for subsequent invocations.
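A minimal sketch of that pattern, assuming the EFS access point is mounted at /mnt/models (the mount path, environment variable, and cache size here are illustrative choices, not fixed APIs):

```python
import functools
import os

# Mount path configured in the function's file-system settings
# (assumption: /mnt/models, overridable via the MODEL_DIR env var)
MODEL_DIR = os.environ.get("MODEL_DIR", "/mnt/models")


def model_path(name):
    """Resolve a model file on the EFS mount."""
    return os.path.join(MODEL_DIR, name)


@functools.lru_cache(maxsize=4)
def load_session(name):
    """Load an ONNX session from EFS once per container.

    Warm invocations hit the in-memory lru_cache instead of
    re-reading the model over the network file system.
    """
    import onnxruntime as ort  # imported lazily to keep cold starts lean
    return ort.InferenceSession(model_path(name))
```

A handler would then call `load_session("classifier.onnx")` on every invocation; only the first call per execution environment pays the EFS read.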

Provisioned Concurrency

For latency-sensitive inference, configure provisioned concurrency to keep a specified number of Lambda instances warm and ready. This eliminates cold starts at the cost of paying for idle capacity.
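For example, provisioned concurrency can be configured with the AWS CLI against a published version or alias (the function name, alias, and instance count below are placeholders):

```shell
# Provisioned concurrency targets a version or alias, not $LATEST
aws lambda publish-version --function-name ml-inference

# Keep 5 execution environments initialized for the "prod" alias
aws lambda put-provisioned-concurrency-config \
  --function-name ml-inference \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```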

Best practice: Use ONNX Runtime instead of full PyTorch or TensorFlow for Lambda inference. ONNX Runtime has a much smaller footprint, loads faster, and often delivers better inference performance on CPU. Convert your models to ONNX format before deployment.