Intermediate

AWS Lambda for AI Inference

Deploy machine learning models on AWS Lambda using container images, EFS for model storage, and optimized inference runtimes for production workloads.

Lambda Limits for ML

| Resource | Limit | ML Impact |
|---|---|---|
| Memory | Up to 10 GB | Fits most sklearn, XGBoost, and small PyTorch models |
| Container Image | 10 GB | Enough for PyTorch/TF runtime + model weights |
| Ephemeral Storage | Up to 10 GB | Cache models downloaded from S3 or EFS |
| Timeout | 15 minutes | Sufficient for most single-request inference |
| vCPUs | Up to 6 (at 10 GB memory) | Enables multi-threaded inference |

Container-Based ML Deployment

Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install ML dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

# Copy model and inference code
COPY model/ ./model/
COPY app.py .

# Set the handler
CMD ["app.handler"]
Python - Lambda Handler
import json
import onnxruntime as ort

# Load model OUTSIDE handler for reuse across invocations
session = ort.InferenceSession("model/classifier.onnx")

def handler(event, context):
    body = json.loads(event["body"])
    inputs = {session.get_inputs()[0].name: [body["features"]]}

    result = session.run(None, inputs)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "prediction": result[0].tolist(),
            "model_version": "v2.1"
        })
    }
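Lambda container images must live in Amazon ECR. The image above could be built and pushed with something like the following (the repository name, account ID, and region are placeholders for your own values):

```shell
# Create an ECR repository for the inference image (name is illustrative)
aws ecr create-repository --repository-name ml-inference

# Build locally, then tag for ECR
docker build -t ml-inference .
docker tag ml-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest

# Authenticate Docker against ECR and push
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest
```

The pushed image URI is what you reference when creating the Lambda function with a container image package type.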

Using EFS for Large Models

When model weights exceed the 10 GB container image limit, or when several Lambda functions need to share the same models, mount an Amazon EFS file system. Models are read from EFS on cold start and cached in memory for subsequent invocations.
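A minimal sketch of that pattern, assuming the EFS access point is mounted at /mnt/models (the mount path, environment variable, and cache size here are illustrative choices, not fixed APIs):

```python
import functools
import os

# Mount path configured in the function's file-system settings
# (assumption: /mnt/models, overridable via the MODEL_DIR env var)
MODEL_DIR = os.environ.get("MODEL_DIR", "/mnt/models")


def model_path(name):
    """Resolve a model file on the EFS mount."""
    return os.path.join(MODEL_DIR, name)


@functools.lru_cache(maxsize=4)
def load_session(name):
    """Load an ONNX session from EFS once per container.

    Warm invocations hit the in-memory lru_cache instead of
    re-reading the model over the network file system.
    """
    import onnxruntime as ort  # imported lazily to keep cold starts lean
    return ort.InferenceSession(model_path(name))
```

A handler would then call `load_session("classifier.onnx")` on every invocation; only the first call per execution environment pays the EFS read.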

Provisioned Concurrency

For latency-sensitive inference, configure provisioned concurrency to keep a specified number of Lambda instances warm and ready. This eliminates cold starts at the cost of paying for idle capacity.
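For example, provisioned concurrency can be configured with the AWS CLI against a published version or alias (the function name, alias, and instance count below are placeholders):

```shell
# Provisioned concurrency targets a version or alias, not $LATEST
aws lambda publish-version --function-name ml-inference

# Keep 5 execution environments initialized for the "prod" alias
aws lambda put-provisioned-concurrency-config \
  --function-name ml-inference \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```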

Best practice: Use ONNX Runtime instead of full PyTorch or TensorFlow for Lambda inference. ONNX Runtime has a much smaller footprint, loads faster, and often delivers better inference performance on CPU. Convert your models to ONNX format before deployment.