Intermediate
# AWS Lambda for AI Inference
Deploy machine learning models on AWS Lambda using container images, EFS for model storage, and optimized inference runtimes for production workloads.
## Lambda Limits for ML
| Resource | Limit | ML Impact |
|---|---|---|
| Memory | Up to 10 GB | Fits most sklearn, XGBoost, and small PyTorch models |
| Container Image | 10 GB | Enough for PyTorch/TF runtime + model weights |
| Ephemeral Storage | Up to 10 GB | Cache models downloaded from S3 or EFS |
| Timeout | 15 minutes | Sufficient for most single-request inference |
| vCPUs | Up to 6 (at 10 GB memory) | Enables multi-threaded inference |
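The limits above are mostly opt-in: memory, ephemeral storage, and timeout all default to much lower values and must be raised explicitly. A sketch of maxing them out for an existing function (the function name is a placeholder):

```shell
# Raise memory (which also scales vCPUs), /tmp storage, and timeout
# to their ML-friendly maximums for a function named "ml-inference".
aws lambda update-function-configuration \
  --function-name ml-inference \
  --memory-size 10240 \
  --ephemeral-storage '{"Size": 10240}' \
  --timeout 900
```

Because vCPU count is tied to memory, over-provisioning memory is a common way to buy CPU for inference even when the model itself is small.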
## Container-Based ML Deployment
**Dockerfile**

```dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install ML dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

# Copy model and inference code
COPY model/ ./model/
COPY app.py .

# Set the handler
CMD ["app.handler"]
```
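Once the image is defined, a typical build-and-push flow looks like the following (the account ID, region, and repository name are placeholders):

```shell
# Create an ECR repository, build the image, and push it
aws ecr create-repository --repository-name ml-inference
docker build -t ml-inference .

# Authenticate Docker to ECR, then tag and push
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag ml-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest
```

The Lambda function is then created (or updated) to point at the pushed image URI instead of a zip package.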
**Python - Lambda Handler**

```python
import json

import onnxruntime as ort

# Load model OUTSIDE the handler so it is reused across warm invocations
session = ort.InferenceSession("model/classifier.onnx")


def handler(event, context):
    body = json.loads(event["body"])
    inputs = {session.get_inputs()[0].name: [body["features"]]}
    result = session.run(None, inputs)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "prediction": result[0].tolist(),
            "model_version": "v2.1",
        }),
    }
```
## Using EFS for Large Models
When model weights exceed the 10 GB container image limit, or you want to share a single copy of the weights across multiple Lambda functions, mount an EFS file system. Models are read from EFS on cold start and cached in memory (or in `/tmp`) for subsequent warm invocations.
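A common refinement is to copy the weights from EFS into Lambda's ephemeral `/tmp` storage once per execution environment, since local disk reads are faster than repeated EFS reads. A minimal sketch, assuming the EFS access point is mounted at `/mnt/models` (both paths are placeholders):

```python
import os
import shutil

EFS_MODEL_PATH = "/mnt/models/classifier.onnx"  # EFS access point mount
LOCAL_CACHE_PATH = "/tmp/classifier.onnx"       # Lambda ephemeral storage


def ensure_local_model(src=EFS_MODEL_PATH, dst=LOCAL_CACHE_PATH):
    """Copy the model out of EFS once per execution environment.

    The first (cold) invocation pays the copy cost; warm invocations
    find the cached file and skip the I/O entirely.
    """
    if not os.path.exists(dst):
        shutil.copyfile(src, dst)
    return dst
```

The handler would call `ensure_local_model()` and pass the returned path to the inference runtime's loader.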
## Provisioned Concurrency
For latency-sensitive inference, configure provisioned concurrency to keep a specified number of Lambda instances warm and ready. This eliminates cold starts at the cost of paying for idle capacity.
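Provisioned concurrency is configured against a published version or alias, not `$LATEST`. A sketch of keeping five environments warm (function and alias names are placeholders):

```shell
# Publish an immutable version of the function
aws lambda publish-version --function-name ml-inference

# Keep 5 execution environments initialized for the "prod" alias
aws lambda put-provisioned-concurrency-config \
  --function-name ml-inference \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```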
**Best practice:** Use ONNX Runtime instead of full PyTorch or TensorFlow for Lambda inference. ONNX Runtime has a much smaller footprint, loads faster, and often delivers better inference performance on CPU. Convert your models to ONNX format before deployment.
Lilly Tech Systems