AWS Infrastructure for LLM Workloads
There are three broad ways to run LLMs on AWS: (1) use a fully‑managed foundation‑model service, (2) host models yourself, or (3) mix both. Your choice depends on latency, compliance, customization, and cost constraints.
Managed FM access (Amazon Bedrock)
Amazon Bedrock offers managed API access to multiple foundation models, with unified auth and billing, serverless inference, and tooling for guardrails, evaluation, and agents. It is a strong fit for fast prototyping and for production workloads where you would rather not manage GPUs.
- Single API to invoke and evaluate different models.
- Built‑in features for safety (guardrails), prompt management, knowledge bases (RAG), and orchestration.
- Fine‑tuning options for supported models; data stays within your AWS account boundary.
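For instance, a minimal chat call through Bedrock's Converse API might look like this Python sketch (the model ID and Region are placeholders; verify availability in your account first):

```python
import boto3

# Bedrock runtime client; the Region must offer the chosen model.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID is illustrative -- check availability in your target Region.
response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Summarize our refund policy in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```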
Self‑hosting (Amazon SageMaker / EKS / EC2)
Bring your own model or container. Full control over runtime, networking, and scaling.
- SageMaker for training, hosting, multi‑model endpoints, async & real‑time inference, Model Registry & Pipelines.
- EKS for Kubernetes‑native serving stacks (e.g., vLLM, TensorRT‑LLM) with auto‑scaling and custom routing.
- EC2 for bespoke setups; choose GPU instances (e.g., the P4/P5 and G families) or purpose‑built accelerators such as AWS Inferentia (Inf2).
Hybrid (best of both)
Use managed models for general tasks and self‑host for specialized or data‑sensitive workloads. Route traffic via an API gateway or service mesh and evaluate continuously.
Core supporting services:
- Networking & identity: VPC, PrivateLink, IAM.
- Security & audit: KMS, Secrets Manager, CloudWatch/CloudTrail.
- Storage & retrieval: S3/EFS for artifacts and vector stores; DynamoDB/OpenSearch for metadata and retrieval.
- APIs & orchestration: API Gateway + Lambda or ECS for edge APIs; Step Functions for orchestration.
Common Approaches & Reference Patterns
Zero‑/Few‑Shot Prompting
Fastest path to value: engineer prompts, constrain outputs with JSON schemas, and evaluate with automatic metrics plus human review. Pair with Bedrock guardrails or custom moderation.
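A sketch of the idea in Python: a system instruction that pins the output to JSON, plus two invented few‑shot examples, assembled in the message format the Bedrock Converse API expects:

```python
SYSTEM = (
    "You are a support-ticket classifier. "
    'Reply with JSON only: {"category": "...", "confidence": 0.0-1.0}'
)  # pass via the Converse API's separate `system` parameter

# Invented few-shot pairs; keep them short to save tokens.
FEW_SHOT = [
    ("My card was charged twice.", '{"category": "billing", "confidence": 0.95}'),
    ("The app crashes on login.", '{"category": "bug", "confidence": 0.90}'),
]

def build_messages(ticket: str) -> list[dict]:
    """Alternate user/assistant shots, then append the real query."""
    messages = []
    for user_text, assistant_json in FEW_SHOT:
        messages.append({"role": "user", "content": [{"text": user_text}]})
        messages.append({"role": "assistant", "content": [{"text": assistant_json}]})
    messages.append({"role": "user", "content": [{"text": ticket}]})
    return messages
```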
Retrieval‑Augmented Generation (RAG)
Ground responses in your data. Use a vector DB with embeddings, chunking strategies, and re‑ranking. Cache successful answers. Monitor context length & token costs.
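A minimal retrieval sketch, assuming Titan Text Embeddings on Bedrock and a toy in‑memory index (the model ID is illustrative; a real deployment would store vectors in OpenSearch or a dedicated vector DB):

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with Titan Text Embeddings (verify the model ID in your Region)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank pre-chunked documents by cosine similarity (toy in-memory index)."""
    q = embed(query)
    scored = []
    for chunk in chunks:
        v = embed(chunk)  # in production, embed once at ingest and store
        scored.append((float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]
```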
Fine‑Tuning & Adapters
When outputs must match a domain style or task. Prefer parameter‑efficient methods (LoRA/QLoRA) to reduce cost. Keep a strong eval set; compare against a RAG baseline.
Agents & Tool Use
Have the model call functions/skills (Lambda, Step Functions) for deterministic operations or data fetches. Keep tools idempotent and observable; cap recursion to avoid runaway chains.
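A sketch of a capped tool‑dispatch loop; `call_model`, the reply shape, and the tool registry are placeholders for whatever serving stack you use:

```python
MAX_ROUNDS = 5  # hard cap so a confused model cannot loop forever

# Tool registry: keep each tool idempotent and log every call (names invented).
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def run_agent(call_model, user_msg: str) -> str:
    """`call_model` is a placeholder returning either
    {"answer": "..."} or {"tool": "name", "args": {...}}."""
    history = [{"role": "user", "text": user_msg}]
    for _ in range(MAX_ROUNDS):
        reply = call_model(history)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # observable, deterministic step
        history.append({"role": "tool", "text": str(result)})
    return "Tool budget exhausted; escalating to a human."
```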
Serving patterns
- Real‑time (streaming): WebSockets or Server‑Sent Events for low‑latency UX; scale with provisioned concurrency or autoscaling policies.
- Async/batch: queue requests in SQS, process via Lambda/ECS, and write results to S3; great for long‑running jobs and cost control (see the sketch after this list).
- Multi‑region: Put retrieval data close to users; replicate vector indexes; route via Route 53 or Global Accelerator.
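A sketch of the async/batch pattern with boto3 (queue URL and bucket name are placeholders):

```python
import json
import uuid

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"  # placeholder
BUCKET = "my-llm-results"  # placeholder

def submit(prompt: str) -> str:
    """Enqueue a generation job; the job ID doubles as the S3 result key."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "prompt": prompt}),
    )
    return job_id

def drain(generate) -> None:
    """One poll cycle for a Lambda/ECS worker; `generate` wraps your model call."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        s3.put_object(
            Bucket=BUCKET,
            Key=f"results/{job['job_id']}.json",
            Body=json.dumps({"output": generate(job["prompt"])}),
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```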
Model Landscape on AWS
On AWS you can access a range of proprietary and open models via managed endpoints or by self‑hosting. The options evolve frequently, so treat this as a map, not an exhaustive list.
Via Managed Services
- General LLMs: families such as Claude, Llama, and Amazon Titan for chat, reasoning, and coding.
- Specialized: text embedding models, re‑rankers, code assistants, and image generation (e.g., SDXL‑class).
- Pros: simple ops, consistent auth/billing, safety tooling, rapid iteration.
- Cons: limited low‑level control, model list is curated, egress/latency considerations.
Self‑Hosted (SageMaker/EKS)
- Open models: Llama‑family, Mistral, Mixtral, Phi, Qwen, etc., served with engines such as vLLM or TensorRT‑LLM, optionally quantized.
- Pros: full control over latency, tokenization, scheduling, and custom kernels.
- Cons: you manage scaling, upgrades, GPU scheduling, and incident response.
| Task | Model Examples | Notes |
|---|---|---|
| General Chat/Reasoning | Claude / Llama / Titan‑class | Prioritize safety, long context, and tool‑use quality. |
| Code Generation | Code‑tuned LLMs (e.g., Code Llama) | Guard for secrets; enable repo‑aware RAG. |
| Embeddings | Titan Embeddings / Cohere Embed / open‑source | Balance dimensionality vs. latency/cost. |
| Re‑ranking | Cross‑encoder rerankers | Boost retrieval precision for RAG. |
| Vision / Multimodal | MM LLMs or SDXL‑class diffusion | Check input size limits and pricing tiers. |
Model availability varies by region and over time. Verify in your target AWS Region before committing.
Strategies for Model Selection
Picking a model is a product and systems decision. Use a requirements‑first approach; then iterate with evaluations.
- Define constraints: target latency, throughput, budget, compliance, data residency, languages, and safety posture.
- Specify tasks: classify, extract, summarize, generate, multi‑turn chat, tool‑calling, code, vision, or multi‑modal.
- Start with baselines: a strong managed model (for speed) and a cost‑efficient open model (for control).
- Evaluate: create a golden dataset; measure quality, latency, cost, and safety; track regressions with CI (see the harness sketch after this list).
- Pilot & guardrail: roll out behind feature flags, rate‑limit, and add circuit breakers & fallbacks.
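A minimal evaluation‑harness sketch; the golden pairs, the substring‑match metric, and the pricing constant are all illustrative:

```python
import time

# Golden pairs are illustrative; real sets should cover edge cases and safety probes.
GOLDEN = [
    ("What is our refund window?", "30 days"),
    ("Do we ship to Canada?", "yes"),
]

def evaluate(generate, price_per_1k_tokens: float = 0.004) -> dict:
    """Score a model callable on quality, rough p95 latency, and estimated cost."""
    hits, latencies, tokens = 0, [], 0
    for prompt, expected in GOLDEN:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += len(output) // 4  # crude estimate; use a real tokenizer for billing math
        hits += int(expected.lower() in output.lower())  # substitute your task metric
    latencies.sort()
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "quality": hits / len(GOLDEN),
        "p95_latency_s": p95,
        "est_cost_usd": tokens / 1000 * price_per_1k_tokens,
    }
```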
Decision Hints
- Strict compliance or minimal ops? Prefer managed endpoints.
- Hard latency SLO (e.g., <200 ms time‑to‑first‑token)? Self‑host with optimized serving and caching.
- Spiky load? Use autoscaling and async queues; consider serverless endpoints.
- Highly domain‑specific? Try RAG first; only fine‑tune if needed.
- Tight budget? Use smaller models with good prompts + RAG; enable request & token caching.
Evaluation Matrix (example)
| Criterion | Weight | Model A | Model B |
|---|---|---|---|
| Quality (task score) | 40% | 8.7 | 8.3 |
| Latency (p95) | 20% | 320 ms | 210 ms |
| Cost ($/1k tokens) | 20% | 0.004 | 0.002 |
| Safety & policy fit | 10% | Pass | Pass* |
| Operability | 10% | Managed | Self‑hosted |
Swap in your own metrics (hallucination rate, function‑calling success, multilingual accuracy).
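A weighted sum turns the matrix into one number; a tiny sketch, assuming each criterion is normalized to 0‑1 with higher as better:

```python
# Weights mirror the example matrix; scores must be normalized to 0-1, higher = better.
WEIGHTS = {"quality": 0.4, "latency": 0.2, "cost": 0.2, "safety": 0.1, "operability": 0.1}

def weighted_score(scores: dict[str, float]) -> float:
    """Simple weighted sum; invert latency/cost during normalization first."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Model A from the matrix, roughly normalized (values illustrative).
print(weighted_score({"quality": 0.87, "latency": 0.6, "cost": 0.5,
                      "safety": 1.0, "operability": 0.9}))
```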
Prompt & policy contracts
Define structured prompts and response schemas. Validate JSON in the gateway. Enforce content policies with pre/post‑filters, and log prompt/response hashes for auditing.
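A gateway‑side sketch using the `jsonschema` package, reusing the contract from Useful Snippets below and logging SHA‑256 hashes for audit:

```python
import hashlib
import json

from jsonschema import validate  # raises jsonschema.ValidationError on mismatch

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "citations": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer"],
}

def enforce_contract(prompt: str, raw_response: str) -> dict:
    """Reject malformed model output and log prompt/response hashes for audit."""
    payload = json.loads(raw_response)                 # raises on invalid JSON
    validate(instance=payload, schema=RESPONSE_SCHEMA)
    audit = {
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(raw_response.encode()).hexdigest(),
    }
    print(json.dumps(audit))  # route to your central log pipeline instead
    return payload
```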
Security, Compliance & Guardrails
- Data boundaries: Keep traffic in‑VPC via PrivateLink where possible; avoid cross‑region PII transfer.
- Encryption: TLS in transit; KMS keys for S3, EBS, and secrets; rotate regularly.
- Least privilege IAM: Separate roles for retrieval, inference, and orchestration; short‑lived credentials (example policy after this list).
- Safety layers: Input/output filtering for PII, toxicity, policy; human‑in‑the‑loop for sensitive actions.
- Auditability: Centralize logs (CloudWatch/Firehose to S3); enable CloudTrail; hash prompts/responses.
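As an illustration of scoping, an inference role might carry only a policy like this (Region and model ARN are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    }
  ]
}
```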
Observability, Reliability & Cost
Metrics to Track
- Latency: time‑to‑first‑token, p95 end‑to‑end (see the CloudWatch sketch after this list).
- Quality: task success, hallucination rate, function call accuracy.
- Cost: tokens/request, cache hit rate, GPU utilization.
- Safety: blocked rate, policy violations, escalation volume.
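A sketch of publishing the latency metrics via CloudWatch custom metrics (namespace and dimension names are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_latency(ttft_ms: float, e2e_ms: float, model_id: str) -> None:
    """Publish time-to-first-token and end-to-end latency as custom metrics."""
    cloudwatch.put_metric_data(
        Namespace="LLM/Serving",  # namespace and dimension names are illustrative
        MetricData=[
            {"MetricName": "TimeToFirstToken", "Value": ttft_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
            {"MetricName": "EndToEndLatency", "Value": e2e_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
        ],
    )
```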
Reliability Patterns
- Timeouts with retries and jitter; idempotency keys; circuit breakers.
- Fallbacks: smaller/cheaper model or cached answers when SLOs are threatened (sketched after this list).
- Canary & A/B evals tied to automated scorecards; roll back on regression.
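A retry‑with‑jitter plus fallback sketch; `primary` and `fallback` stand in for your model clients:

```python
import random
import time

def with_fallback(primary, fallback, attempts: int = 3, base_delay: float = 0.5):
    """Retry `primary` with exponential backoff and full jitter, then fall back.

    `primary` and `fallback` are zero-argument callables wrapping your model
    clients; pair this with per-call timeouts and idempotency keys upstream."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))  # full jitter
    return fallback()  # cheaper model or cached answer keeps the SLO intact
```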
Cost Controls
- Request shaping (truncate context, compress history; sketched after this list); semantic cache.
- Right‑size instances; schedule downscaling; spot where appropriate.
- Prefer PEFT over full fine‑tunes; distill to smaller models for prod.
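A request‑shaping sketch that keeps the newest turns within a token budget (the 4‑chars‑per‑token estimate is a rough heuristic):

```python
def shape_history(turns: list[str], token_budget: int = 2000) -> list[str]:
    """Keep the newest turns that fit the budget (≈4 chars/token heuristic)."""
    kept, used = [], 0
    for turn in reversed(turns):       # newest turns carry the most signal
        cost = max(1, len(turn) // 4)  # swap in a real tokenizer for accuracy
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```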
LLM Delivery Checklist
Architecture
- Region selected; data residency confirmed.
- Network plan (VPC, endpoints, PrivateLink) documented.
- Scaling policy and concurrency budgets defined.
Safety
- Input/output filtering policies implemented and tested.
- Abuse monitoring, rate limits, and user reporting in place.
Evaluation
- Golden dataset with acceptance criteria.
- Automated regression checks tied to CI/CD.
- Human review for high‑risk scenarios.
Operations
- Runbooks, alerts, dashboards, and on‑call rotation established.
- Cost alarms and monthly budget reviews.
- Data retention & deletion policies configured.
Useful Snippets
Example: Gateway contract for JSON output
```json
{
  "type": "object",
  "properties": {
    "answer": {"type": "string"},
    "citations": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["answer"]
}
```
Example: IaC sketch (AWS CDK in TypeScript; statements assumed inside a Stack constructor)
```ts
import * as apigw from "aws-cdk-lib/aws-apigateway";
import * as iam from "aws-cdk-lib/aws-iam";
import * as lambda from "aws-cdk-lib/aws-lambda";

// API Gateway → Lambda → Bedrock + vector DB
const fn = new lambda.Function(this, "ChatFn", {
  runtime: lambda.Runtime.PYTHON_3_12,
  handler: "app.handler",
  code: lambda.Code.fromAsset("lambda"),
});
new apigw.RestApi(this, "ChatApi")
  .root.addResource("chat").addMethod("POST", new apigw.LambdaIntegration(fn));
// Permissions: Bedrock invocation (scope resources down in production),
// plus read/write grants on your vector store of choice
fn.addToRolePolicy(new iam.PolicyStatement({
  actions: ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
  resources: ["*"],
}));
// Observability & budgets: add cloudwatch.Dashboard and budgets.CfnBudget here
```