AWS Infrastructure for LLM Workloads
There are three broad ways to run LLMs on AWS: (1) use a fully‑managed foundation‑model service, (2) host models yourself, or (3) mix both. Your choice depends on latency, compliance, customization, and cost constraints.
Managed FM access (Amazon Bedrock)
Amazon Bedrock offers managed API access to multiple foundation models, with unified auth and billing, serverless inference, and tooling for guardrails, evaluation, and agents. It is a strong fit for fast prototyping and for production workloads where you would rather not manage GPUs.
- Single API to invoke and evaluate different models.
- Built‑in features for safety (guardrails), prompt management, knowledge bases (RAG), and orchestration.
- Fine‑tuning options for supported models; data stays within your AWS account boundary.
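For instance, a minimal chat call through Bedrock's Converse API might look like this Python sketch (the model ID and Region are placeholders; verify availability in your account first):

```python
import boto3

# Bedrock runtime client; the Region must offer the chosen model.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID is illustrative -- check availability in your target Region.
response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Summarize our refund policy in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```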
Self‑hosting (Amazon SageMaker / EKS / EC2)
Bring your own model or container. Full control over runtime, networking, and scaling.
- SageMaker for training, hosting, multi‑model endpoints, async & real‑time inference, Model Registry & Pipelines.
- EKS for Kubernetes‑native serving stacks (e.g., vLLM, TensorRT‑LLM) with auto‑scaling and custom routing.
- EC2 for bespoke setups; choose GPU instances (e.g., the P4/P5 and G families) or purpose‑built accelerators such as AWS Inferentia (Inf2).
Hybrid (best of both)
Use managed models for general tasks and self‑host for specialized or data‑sensitive workloads. Route traffic via an API gateway or service mesh and evaluate continuously.
Core supporting services:
- Networking & identity: VPC, PrivateLink, IAM.
- Security & audit: KMS, Secrets Manager, CloudWatch/CloudTrail.
- Storage & retrieval: S3/EFS for artifacts and vector stores; DynamoDB/OpenSearch for metadata and retrieval.
- APIs & orchestration: API Gateway + Lambda or ECS for edge APIs; Step Functions for orchestration.
Common Approaches & Reference Patterns
Zero‑/Few‑Shot Prompting
Fastest path to value: engineer prompts, constrain outputs with JSON schemas, and evaluate with automatic metrics plus human review. Pair with Bedrock guardrails or custom moderation.
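A sketch of the idea in Python: a system instruction that pins the output to JSON, plus two invented few‑shot examples, assembled in the message format the Bedrock Converse API expects:

```python
SYSTEM = (
    "You are a support-ticket classifier. "
    'Reply with JSON only: {"category": "...", "confidence": 0.0-1.0}'
)  # pass via the Converse API's separate `system` parameter

# Invented few-shot pairs; keep them short to save tokens.
FEW_SHOT = [
    ("My card was charged twice.", '{"category": "billing", "confidence": 0.95}'),
    ("The app crashes on login.", '{"category": "bug", "confidence": 0.90}'),
]

def build_messages(ticket: str) -> list[dict]:
    """Alternate user/assistant shots, then append the real query."""
    messages = []
    for user_text, assistant_json in FEW_SHOT:
        messages.append({"role": "user", "content": [{"text": user_text}]})
        messages.append({"role": "assistant", "content": [{"text": assistant_json}]})
    messages.append({"role": "user", "content": [{"text": ticket}]})
    return messages
```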
Retrieval‑Augmented Generation (RAG)
Ground responses in your data. Use a vector DB with embeddings, chunking strategies, and re‑ranking. Cache successful answers. Monitor context length & token costs.
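A minimal retrieval sketch, assuming Titan Text Embeddings on Bedrock and a toy in‑memory index (the model ID is illustrative; a real deployment would store vectors in OpenSearch or a dedicated vector DB):

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with Titan Text Embeddings (verify the model ID in your Region)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank pre-chunked documents by cosine similarity (toy in-memory index)."""
    q = embed(query)
    scored = []
    for chunk in chunks:
        v = embed(chunk)  # in production, embed once at ingest and store
        scored.append((float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]
```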
Fine‑Tuning & Adapters
When outputs must match a domain style or task. Prefer parameter‑efficient methods (LoRA/QLoRA) to reduce cost. Keep a strong eval set; compare against a RAG baseline.
Agents & Tool Use
Have the model call functions/skills (Lambda, Step Functions) for deterministic operations or data fetches. Keep tools idempotent and observable; cap recursion to avoid runaway chains.
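A sketch of a capped tool‑dispatch loop; `call_model`, the reply shape, and the tool registry are placeholders for whatever serving stack you use:

```python
MAX_ROUNDS = 5  # hard cap so a confused model cannot loop forever

# Tool registry: keep each tool idempotent and log every call (names invented).
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def run_agent(call_model, user_msg: str) -> str:
    """`call_model` is a placeholder returning either
    {"answer": "..."} or {"tool": "name", "args": {...}}."""
    history = [{"role": "user", "text": user_msg}]
    for _ in range(MAX_ROUNDS):
        reply = call_model(history)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # observable, deterministic step
        history.append({"role": "tool", "text": str(result)})
    return "Tool budget exhausted; escalating to a human."
```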
Serving patterns
- Real‑time (streaming): WebSockets or Server‑Sent Events for low‑latency UX; scale with provisioned concurrency or autoscaling policies.
- Async/batch: queue requests in SQS, process via Lambda/ECS, and write results to S3; great for long‑running jobs and cost control (see the sketch after this list).
- Multi‑region: Put retrieval data close to users; replicate vector indexes; route via Route 53 or Global Accelerator.
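A sketch of the async/batch pattern with boto3 (queue URL and bucket name are placeholders):

```python
import json
import uuid

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"  # placeholder
BUCKET = "my-llm-results"  # placeholder

def submit(prompt: str) -> str:
    """Enqueue a generation job; the job ID doubles as the S3 result key."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "prompt": prompt}),
    )
    return job_id

def drain(generate) -> None:
    """One poll cycle for a Lambda/ECS worker; `generate` wraps your model call."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        s3.put_object(
            Bucket=BUCKET,
            Key=f"results/{job['job_id']}.json",
            Body=json.dumps({"output": generate(job["prompt"])}),
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```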
Model Landscape on AWS
On AWS you can access a range of proprietary and open models via managed endpoints or by self‑hosting. The options evolve frequently, so treat this as a map, not an exhaustive list.
Via Managed Services
- General LLMs: families such as Claude, Llama, and Amazon Titan for chat, reasoning, and coding.
- Specialized: text embedding models, re‑rankers, code assistants, and image generation (e.g., SDXL‑class).
- Pros: simple ops, consistent auth/billing, safety tooling, rapid iteration.
- Cons: limited low‑level control, model list is curated, egress/latency considerations.
Self‑Hosted (SageMaker/EKS)
- Open models: Llama‑family, Mistral, Mixtral, Phi, Qwen, etc., served with engines such as vLLM or TensorRT‑LLM, optionally quantized.
- Pros: full control over latency, tokenization, scheduling, and custom kernels.
- Cons: you manage scaling, upgrades, GPU scheduling, and incident response.
| Task | Model Examples | Notes |
|---|---|---|
| General Chat/Reasoning | Claude / Llama / Titan‑class | Prioritize safety, long context, and tool‑use quality. |
| Code Generation | Code‑tuned LLMs (e.g., Code Llama) | Guard for secrets; enable repo‑aware RAG. |
| Embeddings | Titan Embeddings / Cohere Embed / open‑source | Balance dimensionality vs. latency/cost. |
| Re‑ranking | Cross‑encoder rerankers | Boost retrieval precision for RAG. |
| Vision / Multimodal | MM LLMs or SDXL‑class diffusion | Check input size limits and pricing tiers. |
Model availability varies by region and over time. Verify in your target AWS Region before committing.
Strategies for Model Selection
Picking a model is a product and systems decision. Use a requirements‑first approach; then iterate with evaluations.
- Define constraints: target latency, throughput, budget, compliance, data residency, languages, and safety posture.
- Specify tasks: classify, extract, summarize, generate, multi‑turn chat, tool‑calling, code, vision, or multi‑modal.
- Start with baselines: a strong managed model (for speed) and a cost‑efficient open model (for control).
- Evaluate: create a golden dataset; measure quality, latency, cost, and safety; track regressions with CI (see the harness sketch after this list).
- Pilot & guardrail: roll out behind feature flags, rate‑limit, and add circuit breakers & fallbacks.
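A minimal evaluation‑harness sketch; the golden pairs, the substring‑match metric, and the pricing constant are all illustrative:

```python
import time

# Golden pairs are illustrative; real sets should cover edge cases and safety probes.
GOLDEN = [
    ("What is our refund window?", "30 days"),
    ("Do we ship to Canada?", "yes"),
]

def evaluate(generate, price_per_1k_tokens: float = 0.004) -> dict:
    """Score a model callable on quality, rough p95 latency, and estimated cost."""
    hits, latencies, tokens = 0, [], 0
    for prompt, expected in GOLDEN:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += len(output) // 4  # crude estimate; use a real tokenizer for billing math
        hits += int(expected.lower() in output.lower())  # substitute your task metric
    latencies.sort()
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "quality": hits / len(GOLDEN),
        "p95_latency_s": p95,
        "est_cost_usd": tokens / 1000 * price_per_1k_tokens,
    }
```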
Decision Hints
- Strict compliance or minimal ops? Prefer managed endpoints.
- Hard latency SLO (e.g., <200 ms time‑to‑first‑token)? Self‑host with optimized serving and caching.
- Spiky load? Use autoscaling and async queues; consider serverless endpoints.
- Highly domain‑specific? Try RAG first; only fine‑tune if needed.
- Tight budget? Use smaller models with good prompts + RAG; enable request & token caching.
Evaluation Matrix (example)
| Criterion | Weight | Model A | Model B |
|---|---|---|---|
| Quality (task score) | 40% | 8.7 | 8.3 |
| Latency (p95) | 20% | 320 ms | 210 ms |
| Cost ($/1k tokens) | 20% | 0.004 | 0.002 |
| Safety & policy fit | 10% | Pass | Pass* |
| Operability | 10% | Managed | Self‑hosted |
Swap in your own metrics (hallucination rate, function‑calling success, multilingual accuracy).
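A weighted sum turns the matrix into one number; a tiny sketch, assuming each criterion is normalized to 0‑1 with higher as better:

```python
# Weights mirror the example matrix; scores must be normalized to 0-1, higher = better.
WEIGHTS = {"quality": 0.4, "latency": 0.2, "cost": 0.2, "safety": 0.1, "operability": 0.1}

def weighted_score(scores: dict[str, float]) -> float:
    """Simple weighted sum; invert latency/cost during normalization first."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Model A from the matrix, roughly normalized (values illustrative).
print(weighted_score({"quality": 0.87, "latency": 0.6, "cost": 0.5,
                      "safety": 1.0, "operability": 0.9}))
```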
Prompt & policy contracts
Define structured prompts and response schemas. Validate JSON in the gateway. Enforce content policies with pre/post‑filters, and log prompt/response hashes for auditing.
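A gateway‑side sketch using the `jsonschema` package, reusing the contract from Useful Snippets below and logging SHA‑256 hashes for audit:

```python
import hashlib
import json

from jsonschema import validate  # raises jsonschema.ValidationError on mismatch

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "citations": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer"],
}

def enforce_contract(prompt: str, raw_response: str) -> dict:
    """Reject malformed model output and log prompt/response hashes for audit."""
    payload = json.loads(raw_response)                 # raises on invalid JSON
    validate(instance=payload, schema=RESPONSE_SCHEMA)
    audit = {
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(raw_response.encode()).hexdigest(),
    }
    print(json.dumps(audit))  # route to your central log pipeline instead
    return payload
```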
Security, Compliance & Guardrails
- Data boundaries: Keep traffic in‑VPC via PrivateLink where possible; avoid cross‑region PII transfer.
- Encryption: TLS in transit; KMS keys for S3, EBS, and secrets; rotate regularly.
- Least privilege IAM: Separate roles for retrieval, inference, and orchestration; short‑lived credentials (example policy after this list).
- Safety layers: Input/output filtering for PII, toxicity, policy; human‑in‑the‑loop for sensitive actions.
- Auditability: Centralize logs (CloudWatch/Firehose to S3); enable CloudTrail; hash prompts/responses.
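As an illustration of scoping, an inference role might carry only a policy like this (Region and model ARN are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    }
  ]
}
```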
Observability, Reliability & Cost
Metrics to Track
- Latency: time‑to‑first‑token, p95 end‑to‑end (see the CloudWatch sketch after this list).
- Quality: task success, hallucination rate, function call accuracy.
- Cost: tokens/request, cache hit rate, GPU utilization.
- Safety: blocked rate, policy violations, escalation volume.
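A sketch of publishing the latency metrics via CloudWatch custom metrics (namespace and dimension names are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_latency(ttft_ms: float, e2e_ms: float, model_id: str) -> None:
    """Publish time-to-first-token and end-to-end latency as custom metrics."""
    cloudwatch.put_metric_data(
        Namespace="LLM/Serving",  # namespace and dimension names are illustrative
        MetricData=[
            {"MetricName": "TimeToFirstToken", "Value": ttft_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
            {"MetricName": "EndToEndLatency", "Value": e2e_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
        ],
    )
```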
Reliability Patterns
- Timeouts with retries and jitter; idempotency keys; circuit breakers.
- Fallbacks: smaller/cheaper model or cached answers when SLOs are threatened (sketched after this list).
- Canary & A/B evals tied to automated scorecards; roll back on regression.
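A retry‑with‑jitter plus fallback sketch; `primary` and `fallback` stand in for your model clients:

```python
import random
import time

def with_fallback(primary, fallback, attempts: int = 3, base_delay: float = 0.5):
    """Retry `primary` with exponential backoff and full jitter, then fall back.

    `primary` and `fallback` are zero-argument callables wrapping your model
    clients; pair this with per-call timeouts and idempotency keys upstream."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))  # full jitter
    return fallback()  # cheaper model or cached answer keeps the SLO intact
```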
Cost Controls
- Request shaping (truncate context, compress history; sketched after this list); semantic cache.
- Right‑size instances; schedule downscaling; spot where appropriate.
- Prefer PEFT over full fine‑tunes; distill to smaller models for prod.
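A request‑shaping sketch that keeps the newest turns within a token budget (the 4‑chars‑per‑token estimate is a rough heuristic):

```python
def shape_history(turns: list[str], token_budget: int = 2000) -> list[str]:
    """Keep the newest turns that fit the budget (≈4 chars/token heuristic)."""
    kept, used = [], 0
    for turn in reversed(turns):       # newest turns carry the most signal
        cost = max(1, len(turn) // 4)  # swap in a real tokenizer for accuracy
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```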
LLM Delivery Checklist
Architecture
- Region selected; data residency confirmed.
- Network plan (VPC, endpoints, PrivateLink) documented.
- Scaling policy and concurrency budgets defined.
Safety
- Input/output filtering policies implemented and tested.
- Abuse monitoring, rate limits, and user reporting in place.
Evaluation
- Golden dataset with acceptance criteria.
- Automated regression checks tied to CI/CD.
- Human review for high‑risk scenarios.
Operations
- Runbooks, alerts, dashboards, and on‑call rotation established.
- Cost alarms and monthly budget reviews.
- Data retention & deletion policies configured.
Useful Snippets
Example: Gateway contract for JSON output
```json
{
  "type": "object",
  "properties": {
    "answer": {"type": "string"},
    "citations": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["answer"]
}
```
Example: IaC sketch (AWS CDK in TypeScript; statements assumed inside a Stack constructor)
```ts
import * as apigw from "aws-cdk-lib/aws-apigateway";
import * as iam from "aws-cdk-lib/aws-iam";
import * as lambda from "aws-cdk-lib/aws-lambda";

// API Gateway → Lambda → Bedrock + vector DB
const fn = new lambda.Function(this, "ChatFn", {
  runtime: lambda.Runtime.PYTHON_3_12,
  handler: "app.handler",
  code: lambda.Code.fromAsset("lambda"),
});
new apigw.RestApi(this, "ChatApi")
  .root.addResource("chat").addMethod("POST", new apigw.LambdaIntegration(fn));
// Permissions: Bedrock invocation (scope resources down in production),
// plus read/write grants on your vector store of choice
fn.addToRolePolicy(new iam.PolicyStatement({
  actions: ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
  resources: ["*"],
}));
// Observability & budgets: add cloudwatch.Dashboard and budgets.CfnBudget here
```