A concise, practical tour through soft vs. hard, local vs. global, self/cross/multi‑head, efficient variants — and the silicon that makes it all run.
What is “attention”?
In deep learning, attention lets models focus on the most relevant parts of the input when producing an output, assigning dynamic weights to elements (tokens, regions, channels). Most modern systems use Q, K, and V projections to score and combine context.
Key idea: Compute similarity between a query and a set of keys, turn those scores into a distribution, and use it to mix the values.
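The whole mechanism fits in a few lines. A minimal sketch, assuming PyTorch; shapes and names are illustrative:

```python
# Scaled dot-product attention in its simplest form (illustrative sketch).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) projections of the input
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # query-key similarity
    weights = F.softmax(scores, dim=-1)            # scores -> distribution
    return weights @ v                             # mix the values
```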
A taxonomy of attention
1) Soft vs. Hard
Soft (differentiable): Weighted average over all inputs; trained by backprop; standard in Transformers.
Hard (non‑differentiable): Selects specific elements; trained via RL/sampling (e.g., REINFORCE); more efficient but trickier to train.
2) Global vs. Local
Global: Each token attends to every other token; O(n²) complexity.
Local: Attend within a window; complexity O(nw) with window size w.
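A minimal sketch of the windowed pattern, assuming PyTorch; a real implementation would avoid building the full n×n mask:

```python
# Sliding-window mask: token i may attend to j only if |i - j| <= w // 2.
import torch

def local_attention_mask(n, w):
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2

# Apply before softmax: scores.masked_fill(~mask, float('-inf'))
mask = local_attention_mask(n=8, w=3)
```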
3) Self‑, Cross‑, and Multi‑Head
Self‑attention: Q, K, V from the same sequence (intra‑sequence dependencies).
Cross‑attention: Queries from one sequence, keys/values from another (encoder‑decoder).
Multi‑head: Multiple heads learn different subspaces in parallel for stability and capacity.
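Head splitting is just a reshape; each head then runs the same attention in a smaller subspace. A minimal sketch, assuming PyTorch, with the learned projections omitted:

```python
# Split d_model into num_heads independent subspaces.
import torch

def split_heads(x, num_heads):
    # (batch, seq, d_model) -> (batch, num_heads, seq, d_model // num_heads)
    b, n, d = x.shape
    return x.view(b, n, num_heads, d // num_heads).transpose(1, 2)

q = torch.randn(2, 10, 64)
heads = split_heads(q, num_heads=8)  # 8 heads, each of dimension 8
```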
4) Scoring functions
Additive (Bahdanau): MLP‑based scoring.
Dot‑product (Luong): Simpler dot‑product score.
Scaled dot‑product: Dot‑product scaled by 1/√dₖ for stability (Transformer default).
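Side by side, for a single query/key pair (a hedged sketch; W_q, W_k, and v would be learned parameters in practice):

```python
import torch

d_k = 64
q, k = torch.randn(d_k), torch.randn(d_k)

# Additive (Bahdanau): a small MLP scores the pair.
W_q, W_k, v = torch.randn(d_k, d_k), torch.randn(d_k, d_k), torch.randn(d_k)
additive = v @ torch.tanh(W_q @ q + W_k @ k)

# Dot-product (Luong): raw inner product.
dot = q @ k

# Scaled dot-product (Transformer): keeps score variance near 1 as d_k grows.
scaled = dot / d_k ** 0.5
```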
Efficient / sparse / structured attention
These methods reduce the compute or memory cost of global attention, which is especially important for long sequences.
Sparse patterns: Attend to a subset (block/stride/random) — e.g., Sparse Transformer, BigBird.
Local + global tokens: Windows with a few global tokens — e.g., Longformer.
Low‑rank projection: Approximate the attention map — e.g., Linformer, Nyströmformer.
Kernelized / linear attention: Replace softmax with kernel features to get O(n) — e.g., Performer, Linear Transformer.
Memory / retrieval‑augmented: External memory, caches, or kNN retrieval to avoid quadratic compute.
IO-aware attention: FlashAttention-style kernels that minimize memory reads/writes.
Quantized attention: FP8/FP4/INT4 attention for inference efficiency.
Streaming / chunked attention: Process sequences incrementally for long contexts.
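To make one of these concrete: kernelized linear attention replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so keys and values are summarized once in O(n) instead of compared pairwise. A hedged sketch in the style of the Linear Transformer, using φ(x) = elu(x) + 1:

```python
# Non-causal linear attention: O(n) in sequence length (illustrative sketch).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, d_k); v: (batch, seq, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1                 # feature map phi
    kv = torch.einsum('bnd,bne->bde', k, v)           # sum_j phi(k_j) v_j^T
    z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1))   # softmax-style normalizer
    return torch.einsum('bnd,bde->bne', q, kv) / (z.unsqueeze(-1) + eps)
```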
Recent years have pushed attention mechanisms beyond their original Transformer formulation, focusing on efficiency, long context, and hardware co-design.
1) IO-aware & fused attention kernels
FlashAttention-2/3: Further reduce memory movement through better tiling, work partitioning, and asynchronous execution.
Key idea: Avoid materializing the full attention matrix in HBM; compute in SRAM tiles.
Impact: Enables training and inference at much larger sequence lengths.
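The trick rests on online softmax: the running max and denominator can be rescaled as new tiles arrive, so the full score row never exists at once. A hedged single-query sketch of that accumulation (real kernels tile queries too and run in SRAM):

```python
# Streaming softmax-attention for one query over tiles of keys/values.
import torch

def online_softmax_attention(q, k, v, tile=128):
    # q: (d,); k: (n, d); v: (n, d_v)
    m = torch.tensor(float('-inf'))  # running max of scores
    l = torch.tensor(0.0)            # running softmax denominator
    acc = torch.zeros(v.size(-1))    # running weighted sum of values
    for start in range(0, k.size(0), tile):
        s = k[start:start + tile] @ q / q.size(0) ** 0.5
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)            # rescale old statistics
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[start:start + tile]
        m = m_new
    return acc / l
```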
2) Ultra-low precision attention
FP8 → FP4 → INT4: Aggressive quantization for attention weights and activations.
KV-cache optimization: Reuse past attention states efficiently during inference.
Result: Smaller memory footprint and lower compute cost at inference time.
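A minimal sketch of the cache mechanics, assuming PyTorch; quantizing cache['k'] and cache['v'] (e.g., to INT4) is what the low-precision work above targets:

```python
# One autoregressive decode step: only the newest token's K/V are computed.
import torch
import torch.nn.functional as F

def decode_step(q_t, k_t, v_t, cache):
    # q_t, k_t, v_t: (batch, 1, d) projections for the newest token
    cache['k'] = torch.cat([cache['k'], k_t], dim=1)  # past keys, never recomputed
    cache['v'] = torch.cat([cache['v'], v_t], dim=1)
    scores = q_t @ cache['k'].transpose(-2, -1) / q_t.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ cache['v']

cache = {'k': torch.empty(1, 0, 64), 'v': torch.empty(1, 0, 64)}
```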
3) Mixture-of-Experts (MoE) + attention
Sparse activation: Only a subset of experts process each token.
Interaction: Attention routes information; MoE scales model capacity.
Outcome: Trillion-parameter models with manageable compute.
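The routing half is a small learned gate. A hedged sketch of top-k routing (the per-token loop is for clarity; real implementations batch tokens by expert):

```python
# Top-k expert routing: each token is processed by only k of the experts.
import torch
import torch.nn.functional as F

def route(x, router_weights, experts, k=2):
    # x: (tokens, d); router_weights: (d, num_experts); experts: list of callables
    logits = x @ router_weights
    gates, idx = logits.topk(k, dim=-1)   # each token picks its k experts
    gates = F.softmax(gates, dim=-1)      # renormalize over the chosen k
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for slot in range(k):
            out[t] += gates[t, slot] * experts[idx[t, slot]](x[t])
    return out
```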
Takeaway: Modern progress is less about changing the attention formula itself and more about how it’s computed, stored, and combined with other mechanisms.
Hardware implementations
Attention is dominated by matrix multiplications and memory traffic. Here’s how it lands on silicon:
General‑purpose accelerators
GPUs (NVIDIA, AMD): Highly optimized kernels (e.g., FlashAttention), Tensor Cores for QKᵀ and attention‑V.
TPUs (Google): Large matmul arrays (bfloat16), high‑bandwidth interconnects; attention maps well to their systolic cores.
Specialized AI chips / ASICs
Cerebras Wafer‑Scale Engine: Massive on‑chip memory keeps activations local; high utilization for attention.
Graphcore IPU / Groq / Tenstorrent / SambaNova: Architectures tuned for fine‑grained parallelism and fast attention kernels.
Research prototypes
Attention‑centric accelerators: Designs that compute QKᵀ and softmax in‑place to cut memory traffic.
Sparse‑attention hardware: Dataflows that skip zeros/unselected keys (e.g., dynamic sparsity like SpAtten).
IO‑aware algorithms: Methods such as FlashAttention reduce reads/writes and tile the sequence dimension for SRAM reuse.
Frontiers
Analog / in‑memory compute: Memristive crossbars for vector‑matrix ops (mixed‑signal attention blocks).
Optical compute: Photonic matmuls and optical softmax approximations (early‑stage).
Neuromorphic: Event‑driven selective attention inspired by spiking systems.
Low-precision compute: Native FP8/FP4/INT4 support in next-gen accelerators.
KV-cache optimized inference: Hardware designs focused on fast autoregressive decoding.
Attention-specific kernels: Vendor-optimized fused ops (e.g., FlashAttention in CUDA/ROCm).
Memory hierarchy co-design: Architectures increasingly shaped around attention bandwidth limits.
Rule of thumb: Performance hinges on bandwidth and SRAM reuse. Algorithmic tricks like tiling and fused kernels often beat raw FLOPs increases.
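A back-of-envelope check makes the point (illustrative numbers only): counting just the score-matrix traffic, naive attention does roughly d FLOPs per byte moved, well below what modern accelerators need to stay compute-bound:

```python
# Arithmetic intensity of naive attention for one head (rough estimate).
n, d, bytes_per = 4096, 64, 2        # sequence length, head dim, fp16
flops = 4 * n * n * d                # QK^T plus attention-V matmuls
traffic = 2 * n * n * bytes_per      # write + re-read the n x n score matrix
print(flops / traffic)               # ~64 FLOPs/byte: bandwidth-bound on most GPUs
```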
Despite its O(n²) complexity, well‑optimized exact attention often outperforms linear variants in wall‑clock time.
Why? Performance is limited by memory bandwidth, not FLOPs.
IO-aware kernels like FlashAttention minimize memory movement, making them faster than theoretically efficient alternatives.
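In practice you rarely write these kernels yourself. In PyTorch 2.x, for example, F.scaled_dot_product_attention dispatches to fused IO-aware backends (FlashAttention among them) when shapes, dtypes, and hardware allow:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim); fp16 on GPU makes fused backends eligible
q = k = v = torch.randn(1, 8, 4096, 64, device='cuda', dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```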
When should you use which attention?
Standard attention: Best default; highly optimized on GPUs.
FlashAttention: Use for long sequences (training and inference).
Linear attention: Useful for extreme sequence lengths but often slower in practice.
Sparse attention: Good when structure is known (e.g., long documents).