Attention Mechanisms: Types & Hardware Implementations

A concise, practical tour through soft vs. hard, local vs. global, self/cross/multi‑head, efficient variants — and the silicon that makes it all run.

What is “attention”?

In deep learning, attention lets models focus on the most relevant parts of the input when producing an output, assigning dynamic weights to elements (tokens, regions, channels). Most modern systems use query (Q), key (K), and value (V) projections to score and combine context.

Key idea: Compute similarity between a query and a set of keys, turn those scores into a distribution, and use it to mix the values.
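As a concrete illustration, here is a minimal NumPy sketch of that key idea (scaled dot-product attention); shapes and random inputs are placeholders, not a production implementation.

```python
# Minimal scaled dot-product attention sketch (illustrative only).
# Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity between queries and keys
    weights = softmax(scores, axis=-1)        # turn scores into a distribution
    return weights @ V                        # use it to mix the values

Q = np.random.randn(4, 8)    # 4 queries
K = np.random.randn(6, 8)    # 6 keys
V = np.random.randn(6, 16)   # 6 values
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 16)
```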

A taxonomy of attention

1) Soft vs. Hard

  • Soft (differentiable): Weighted average over all inputs; trained by backprop; standard in Transformers.
  • Hard (non‑differentiable): Selects specific elements; trained via RL/sampling (e.g., REINFORCE); more efficient but trickier to train.
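A toy sketch of the difference, using the same scores for both branches; the hard branch samples a single element, which is why gradient-free training tricks such as REINFORCE are needed in practice.

```python
# Soft vs. hard attention over the same scores (illustrative sketch).
import numpy as np

scores = np.array([1.2, 0.3, -0.5, 2.0])
values = np.random.randn(4, 8)

# Soft: differentiable weighted average over all values
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_out = weights @ values

# Hard: sample (or argmax) a single value; the sampling step breaks differentiability
idx = np.random.choice(len(scores), p=weights)
hard_out = values[idx]
```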

2) Global vs. Local

  • Global: Each token attends to every other token; O(n²) complexity.
  • Local: Attend within a window; complexity O(nw) with window size w.
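A small sketch of the local case, assuming a symmetric sliding window of size w on either side of each token, which is what brings the cost down to O(nw).

```python
# Sliding-window (local) attention mask sketch: token i may only attend
# to tokens within +/- w positions.
import numpy as np

def local_mask(n, w):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w   # True where attention is allowed

n, w = 8, 2
mask = local_mask(n, w)
scores = np.random.randn(n, n)
scores = np.where(mask, scores, -np.inf)              # disallowed positions get -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # each row attends only inside its window
```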

3) Self‑, Cross‑, and Multi‑Head

  • Self‑attention: Q, K, V from the same sequence (intra‑sequence dependencies).
  • Cross‑attention: Queries from one sequence, keys/values from another (encoder‑decoder).
  • Multi‑head: Multiple heads learn different subspaces in parallel for stability and capacity.
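A compact NumPy sketch covering both the self- and cross-attention cases with multiple heads; the projection matrices are random stand-ins for learned weights, and the output projection is omitted for brevity.

```python
# Multi-head attention sketch: the same scaled dot-product routine run over
# h independent head subspaces, then concatenated.
import numpy as np

def mha(x, context, W_q, W_k, W_v, n_heads):
    n, d = x.shape
    d_h = d // n_heads
    Q = (x @ W_q).reshape(n, n_heads, d_h).transpose(1, 0, 2)        # (h, n, d_h)
    K = (context @ W_k).reshape(-1, n_heads, d_h).transpose(1, 0, 2)
    V = (context @ W_v).reshape(-1, n_heads, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ V).transpose(1, 0, 2).reshape(n, d)                  # concatenate heads

d, h = 16, 4
x = np.random.randn(6, d)
W = [np.random.randn(d, d) * 0.1 for _ in range(3)]
self_attn  = mha(x, x, *W, n_heads=h)     # self-attention: Q, K, V from the same sequence
ctx = np.random.randn(9, d)
cross_attn = mha(x, ctx, *W, n_heads=h)   # cross-attention: Q from x, K/V from ctx
```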

4) Scoring functions

  • Additive (Bahdanau): MLP‑based scoring.
  • Dot‑product (Luong): Simpler dot‑product score.
  • Scaled dot‑product: Dot‑product scaled by 1/√d_k for stability (Transformer default).
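For reference, hedged one-liners for each scoring function applied to a single query q and key k; the weight matrices and vector are random placeholders for learned parameters.

```python
# The three classic attention scoring functions (illustrative sketch).
import numpy as np

d_k = 8
q, k = np.random.randn(d_k), np.random.randn(d_k)

# Additive (Bahdanau): a small MLP over the projected query and key
W1, W2, v = np.random.randn(d_k, d_k), np.random.randn(d_k, d_k), np.random.randn(d_k)
additive_score = v @ np.tanh(W1 @ q + W2 @ k)

# Dot-product (Luong)
dot_score = q @ k

# Scaled dot-product (Transformer default)
scaled_score = q @ k / np.sqrt(d_k)
```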

Efficient / sparse / structured attention

These methods reduce the compute or memory cost of global attention, which matters most for long sequences; representative examples include sparse patterns (Longformer, BigBird), low‑rank projections (Linformer), and kernel approximations (Performer).
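As one illustrative member of this family, here is a Linformer-style low-rank sketch in which keys and values are projected from sequence length n down to k landmark positions; the projection matrix here is a random stand-in for the learned one.

```python
# Low-rank (Linformer-style) attention sketch: attention costs O(n * k)
# instead of O(n^2) because K and V are compressed along the sequence axis.
import numpy as np

n, d, k = 1024, 64, 128
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

E = np.random.randn(k, n) / np.sqrt(n)   # learned projection in the real method
K_proj, V_proj = E @ K, E @ V            # (k, d) instead of (n, d)

scores = Q @ K_proj.T / np.sqrt(d)       # (n, k) score matrix, not (n, n)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V_proj                   # (n, d)
```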

Modality‑specific variants

Vision

  • Spatial attention: Focus on image regions.
  • Channel attention: Reweight feature channels (e.g., SENet, CBAM).
  • Vision Transformers (ViT): Self‑attention over patch tokens.
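To make the channel-attention bullet above concrete, here is a hedged SE-style sketch: global-average-pool each channel, run a small bottleneck MLP, and reweight the channels with sigmoid gates. The weights are random stand-ins for learned parameters.

```python
# Squeeze-and-excitation style channel attention (illustrative sketch).
import numpy as np

def se_block(x, reduction=4):
    # x: feature map of shape (C, H, W)
    C = x.shape[0]
    squeeze = x.mean(axis=(1, 2))                 # (C,) global average pool
    W1 = np.random.randn(C // reduction, C) * 0.1
    W2 = np.random.randn(C, C // reduction) * 0.1
    hidden = np.maximum(0, W1 @ squeeze)          # ReLU bottleneck
    gates = 1 / (1 + np.exp(-(W2 @ hidden)))      # sigmoid channel weights
    return x * gates[:, None, None]               # reweight the channels

x = np.random.randn(16, 8, 8)
y = se_block(x)
```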

Graphs & Multimodal

  • Graph Attention Networks (GAT): Attention over neighbors.
  • Cross‑modal attention: Bridge text, vision, audio streams.
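A toy sketch of attention restricted to graph neighbors, in the spirit of the GAT bullet above; note it uses a plain dot-product score rather than GAT's learned LeakyReLU scoring over concatenated projected features, purely to keep the example short.

```python
# Neighbor-masked attention on a toy graph (illustrative sketch, not full GAT).
import numpy as np

n, d = 5, 8
H = np.random.randn(n, d)                     # node features
A = np.eye(n, dtype=bool)                     # self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = True  # toy undirected edges

scores = H @ H.T / np.sqrt(d)
scores = np.where(A, scores, -np.inf)         # only neighbors can be attended to
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
H_new = weights @ H                           # aggregate neighbor features
```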

Recent advances

The past few years have pushed attention mechanisms beyond the original Transformer formulation, with a focus on efficiency, long context, and hardware co-design.

1) IO-aware & fused attention kernels

2) Ultra-low precision attention

3) Long-context scaling (100K → 1M+ tokens)

4) Hybrid architectures (Attention + State Space)

5) Retrieval-augmented and memory attention

6) Mixture-of-Experts (MoE) + attention

Takeaway: Modern progress is less about changing the attention formula itself and more about how it’s computed, stored, and combined with other mechanisms.
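As a practical example of the fused-kernel point above, PyTorch's scaled_dot_product_attention dispatches to an IO-aware, FlashAttention-style kernel on supported GPUs; this is shown as a hedged usage sketch, not a claim about any specific model.

```python
# Fused attention via PyTorch's built-in scaled_dot_product_attention.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq, d_head = 2, 8, 4096, 64
q = torch.randn(batch, heads, seq, d_head, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to a fused, IO-aware kernel (FlashAttention-style) on supported
# GPUs, avoiding materializing the full (seq x seq) score matrix in HBM;
# falls back to the standard math implementation elsewhere.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```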

Hardware implementations

Attention is dominated by matrix multiplications and memory traffic. Here’s how it lands on silicon:

General‑purpose accelerators

Specialized AI chips / ASICs

Research prototypes

Frontiers

Rule of thumb: Performance hinges on bandwidth and SRAM reuse. Algorithmic tricks like tiling and fused kernels often beat raw FLOPs increases.

Summary table

| Type | Examples | Why use it? |
|---|---|---|
| Soft vs. Hard | Bahdanau; REINFORCE‑trained hard attention | Differentiability vs. efficiency |
| Global vs. Local | Transformer; Longformer | Context range & complexity |
| Self / Cross / Multi‑head | BERT, T5 (self); encoder‑decoder (cross) | Intra‑ vs. inter‑sequence relations; capacity |
| Efficient / Sparse / Linear | BigBird, Linformer, Performer | Long sequences; lower memory/compute |
| Recent advances | FlashAttention-3, SageAttention, long-context models | Scalability, efficiency, hardware alignment |
| Hardware | FlashAttention; TPU; IPU; Cerebras | High throughput, bandwidth efficiency |

Further reading

  • 📝 Blog: Memory Transformers
  • 📝 Blog: Flash attention

Key Insight: Attention is Bandwidth-Bound

Despite its O(n²) complexity, modern attention implementations often outperform linear variants in practice.

Why? Performance is limited by memory bandwidth, not FLOPs. IO-aware kernels like FlashAttention minimize memory movement, making them faster than theoretically efficient alternatives.
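A rough back-of-the-envelope sketch of that claim; all numbers are illustrative assumptions, not measurements.

```python
# Why naive attention is bandwidth-bound: compare FLOPs to bytes moved when
# the full (n x n) score matrix is written to and re-read from HBM.
n, d = 4096, 64
bytes_per_elem = 2                        # fp16

flops = 4 * n * n * d                     # QK^T matmul plus weights @ V matmul
naive_bytes = 2 * n * n * bytes_per_elem  # write + read the n x n score matrix
print("arithmetic intensity (naive):", flops / naive_bytes, "FLOPs/byte")
# A fused, tiled kernel keeps score tiles in on-chip SRAM, so the n^2 traffic
# to HBM disappears and the kernel moves much closer to being compute-bound.
```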

When should you use which attention?