Minimising memory reads and writes between slow high bandwidth memory (HBM) and fast on-chip Static Random Access Memory (SRAM).
Standard attention is memory-bound (limited by memory bandwidth rather than compute power).
FlashAttention reduces the number of HBM accesses to O(N²d²/M), where N is sequence length, d is head dimension, and M is SRAM size.
On GPUs, performance is often limited not by how many FLOPs the hardware can execute, but by how often data must be read from or written to slower memory. In other words, the bottleneck is usually memory access rather than computation.
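A back-of-envelope roofline check makes this concrete. The sketch below counts the FLOPs needed to form the attention score matrix S = QKᵀ against the HBM traffic incurred if S is materialised in fp32 (written once, read back for the softmax). The peak FLOP/s and bandwidth figures are assumed A100-class numbers, used only for illustration:

```python
# Roofline back-of-envelope for standard attention.
# Hardware figures below are assumed A100-class specs, for illustration only.
N, d = 4096, 64                    # sequence length, head dimension

# FLOPs to compute the score matrix S = Q @ K^T
flops = 2 * N * N * d

# HBM traffic if S is materialised in fp32: one write, one read-back
bytes_moved = 2 * (N * N * 4)

intensity = flops / bytes_moved    # achieved FLOPs per byte = d / 4

peak_flops = 312e12                # assumed peak tensor FLOP/s
peak_bandwidth = 2.0e12            # assumed HBM bandwidth in bytes/s
ridge = peak_flops / peak_bandwidth  # FLOPs/byte needed to be compute-bound

print(f"arithmetic intensity: {intensity} FLOPs/byte")
print(f"ridge point:          {ridge:.0f} FLOPs/byte")
print(f"memory-bound:         {intensity < ridge}")
```

With d = 64, the achieved intensity is only 16 FLOPs per byte, far below the ridge point of roughly 156, so the kernel sits squarely in the memory-bound regime; this is precisely the gap FlashAttention closes by keeping S in SRAM instead of round-tripping it through HBM.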
Three components to optimise:
Compute: Time spent by the GPU computing floating-point operations (FLOPs).
Memory: Time spent transferring tensors within a GPU.
Overhead: All other operations (Python interpreter, PyTorch dispatch,…).
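The memory component is what FlashAttention attacks: by processing K and V in tiles and maintaining a running softmax, the full N×N score matrix never has to be written to HBM. The following is a minimal NumPy sketch of that online-softmax tiling (function names are illustrative; a real kernel fuses these steps in CUDA and keeps each tile in SRAM):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materialises the full N x N score matrix."""
    S = Q @ K.T
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    """Online-softmax tiling: only a (N, block) tile of scores exists at once."""
    N, d = Q.shape
    O = np.zeros((N, d))          # running (unnormalised) output
    m = np.full(N, -np.inf)       # running row-wise max of scores
    l = np.zeros(N)               # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j : j + block], V[j : j + block]
        S = Q @ Kj.T                           # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row max
        p = np.exp(S - m_new[:, None])         # tile probabilities (unnormalised)
        scale = np.exp(m - m_new)              # rescale previous accumulators
        l = l * scale + p.sum(axis=1)
        O = O * scale[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions compute the same result; the tiled version just never holds more than an N×block slice of the scores, which is what lets the real kernel keep its working set inside SRAM.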
Figure 2: Example of Mixture of Experts architecture for large models.