Step's New Attention Mechanism Reduces the KV Cache by up to 93.7%

Authors
  • Ajax

Introduction to the KV Cache Bottleneck

The growing demand for large language models (LLMs) has highlighted the challenge of efficient large-scale inference. A major bottleneck is the Key-Value (KV) cache used by conventional attention mechanisms: it grows linearly with both batch size and sequence length, and the resulting memory footprint limits how far LLMs can scale. Alternatives such as Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) have been developed to address this, but they often either sacrifice performance under stringent memory constraints or introduce complexity that creates engineering challenges and compatibility issues.
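To make the linear growth concrete, here is a minimal back-of-the-envelope sketch in Python; the model configuration and the function name are hypothetical, chosen only to illustrate how quickly the cache grows.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (hence the factor of 2) are cached for every layer,
    every KV head, every token position, and every sequence in the batch."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class configuration with an fp16 cache (2 bytes/element):
size = kv_cache_bytes(batch_size=32, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.0f} GiB")  # 64 GiB -- and it doubles if batch size or seq_len doubles
```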

Multi-matrix Factorization Attention (MFA)

Researchers from Stepfun, Tsinghua University, and other institutions have introduced Multi-matrix Factorization Attention (MFA) and its variant MFA-Key-Reuse (MFA-KR), a new family of attention mechanisms. MFA substantially reduces the cost of language model inference while simultaneously improving performance. MFA and MFA-KR surpass MLA in performance and match traditional multi-head attention (MHA), while cutting KV cache usage by up to 93.7%. MFA is also designed to be simple, easy to reproduce, insensitive to hyperparameters, and compatible with various positional embedding methods.

MFA Approach and Analysis

The research team analyzed the general design and capacity of attention mechanisms, identifying two critical dimensions of capacity and deriving new analytical methods and design principles from them. They introduced Generalized Multi-Head Attention (GMHA) as a unifying framework for understanding the different MHA variants, examined how keys and values are computed and stored from an inference perspective, and studied model capacity from a decomposition perspective. Fully Parameterized Bilinear Attention (FPBA) was established as the theoretical upper bound on performance, and the team showed that MHA and its variants can be viewed as low-rank decompositions of FPBA.
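The GMHA framing can be read roughly as follows (a simplified PyTorch sketch; the notation and structural details here are a paraphrase, not the paper's exact formulation): every variant instantiates per-head query/key/value projections, and the variants differ mainly in how those matrices are shared or factorized.

```python
import torch

def gmha(x, Wq, Wk, Wv, Wo):
    """Generalized multi-head attention, simplified (no masking, no biases).

    x:  (seq, d_model)
    Wq, Wk, Wv: lists of per-head projection matrices, each (d_model, d_head).
        Passing the same Wk/Wv matrix for every head recovers an MQA-style
        scheme; fully independent per-head matrices recover standard MHA.
    Wo: (n_heads * d_head, d_model) output projection.
    """
    heads = []
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        q, k, v = x @ q_proj, x @ k_proj, x @ v_proj
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        heads.append(attn @ v)
    return torch.cat(heads, dim=-1) @ Wo
```

Under this view, MQA, GQA, MLA, and MFA correspond to different ways of constraining or factorizing these projection matrices relative to the fully parameterized case that FPBA represents.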

Comparison with MQA and MLA

The analysis focused on two representative improvement schemes, MQA and MLA. MQA uses an aggressive parameter-sharing strategy in which all attention heads share the same set of key-value parameters; this reduces memory usage but can hurt the model's expressiveness. MLA instead compresses parameters through a shared latent space, but its effective expressive power is bounded by the smallest dimension in the factorization, so increasing the intermediate dimensions does not significantly improve performance.
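The practical difference shows up in how much must be cached per token. The figures below are illustrative only (dimensions made up for this example), but they show why MQA and MLA shrink the cache and where the capacity trade-off comes from.

```python
def mha_cache_per_token(n_heads, head_dim):
    # full keys and values are stored for every head
    return 2 * n_heads * head_dim

def mqa_cache_per_token(head_dim):
    # one shared key/value head serves all query heads
    return 2 * head_dim

def mla_cache_per_token(latent_dim):
    # a single compressed latent vector is cached instead of full keys/values
    # (simplified; the real scheme also keeps a small positional component)
    return latent_dim

# Made-up dimensions, just to compare orders of magnitude:
print(mha_cache_per_token(n_heads=32, head_dim=128))  # 8192 elements per token per layer
print(mqa_cache_per_token(head_dim=128))              # 256
print(mla_cache_per_token(latent_dim=512))            # 512
```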

MFA Key Innovations

The development of MFA was driven by the goal of creating an attention mechanism that minimizes resource consumption while approaching the theoretical performance limit. MFA's design incorporates three key innovations (a rough code sketch follows the list):

  • Significantly increasing the number and dimension of attention heads to maximize model capacity.
  • Employing an aggressive low-rank decomposition strategy to maintain parameter efficiency while expanding attention head count and dimensions.
  • Utilizing a single key-value head design to keep memory consumption minimal, even with increased model complexity.
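The sketch below shows one way these three ideas can fit together. The shapes, names, and the exact placement of the low-rank factorization are assumptions made for illustration, not the paper's precise parameterization; causal masking, batching, and key reuse (MFA-KR) are omitted.

```python
import torch

class MFASketch(torch.nn.Module):
    """Illustrative only: many query heads built from a low-rank factorized
    projection, plus a single shared key/value head, which is all the KV
    cache has to store."""
    def __init__(self, d_model=1024, n_heads=64, head_dim=128, q_rank=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_down = torch.nn.Linear(d_model, q_rank, bias=False)           # shared low-rank factor
        self.q_up = torch.nn.Linear(q_rank, n_heads * head_dim, bias=False)  # per-head factors
        self.kv = torch.nn.Linear(d_model, 2 * head_dim, bias=False)         # single KV head
        self.out = torch.nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):                                  # x: (seq, d_model)
        seq = x.shape[0]
        q = self.q_up(self.q_down(x)).view(seq, self.n_heads, self.head_dim)
        k, v = self.kv(x).chunk(2, dim=-1)                 # (seq, head_dim) each; only these are cached
        scores = torch.einsum("qhd,kd->hqk", q, k) / self.head_dim ** 0.5
        attn = torch.softmax(scores, dim=-1)
        ctx = torch.einsum("hqk,kd->qhd", attn, v).reshape(seq, -1)
        return self.out(ctx)

print(MFASketch()(torch.randn(16, 1024)).shape)  # torch.Size([16, 1024])
```

Note how, regardless of how many query heads are used, the cache holds only one key vector and one value vector per token, which is where the memory savings come from.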

Capacity Measurement and Comparison

To further analyze MFA and other attention mechanisms, the team introduced two key metrics:

  • Total Effective Rank (TER): the product of the number of attention heads and the factorization rank per head (FRH).
  • Shared Latent Subspace Dimension (SLSD): The dimension of the hidden space shared by all attention heads.

MFA achieves a higher SLSD and TER compared to MQA. Compared to MLA, MFA achieves a smaller KV cache size and higher TER with similar parameter budgets, while maintaining a comparable SLSD. Compared to traditional MHA, MFA has a higher TER, even though its SLSD is smaller.
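As a reading aid for the two metrics, the tiny example below plugs in made-up head counts, ranks, and latent dimensions (none of these numbers come from the paper) to show how TER and SLSD are computed and compared.

```python
def total_effective_rank(n_heads, frh):
    # TER = number of attention heads x factorization rank per head (FRH)
    return n_heads * frh

# Hypothetical configurations: (n_heads, FRH, SLSD). SLSD is simply the
# dimension of the latent space all heads share, so it is read off directly.
configs = {
    "MQA-like": (32, 128, 128),
    "MLA-like": (32, 128, 512),
    "MFA-like": (64, 128, 512),
}
for name, (n_heads, frh, slsd) in configs.items():
    print(f"{name}: TER={total_effective_rank(n_heads, frh)}, SLSD={slsd}")
```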

Experimental Results

Extensive experiments evaluated the new architecture at larger scales, with models ranging from 1B to 7B parameters and training data from 10B to 1T tokens. MFA showed scaling behavior comparable to traditional MHA, maintaining strong performance even at the largest scale. MFA-KR performed slightly worse, but its scaling trend still matched MHA's. The memory-saving advantages of MFA and MFA-KR grew with model size: at the largest scale, MFA saved 87.5% of KV cache memory, and MFA-KR reduced memory usage to 6.25% of the baseline.

Ablation Studies

Ablation studies validated the effectiveness of MFA and MFA-KR. Their performance advantages were also confirmed across various mainstream positional encoding methods.

Outlook

MFA offers significant improvements with a simple design, effectively addressing the memory bottleneck in LLM inference without adding extra engineering complexity. It integrates seamlessly into the existing Transformer ecosystem, accelerating the application of LLMs across various scenarios.