Stabilizing Quantized KV Caches via Subspace Embeddings and Manifold Interpolation
Large Language Models (LLMs) function as autoregressive predictors, computing attention over a cached history of key-value projections to avoid the O(T^2) cost of recomputing past projections at every decoding step. As context lengths grow, memory constraints necessitate aggressive quantization of these caches to low-precision integer representations. This quantization introduces a metric mismatch between the storage and semantic spaces: small perturbations under Hamming distance in the storage domain do not correspond to bounded perturbations under Euclidean distance after dequantization. A linear-algebraic framework is proposed for stabilizing quantized KV caches with classical error-correcting codes. Quantized values are embedded into higher-dimensional subspaces over F_2 via the generator matrices of Hamming and Golay codes, enabling O(1) per-codeword syndrome-based detection and correction of bit-flip errors before values are lifted back to R^d. For errors exceeding the code's correction capacity, a geometric recovery strategy exploits the local smoothness of cache sequences, linearly interpolating detected erasures from neighboring positions. Experiments on GPT-2 and LLaMA-3.1-8B demonstrate that at bit error rates up to 10^-2, unprotected caches diverge catastrophically, while protected caches maintain baseline performance.
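The two mechanisms the abstract describes can be illustrated with a minimal sketch. The fragment below is an assumption-laden toy, not the paper's implementation: it uses the (7,4) Hamming code as a stand-in for the Hamming/Golay family, with the standard parity-check matrix whose columns are the binary representations of 1..7, so a nonzero syndrome directly names the flipped bit position. It also sketches the geometric fallback, replacing an uncorrectable cache entry by the linear interpolation of its temporal neighbors. All function names (`encode`, `correct`, `interpolate_erasure`) are illustrative.

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code. Column i (1-indexed)
# is the binary expansion of i, so the syndrome Hc (mod 2), read as a
# binary number, is the 1-indexed position of a single flipped bit.
H = np.array([[0, 0, 0, 1, 1, 1, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1]], dtype=np.uint8)

def encode(data4: np.ndarray) -> np.ndarray:
    """Embed 4 data bits into a 7-bit codeword (data at positions 3,5,6,7)."""
    c = np.zeros(7, dtype=np.uint8)
    c[[2, 4, 5, 6]] = data4
    c[0] = c[2] ^ c[4] ^ c[6]  # parity over positions 3,5,7
    c[1] = c[2] ^ c[5] ^ c[6]  # parity over positions 3,6,7
    c[3] = c[4] ^ c[5] ^ c[6]  # parity over positions 5,6,7
    return c

def correct(c: np.ndarray) -> np.ndarray:
    """Syndrome decode: flip the bit named by a nonzero syndrome."""
    s = (H @ c) % 2
    pos = int(s[0]) * 4 + int(s[1]) * 2 + int(s[2])
    if pos:
        c = c.copy()
        c[pos - 1] ^= 1
    return c

def interpolate_erasure(cache: np.ndarray, t: int) -> np.ndarray:
    """Geometric fallback: reconstruct an uncorrectable cache vector at
    step t as the midpoint of its temporal neighbors, exploiting the
    local smoothness of the KV sequence."""
    return 0.5 * (cache[t - 1] + cache[t + 1])
```

The (7,4) code corrects any single bit flip per codeword; the interpolation path is only invoked when the syndrome check signals an error beyond that capacity.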