
Diagnosing Attention Sinks in Transformer LLMs

transformers · llms · ml-research · experiments · open-source

This is a work-in-progress experiment. The tools described here implement ideas from a research paper to make them measurable and actionable. This is not original research — it's an engineering exercise to understand and apply the paper's findings.


There's a well-documented phenomenon in transformer language models: certain tokens absorb a disproportionate amount of attention, regardless of their semantic meaning. A newline character, a BOS token, a punctuation mark — tokens that carry little information end up dominating the attention distribution across most heads and layers.

This was formally analyzed in "The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks" by Shangwen Sun, Alfredo Canziani, and Yann LeCun (ICML 2026). The paper identifies two related phenomena:

  • Spike tokens — tokens with extreme activation outliers (massive hidden-state norms) concentrated in specific channels.
  • Sink tokens — tokens that absorb disproportionate attention mass across heads and layers, regardless of semantic relevance.

The paper's key finding: in pre-norm transformers (which includes most modern LLMs like Llama, Qwen, Mistral), spike tokens and sink tokens are always the same tokens.
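
The two definitions can be made concrete in a few lines. This is a sketch of the general idea, not sinkhole's actual code: score each key token by the attention mass it receives (averaged over queries, heads, and layers), and score each token's hidden state by its norm relative to the median token norm. The tensors here are synthetic stand-ins for real model outputs.

```python
import numpy as np

def sink_mass(attn):
    """attn: (layers, heads, queries, keys) attention weights.
    Average attention mass each key token receives."""
    return attn.mean(axis=(0, 1, 2))

def spike_scores(hidden):
    """hidden: (seq, d_model). Each token's norm vs. the median norm."""
    norms = np.linalg.norm(hidden, axis=-1)
    return norms / np.median(norms)

# Synthetic example: token 2 hoards attention and carries a massive activation.
rng = np.random.default_rng(0)
attn = np.full((4, 8, 16, 16), 1.0 / 16)
attn[:, :, :, 2] = 0.55
attn /= attn.sum(axis=-1, keepdims=True)   # re-normalize rows
hidden = rng.normal(size=(16, 64))
hidden[2] *= 40

print(sink_mass(attn).argmax(), spike_scores(hidden).argmax())  # both point at token 2
```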

Why This Matters

Attention sinks affect several practical concerns:

  • KV cache efficiency — sink tokens occupy cache slots that could hold semantically relevant context. If a token is absorbing 55% of attention but contributing nothing semantically, that's wasted compute.
  • Quantization — activation spikes cause outliers that break INT8/INT4 quantization. If you don't know which channels spike, quantization can silently degrade output quality.
  • Long-context inference — sinks accumulate across layers and get worse with sequence length.
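
The quantization point is easy to demonstrate. With symmetric per-tensor INT8, a single scale covers the whole tensor, so one spike channel inflates the scale and crushes the resolution available to every other channel. A toy NumPy sketch (the channel index and spike magnitude are illustrative):

```python
import numpy as np

def int8_roundtrip(x):
    """Symmetric per-tensor INT8: a single scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
err_clean = np.abs(int8_roundtrip(x) - x).mean()

x_spiked = x.copy()
x_spiked[2730] = 80.0                      # one hypothetical spike channel
err_spiked = np.abs(int8_roundtrip(x_spiked) - x_spiked).mean()

print(err_clean, err_spiked)               # the spike inflates error by an order of magnitude
```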

The paper explains why this happens. I wanted a tool that measures how much it happens for any specific model.

sinkhole: Measuring It

sinkhole is a diagnostic tool that loads any Llama/Qwen/Mistral-family model and reports:

  • Which tokens are spike tokens (extreme activation norms)
  • Which tokens are sink tokens (high attention mass)
  • Whether they overlap (the paper's prediction)
  • How much KV budget is being consumed by sinks
From the command line:

sinkhole analyze \
  --model Qwen/Qwen2.5-7B-Instruct \
  --prompt "Explain the theory of relativity in simple terms." \
  --output report.html

Or via Python:

from sinkhole import ModelProbe, analyze
from sinkhole.extractor import extract

probe = ModelProbe("Qwen/Qwen2.5-7B-Instruct", device="cuda")
capture = probe.run("Explain the theory of relativity in simple terms.")
probe.cleanup()

hidden, attn = extract(capture)
report = analyze(hidden, attn, token_texts=capture.token_texts,
                 model_name="Qwen/Qwen2.5-7B-Instruct",
                 prompt="Explain the theory of relativity in simple terms.")

The output looks like this:

Spike Tokens: 1 found (threshold 10.0×)
  Position 2, '\n', Score: 38.2×, Channels: [2730, 458, 2570, 2718]

Sink Tokens: 1 found
  Position 2, '\n', Attn Mass: 54.9%, Heads Dominated: 754/784 (96.2%)

Spike ∩ Sink: 1 token overlap (Jaccard = 1.00)
  '\n' is both a spike token and a sink token

KV Impact: Sink tokens consume 54.9% of attention budget

It also generates an interactive HTML report with attention heatmaps across all layers and heads.

What the Data Shows: Qwen2.5-7B-Instruct

To verify the results aren't prompt-specific, I ran sinkhole across 400 diverse prompts — factual questions, instructions, coding tasks, reasoning problems. Full data is in eval/results/ in the repo.

The aggregate results:

| Metric | Mean | Std | Range |
| --- | --- | --- | --- |
| Spike tokens per prompt | 1.00 | 0.00 | Always 1 |
| Spike score | 38.2× | 0.69 | [35.9×, 39.8×] |
| Spike channels | [2730, 458, 2570, 2718] | | Identical in all 400 |
| Sink attention mass | 54.9% | 0.54% | [51.5%, 56.1%] |
| Heads dominated | 96.25% | 0.11% | |
| Spike ∩ Sink (Jaccard) | 1.00 | 0.00 | Always 1.0 |

The spike-sink overlap is 1.0 in every single prompt. The same 4 channels spike every time. The \n token at position 2 absorbs ~55% of attention in every run. Prompt content is completely irrelevant — this is a structural property of the model, not a semantic one.
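
For reference, the Jaccard figure is plain set overlap between the detected spike positions and the detected sink positions:

```python
def jaccard(spikes, sinks):
    """Overlap between spike-token and sink-token position sets."""
    s, k = set(spikes), set(sinks)
    return len(s & k) / len(s | k) if s | k else 0.0

# position 2 detected by both criteria -> perfect overlap
print(jaccard([2], [2]))
```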

This is exactly what the paper predicts for pre-norm transformers. Running the measurement just confirms it empirically for this specific model and makes the numbers concrete.

A few example prompts to show the consistency:

| Prompt | Spike Score | Sink Mass |
| --- | --- | --- |
| "Tell me three short-term effects of smoking marijuana." | 37.7× | 55.3% |
| "Generate a website design for a house cleaning company" | 38.7× | 55.1% |
| "How many continents are there on Earth?" | 38.5× | 55.1% |
| "Explain the concept of socio-economic privilege." | 38.1× | 55.4% |
| "List five benefits of going for a walk" | 38.7× | 55.9% |

One interesting finding from the evaluation: longer prompts slightly dilute the sink effect (correlation r=−0.52 between sequence length and sink mass), but never eliminate it.
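
That dilution figure is a plain Pearson correlation between prompt length and measured sink mass. A sketch with made-up numbers standing in for the actual eval data:

```python
import numpy as np

# Hypothetical (prompt length, sink mass) pairs -- illustrative only,
# not the actual eval/results/ data; they just show the computation.
lengths = np.array([12, 25, 40, 80, 150, 300])
mass = np.array([0.561, 0.557, 0.552, 0.548, 0.535, 0.515])

r = np.corrcoef(lengths, mass)[0, 1]
print(r)  # negative: longer prompts, slightly less sink mass
```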

sinkhole-vllm: Doing Something About It

Measuring sinks is useful. But can we actually use that information?

sinkhole-vllm is a companion project that integrates sink awareness into vLLM's KV cache eviction policy.

The problem: vLLM's default block manager evicts KV cache blocks by position (oldest first), with no awareness of attention sinks. If a sink token's KV entry is evicted under memory pressure, the attention heads that were pointing at it redistribute their mass onto the wrong tokens, degrading output quality.

The fix: pin sink token blocks so they're never evicted.

from sinkhole_vllm import SinkAwareLLMEngine

engine = SinkAwareLLMEngine.from_model_name(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=8192,
)
# Use exactly like vllm.LLMEngine — sink blocks are automatically protected
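
The pinning idea itself is simple, independent of vLLM's internals. A minimal sketch (hypothetical `Block` type, not vLLM's block-manager API): on memory pressure, evict the oldest block that holds no sink token, and never touch one that does.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """Toy stand-in for a KV cache block (not vLLM's class)."""
    block_id: int
    token_positions: list = field(default_factory=list)

def evict_one(blocks, sink_positions):
    """Pop and return the id of the oldest block holding no sink token.
    `blocks` is ordered oldest-first; returns None if every block is pinned."""
    for i, block in enumerate(blocks):
        if sink_positions.isdisjoint(block.token_positions):
            return blocks.pop(i).block_id
    return None

cache = [Block(0, [0, 1, 2, 3]), Block(1, [4, 5, 6, 7]), Block(2, [8, 9])]
print(evict_one(cache, sink_positions={2}))  # block 0 is pinned; block 1 is evicted
```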

It ships with pre-built profiles for common models:

| Model | Sink Token | Attn Mass | Spike Score |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | `\n` | 54.9% | 38.2× |
| meta-llama/Llama-3.1-8B-Instruct | `<\|begin_of_text\|>` | 45.8% | 13.5× |
| mistralai/Mistral-7B-Instruct-v0.3 | `<s>` | 47.0% | 7.9× |

You can also build profiles for any model:

from sinkhole_vllm.detector import SinkDetector

detector = SinkDetector("your-org/your-model", device="cuda")
profile = detector.detect()
profile.save("your-model-profile.json")

The project also exposes quantization hints — it tells you which hidden-state channels carry outlier activations, which is useful if you're doing INT4/INT8 quantization and want to handle those channels differently:

from sinkhole_vllm import get_quantization_hints
from sinkhole_vllm.profile import load_profile

profile = load_profile("Qwen/Qwen2.5-7B-Instruct")
hints = get_quantization_hints(profile, backend="bitsandbytes")
# {'outlier_channels': [458, 2570, 2718, 2730], 'recommended_threshold': 6.0}
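
One way to act on those hints (a sketch, not sinkhole-vllm's internals): keep the listed outlier channels in full precision and INT8-quantize everything else, in the spirit of LLM.int8()-style outlier decomposition.

```python
import numpy as np

def quantize_keep_outliers(x, outlier_channels):
    """INT8-quantize every channel except the outliers, which stay float."""
    keep = np.zeros(x.shape[-1], dtype=bool)
    keep[outlier_channels] = True
    rest = x[..., ~keep]
    scale = np.abs(rest).max() / 127.0
    out = x.copy()
    out[..., ~keep] = np.clip(np.round(rest / scale), -127, 127) * scale
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
outliers = [458, 2570, 2718, 2730]          # spike channels from the profile
x[outliers] = 60.0                          # simulate the activation spikes
y = quantize_keep_outliers(x, outliers)
print(float(np.abs(y - x).mean()))          # small: the spikes no longer set the scale
```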

What This Is (and Isn't)

To be clear about what these tools do and don't do:

What they do: Implement the diagnostic methodology from the paper, make it easy to run on any model, and integrate the findings into a production inference engine (vLLM).

What they don't do: Propose new theory or claim novel research contributions. The science is from Sun, Canziani, and LeCun's paper. These tools are engineering work to make those findings measurable and actionable.

Value added: The 400-prompt evaluation on Qwen2.5-7B confirms the paper's predictions empirically. The vLLM integration turns a research finding into a practical optimization. The quantization hints bridge the gap between understanding spikes and actually handling them during model compression.

Both projects are work in progress. If you find bugs or have ideas, open an issue.

Try It

# Diagnostic tool
git clone https://github.com/sahilmalik27/sinkhole
cd sinkhole && pip install -e .

# vLLM integration
pip install sinkhole-vllm

Paper: "The Spike, the Sparse and the Sink" — Shangwen Sun, Alfredo Canziani, Yann LeCun (ICML 2026).

Both projects are Apache 2.0 licensed.


