This post introduces glassbox, a vLLM plugin for extracting structured signals from transformer attention during inference.

Always start with the why

Transformer internals contain a great deal of information about how a model is behaving. Attention patterns, in particular, can reveal when behavior is becoming concentrated, unstable, or asymmetric — often before those shifts are visible in the output text.

That makes attention a promising source of signals for:

  • Hallucination detection
  • Failure mode diagnosis
  • Task drift detection
  • Uncertainty estimation
  • Ongoing monitoring of model behavior in production

But there is a practical problem. Raw activations and full attention matrices are expensive to retain, and modern serving engines like vLLM are specifically optimized to never materialize the full L×L attention matrix (FlashAttention, Triton kernels). That’s a good thing for performance, but it means the structure we care about is hidden by design.

Because of that, most tools for inspecting transformer internals live in research harnesses around HuggingFace models, creating a gap between papers and what is practical for inference.

What is glassbox?

Glassbox is a vLLM plugin that instruments the attention path during inference to extract compact, structured features from attention-related operators. It is designed to be a practical implementation of ideas from the attention analysis literature in a form that works inside a real inference engine.

The project is built around three ideas:

  1. Research-informed signals
  2. Built for inference
  3. vLLM-native

We will expand on each of these in the following sections.

1. Research-informed signals

Glassbox implements state-of-the-art methods for extracting and analyzing attention structure, and extends them with new techniques from active research by the safety team at Red Hat AI.

Today, glassbox extracts features from four attention-derived operators:

Spectral features from the pre-softmax scores matrix S = QKᵀ. The leading singular values of S reveal attention sharpness (is the head focused on one dominant pattern or distributing attention across several?), whether a head is content-adaptive or positional, and how the attention structure evolves over the course of generation.
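
To make the sharpness idea concrete, here is a toy sketch with synthetic Q and K (not glassbox's actual extraction path): a head whose queries and keys share one dominant direction produces a large σ₁/σ₂ gap in S.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 64, 16  # toy sequence length and head dimension

# A "focused" head: Q and K share one dominant direction u.
u = rng.standard_normal(d)
Q = np.outer(rng.standard_normal(L), u) + 0.1 * rng.standard_normal((L, d))
K = np.outer(rng.standard_normal(L), u) + 0.1 * rng.standard_normal((L, d))

S = Q @ K.T                                # pre-softmax scores matrix
sigma = np.linalg.svd(S, compute_uv=False)
sharpness = sigma[0] / sigma[1]            # large ratio: one dominant pattern
```

A diffuse head (independent random Q and K) run through the same computation yields a ratio close to 1.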

Attention symmetry features from the raw post-softmax matrix A = softmax(S/√d) (AttentionTracker, arXiv:2411.00348). These capture the coupling between symmetric and antisymmetric parts of the attention matrix — structure connected to recent results on mechanistic classification of failure modes such as prompt injection and hallucination.
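
The decomposition itself is simple to state. The sketch below splits a toy attention matrix into its symmetric and antisymmetric parts and reduces them to one scalar; the specific coupling statistics used by AttentionTracker go beyond this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
A = np.exp(rng.standard_normal((L, L)))
A /= A.sum(axis=1, keepdims=True)   # row-stochastic toy attention matrix

A_sym = (A + A.T) / 2               # symmetric (reciprocal attention) part
A_anti = (A - A.T) / 2              # antisymmetric (directed flow) part

# One scalar summary: the fraction of A's energy in the antisymmetric part.
asym_ratio = np.linalg.norm(A_anti) / np.linalg.norm(A)
```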

Diagonal self-attention features from the diagonal of A (LLM-Check, NeurIPS 2024). The self-attention weight A[i,i] — how much a token attends to itself — correlates with model confidence and factuality. Glassbox extracts these without materializing A by computing only the diagonal entries.
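
A sketch of how the diagonal can be computed matrix-free (plain NumPy, no causal mask; glassbox's real path works against the paged KV cache): each A[i,i] needs only the score qᵢ·kᵢ and the row normalizer, which can be accumulated by streaming K in tiles.

```python
import numpy as np

def attention_diagonal(Q, K, block=32):
    """diag(A) for A = softmax(Q K^T / sqrt(d)), never building the L x L matrix.

    Each A[i, i] needs only the score q_i . k_i and the row normalizer,
    accumulated by streaming K in tiles with a running max for stability.
    (Illustration only: no causal mask, plain NumPy rather than kernels.)
    """
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(L, -np.inf)   # running per-row max
    z = np.zeros(L)           # running normalizer, rescaled by m
    for s in range(0, L, block):
        Sb = (Q @ K[s:s + block].T) * scale            # L x block score tile
        m_new = np.maximum(m, Sb.max(axis=1))
        z = z * np.exp(m - m_new) + np.exp(Sb - m_new[:, None]).sum(axis=1)
        m = m_new
    diag_scores = np.einsum("ld,ld->l", Q, K) * scale  # q_i . k_i
    return np.exp(diag_scores - m) / z
```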

Routing features from the degree-normalized post-softmax operator M = D_Q^(-1/2) A D_K^(-1/2) (Dahlem et al., upcoming). Degree normalization removes heterogeneity from uneven attention distributions and exposes the effective routing structure of attention. Glassbox extracts spectral features from M as well as features from its Hodge decomposition — separating attention asymmetry into potential-driven (gradient-like) and circulatory (irreversible) components. These routing and flow features are new implementations from ongoing research by the safety team at Red Hat AI.

2. Built for inference

Research tools that analyze attention typically materialize the full L×L matrix — fine for a notebook, but a non-starter at serving time. Glassbox is designed from the ground up with the inference use case in mind.

Matrix-free algorithms. The core idea is that we never need the attention matrix explicitly. We only need to multiply it by vectors:

S·v  = Q·(Kᵀ·v)      — two thin matrix-vector products, O(Ld)
Sᵀ·u = K·(Qᵀ·u)      — same cost

Each multiplication costs O(Ld) through the L×d factors Q and K — not O(L²). These “matvec” primitives are used in iterative SVD algorithms within glassbox.

For the post-softmax operator, the problem is harder: applying A = softmax(QKᵀ/√d) to arbitrary vectors requires computing softmax without materializing the full matrix. Glassbox handles this with blocked row-streaming (computing softmax in tiles and accumulating results) and a fused Triton kernel that uses online softmax to apply the attention matrix to multiple probe vectors in a single kernel launch — avoiding both O(L²) memory and redundant softmax recomputation.
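
The online-softmax trick can be sketched in a few lines of NumPy (no causal mask, and plain loops standing in for the fused Triton kernel):

```python
import numpy as np

def attention_matvec(Q, K, probes, block=64):
    """A @ probes for A = softmax(Q K^T / sqrt(d)), via online softmax.

    K and the probe rows are streamed in tiles, so peak memory is
    O(L * block + L * r) instead of O(L^2). (Illustration only: no causal
    mask, plain NumPy in place of the fused Triton kernel.)
    """
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    r = probes.shape[1]
    m = np.full(L, -np.inf)   # running per-row max
    z = np.zeros(L)           # running normalizer
    acc = np.zeros((L, r))    # running weighted sum of probe rows
    for s in range(0, L, block):
        Sb = (Q @ K[s:s + block].T) * scale    # L x block score tile
        m_new = np.maximum(m, Sb.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale previous state
        P = np.exp(Sb - m_new[:, None])        # unnormalized tile probabilities
        z = z * alpha + P.sum(axis=1)
        acc = acc * alpha[:, None] + P @ probes[s:s + block]
        m = m_new
    return acc / z[:, None]
```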

Configurable overhead. Glassbox provides fine-grained control over the observability-to-latency tradeoff. Each feature group can be independently enabled or disabled. Extraction intervals, monitored layers and heads, SVD rank, and algorithm choice are all configurable:

scores_matrix:
  enabled: true
  interval: 32          # extract every 32 decode steps
  rank: 4
  method: randomized
  heads: [0, 1, 2, 3]   # monitor only these heads

degree_normalized_matrix:
  enabled: true
  interval: 64           # less frequent for heavier features
  rank: 4
  heads: [0]

attention_tracker:
  enabled: false         # off by default

attention_diagonal:
  enabled: true
  interval: 32
  heads: [0]

matvec_strategy: auto    # triton > batched > loop
output: glassbox.jsonl

You choose how much observability you want and how much latency you’re willing to pay for it.

3. vLLM-native

Glassbox works through vLLM’s supported extension points:

Custom attention backend. Glassbox's SVDTritonAttentionBackend provides access to attention internals.

The implementation class calls the parent forward() first (standard Triton attention runs exactly as before), then accumulates Q tokens and periodically extracts K from the paged KV cache to run the feature extraction pipeline. Everything else — metadata builder, KV cache shape, kernel selection — is inherited unchanged.
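
The call-parent-first pattern is easy to show in miniature. Below, BaseAttentionImpl is a stand-in for vLLM's Triton attention class; all names, signatures, and buffers are illustrative, not glassbox's actual code:

```python
class BaseAttentionImpl:
    """Stand-in for the real Triton attention implementation."""
    def forward(self, q, k):
        return q  # placeholder for the fused attention kernel

class InstrumentedAttentionImpl(BaseAttentionImpl):
    """Run attention unchanged, then periodically extract features."""
    def __init__(self, interval=32):
        self.interval = interval
        self.step = 0
        self.q_buffer, self.snapshots = [], []

    def forward(self, q, k):
        out = super().forward(q, k)  # 1. standard attention runs as before
        self.q_buffer.append(q)      # 2. accumulate Q tokens
        self.step += 1
        if self.step % self.interval == 0:
            # 3. periodically pull K (from the paged KV cache in vLLM)
            #    and run the feature-extraction pipeline on (Q, K)
            self.snapshots.append((self.step, len(self.q_buffer), len(k)))
        return out
```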

We then make the backend available via an entry point in pyproject.toml, from which vLLM loads glassbox automatically:

[project.entry-points."vllm.general_plugins"]
glassbox = "glassbox.vllm_plugin:register_svd_backend"

We are also exploring out-of-tree torch.compile passes and register_forward_hook as additional extraction paths for hidden states and activations — a complementary signal source beyond attention. The latter has coincidentally been proposed as a recent vLLM RFC (vllm-project/vllm#36998), so we are excited to see it land.

A first look at what comes out

Here’s a concrete example. We run glassbox on OPT-125m with the prompt “The future of artificial intelligence is”, generating 64 tokens. SVD snapshots are taken every 16 steps, extracting the top 2 singular values of the pre-softmax scores matrix for head 0 across all 12 layers.

The σ₁/σ₂ ratio — how sharply attention is concentrated on a single dominant pattern — tells a different story at each layer:

Layer   step 16   step 32   step 48   step 64   Behavior
1         14.1      13.0      14.1      14.1    Constant — content-independent (positional)
3         12.2       7.6       6.3       5.1    Decaying — attention spreads as context diversifies
7          1.22      1.47      1.59      2.07   Increasing — one direction consolidates over time

Layer 1 has a rock-stable ratio of ~14 regardless of what’s being generated — the signature of a fixed structural pattern, almost certainly positional attention (e.g. always attend to the BOS token).

Layer 3 starts very sharp (12×) but steadily decays to 5× as the generated text introduces diverse entities. The dominant Q-K direction can’t capture everything — a second direction gains weight.

Layer 7 starts nearly isotropic (1.2×) but slowly sharpens to 2× as the generation settles into a repetitive theme and one attention direction begins to dominate.

This works across architectures. We’ve verified it on GPT-2 (124M, standard multi-head attention) and Qwen2-7B-Instruct (7B, grouped-query attention with 28 Q heads / 4 KV heads) with no code changes.

This is a small model on a toy prompt. But the point here is not the specific numbers — it’s that different layers have qualitatively different spectral behaviors during generation, and those behaviors are interpretable. That’s the kind of structure downstream systems can learn from.

The vision

Our goal is to make a class of model-internal signals operational inside vLLM. We want to provide:

  • Signals that help distinguish normal behavior from anomalous behavior
  • Signals that help localize where a model starts to drift
  • Signals that downstream systems — classifiers, monitors, debugging tools — can learn from, compare, aggregate, and act on

The value is not in any single metric, but in making this family of metrics available in a practical serving pipeline, informed by the best current research on what those metrics should be.

Where we are going next

In future posts, we plan to go deeper into:

  • Feature evaluation. How do the derived features actually correlate with failure modes? We plan to run evaluations across labeled datasets like HaluEval and TruthfulQA and see what signal is there — even if the answer is mixed.
  • Overhead and benchmarks. How much does this cost? What does the observability-to-latency tradeoff look like in practice? And for cases where even minimal inline overhead is unacceptable, we are exploring deployment via llm-d — running glassbox in shadow mode alongside the serving path, separating the observability workload from inference entirely.
  • Which features matter most. Across the different operators and decompositions, which attention-derived features appear most predictive for different failure modes?
  • Layer and head dynamics. How these signals evolve across layers, heads, and decoding steps — and what that tells us about model internals.
  • Design details. The matrix-free extraction path, the fused Triton kernel, and the engineering of the vLLM integration in more depth.

Code: github.com/dmaniloff/glassbox