Attention Focus Plots
Rene Claus
This series of figures examines how the attention block behaves on a particular sentence. The sentence was picked arbitrarily.
This is the sentence used, with each token highlighted:
What does the attention mechanism focus on?
Figure 1: Attention Map
This figure visualizes the attention matrix. Each row corresponds to a token, and the values in the row are the attention (before softmax) to all the previous tokens, which I refer to as target tokens.
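As a rough sketch of how such a map could be produced (not necessarily the exact code behind the figure), assuming the per-head query and key activations `q` and `k` have already been captured from the model (e.g. with a forward hook), the pre-softmax scores and the most-attended target per row might be computed like this:

```python
import torch

def attention_map(q, k, alibi_slope):
    """Pre-softmax attention scores for one head (hypothetical helper).

    q, k: [seq_len, head_dim] query/key activations captured from the model;
    alibi_slope: this head's ALiBi slope.
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                      # content term
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # how far in the past
    scores = scores - alibi_slope * distance                # ALiBi penalty
    # Causal mask: a token can only attend to itself and earlier tokens.
    scores = scores.masked_fill(pos[None, :] > pos[:, None], float("-inf"))
    return scores

# The most-attended target token per row (the highlights in Figure 1):
# top_target = attention_map(q, k, slope).argmax(dim=-1)
```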
The figure highlights which target tokens are receiving the most attention. It specifically breaks out two special cases: when the target token is the current token itself (shown in green) and when it is the first token (purple).
Figure 2: Attention Breakdown
This figure shows a breakdown of the components of the attention matrix. Since this model uses ALiBi as its positional encoding, each pre-softmax attention score can be written as the sum of three parts (a sketch of this split follows the list):
- ALiBi: A linear penalty that is steeper for lower head indices and grows with distance, making tokens further in the past less significant.
- Bias: A linear term that depends only on the target token and not on the current token.
- Attention: The quadratic term that depends on both the current token and the target token.
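To make the split concrete, here is a minimal sketch under the assumption that this head's query and key projections carry additive biases (q = xW_q + b_q, k = xW_k + b_k); the exact decomposition depends on the model's implementation:

```python
import torch

def decompose_scores(x, W_q, b_q, W_k, b_k, alibi_slope):
    """Split one head's pre-softmax scores into the parts shown in Figure 2.

    x: [seq_len, hidden] block input; W_q, b_q, W_k, b_k: this head's
    (assumed) query/key projection weights and biases; alibi_slope: the
    head's ALiBi slope. Other architectures may split differently.
    """
    seq_len = x.shape[0]
    scale = W_q.shape[1] ** 0.5                  # sqrt(head_dim)
    q_content = x @ W_q                          # depends on the current token
    k_content = x @ W_k                          # depends on the target token

    pos = torch.arange(seq_len)
    alibi = -alibi_slope * (pos[:, None] - pos[None, :]).clamp(min=0).float()

    bias = (b_q @ k_content.T / scale).expand(seq_len, -1)    # target-only term
    attention = q_content @ k_content.T / scale               # quadratic term

    # Terms involving b_k are constant within a row, so they cancel in the
    # softmax and are omitted here.
    return alibi, bias, attention
```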
By default, the figure shows only the value corresponding to the largest attention value in each row of Figure 1, but the second-, third-, etc. highest values can be shown using the sort_idx filter.
Figure 3: Attention Breakdown - Bias vs ALiBi
This figure shows the relative contribution of the Bias term and the ALiBi term. They both depend only on the target token (plotted along the x-axis). Adding the same value to all target tokens has no effect on the attention matrix after the softmax, so only the trend is interesting.
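That shift-invariance of the softmax is easy to check numerically:

```python
import torch

scores = torch.randn(8)        # pre-softmax scores for one row (one current token)
shifted = scores + 3.7         # add the same value to every target token
same = torch.allclose(torch.softmax(scores, dim=-1),
                      torch.softmax(shifted, dim=-1))
print(same)                    # True: a constant shift cancels in the softmax
```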
How does attention focus change across layers?
Figure 4:
This figure examines how often heads in different layers focus their attention on the current token itself.
The y-axis[^1] plots the fraction of tokens where the target token with the highest attention was the token itself.
Figure 5:
This figure examines how often heads in different layers attend to the first token.
The y-axis[^1] plots the fraction of tokens where the target token with the highest attention was the first token.
Figure 6:
This figure examines how concentrated the attention is on a single token.
The y-axis[^1] plots the difference between the highest and second-highest attention probability (attention after softmax). Since the probabilities across all target tokens sum to 1, a value close to 1 indicates that the head was focused on a single token rather than spreading attention across multiple tokens.
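One way to compute the three per-head summaries used in Figures 4, 5, and 6, assuming `probs` holds one head's post-softmax attention matrix (again a sketch of the idea rather than the exact code):

```python
import torch

def head_focus_stats(probs):
    """Per-head summaries behind Figures 4-6 (a sketch, not the exact code).

    probs: [seq_len, seq_len] post-softmax attention for one head, with
    each row summing to 1 over the allowed (causal) targets.
    """
    seq_len = probs.shape[0]
    top = probs.argmax(dim=-1)                                   # top target per token
    frac_self = (top == torch.arange(seq_len)).float().mean()    # Figure 4
    frac_first = (top == 0).float().mean()                       # Figure 5
    top2 = probs.topk(2, dim=-1).values                          # two highest probs
    concentration = (top2[:, 0] - top2[:, 1]).mean()             # Figure 6 (averaged)
    return frac_self, frac_first, concentration
```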
What does the attention mechanism change?
To study the effect that attention has on the result, I examine how the hidden state is changed by the attention block. To do this, we can compute how much the hidden_state[^2] is changed by each attention head.
From this change to the hidden_state, we can compute two things (a sketch follows the list):
- Update Magnitude: The L2-norm of the context_layer, indicating how strongly each head is changing the hidden_state. We normalize this by the L2-norm of the hidden_state itself.
- Update Orientation: The dot product of the normalized hidden_state and the normalized context_layer, which measures the cosine of the angle between the two.
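A minimal sketch of these two quantities, assuming hidden_state and context_layer are both [seq_len, hidden] tensors for a single head (the head's contribution would need to be extracted from the model, e.g. with a hook):

```python
import torch

def update_stats(hidden_state, context_layer, eps=1e-8):
    """Per-token update magnitude and orientation for one head (a sketch).

    hidden_state: [seq_len, hidden] block input; context_layer: [seq_len, hidden]
    the head's contribution to the block output (names follow the text above).
    """
    h_norm = hidden_state.norm(dim=-1)
    c_norm = context_layer.norm(dim=-1)
    magnitude = c_norm / (h_norm + eps)              # Update Magnitude (Figure 7)
    cosine = (hidden_state * context_layer).sum(dim=-1) / (h_norm * c_norm + eps)
    return magnitude, cosine                         # Update Orientation (Figure 8)
```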
Figure 7: Effect of Attention Head on Hidden State
This histogram visualizes the distribution of update magnitude over tokens for each head in a layer.
Figure 8: Attention Impact Orientation
This histogram visualizes the distribution of update alignment over tokens for each head in a layer. The alignment is measured as the projection of the update onto the input.
Figure 9: Impact Orientation vs Impact Strength
This plot compares the alignment of the hidden_state update to its magnitude. Alignments near zero indicate that the update is orthogonal to the existing hidden_state. Negative alignment means the update is canceling out part of the hidden_state.
Figure 10:
This figure plots the token index (from the start of the sequence) vs the impact on the hidden state (mixing all heads together). For layers 4 and later, when attention focuses on the first token, that head has no impact on the hidden state.
Footnotes
[^1]: The lines in Figure 4, Figure 5, and Figure 6 do not correspond directly to specific heads. Instead, they have been sorted so that the lines do not cross. This makes the visualization easier to read, and nothing is lost since there is no relationship between heads in different layers. The actual head ids are available in the tooltips.
[^2]: The hidden_state is the term I'm using for the input to each block of the model. It is essentially the embedding as it gets updated by each layer.