Attention Focus Plots
Rene Claus
This series of figures examines how the attention block behaves on a particular sentence. The sentence was picked arbitrarily.
This is the sentence used, with each token highlighted:
What does the attention mechanism focus on?
Figure 1: Attention Map
This figure visualizes the attention matrix. Each row corresponds to a token, and the values in the row are the attention (before softmax) to all the previous tokens, which I refer to as target tokens.
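As a rough sketch of how such a map could be produced (not necessarily the exact code behind the figure), assuming the per-head query and key activations `q` and `k` have already been captured from the model (e.g. with a forward hook), the pre-softmax scores and the most-attended target per row might be computed like this:

```python
import torch

def attention_map(q, k, alibi_slope):
    """Pre-softmax attention scores for one head (hypothetical helper).

    q, k: [seq_len, head_dim] query/key activations captured from the model;
    alibi_slope: this head's ALiBi slope.
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                      # content term
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # how far in the past
    scores = scores - alibi_slope * distance                # ALiBi penalty
    # Causal mask: a token can only attend to itself and earlier tokens.
    scores = scores.masked_fill(pos[None, :] > pos[:, None], float("-inf"))
    return scores

# The most-attended target token per row (the highlights in Figure 1):
# top_target = attention_map(q, k, slope).argmax(dim=-1)
```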
The figure highlights which target tokens are receiving the most attention. It specifically breaks out two special cases: when the target token is the current token itself (shown in green) and when it is the first token (purple).
Figure 2: Attention Breakdown
This figure shows a breakdown of the components of the attention matrix. Since this model uses ALiBi as its positional encoding, each pre-softmax attention score can be written as the sum of three parts (a sketch of this split follows the list):
- ALiBi: A linear penalty that is steeper for lower head indices and grows with distance, making tokens further in the past less significant.
- Bias: A linear term that depends only on the target token and not on the current token.
- Attention: The quadratic term that depends on both the current token and the target token.
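To make the split concrete, here is a minimal sketch under the assumption that this head's query and key projections carry additive biases (q = xW_q + b_q, k = xW_k + b_k); the exact decomposition depends on the model's implementation:

```python
import torch

def decompose_scores(x, W_q, b_q, W_k, b_k, alibi_slope):
    """Split one head's pre-softmax scores into the parts shown in Figure 2.

    x: [seq_len, hidden] block input; W_q, b_q, W_k, b_k: this head's
    (assumed) query/key projection weights and biases; alibi_slope: the
    head's ALiBi slope. Other architectures may split differently.
    """
    seq_len = x.shape[0]
    scale = W_q.shape[1] ** 0.5                  # sqrt(head_dim)
    q_content = x @ W_q                          # depends on the current token
    k_content = x @ W_k                          # depends on the target token

    pos = torch.arange(seq_len)
    alibi = -alibi_slope * (pos[:, None] - pos[None, :]).clamp(min=0).float()

    bias = (b_q @ k_content.T / scale).expand(seq_len, -1)    # target-only term
    attention = q_content @ k_content.T / scale               # quadratic term

    # Terms involving b_k are constant within a row, so they cancel in the
    # softmax and are omitted here.
    return alibi, bias, attention
```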
By default, the figure shows only the value corresponding to the largest attention value in each row of Figure 1, but the second-, third-, etc. highest values can be shown using the sort_idx filter.
Figure 3: Attention Breakdown - Bias vs ALiBi
This figure shows the relative contribution of the Bias term and the ALiBi term. They both depend only on the target token (plotted along the x-axis). Adding the same value to all target tokens has no effect on the attention matrix after the softmax, so only the trend is interesting.
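That shift-invariance of the softmax is easy to check numerically:

```python
import torch

scores = torch.randn(8)        # pre-softmax scores for one row (one current token)
shifted = scores + 3.7         # add the same value to every target token
same = torch.allclose(torch.softmax(scores, dim=-1),
                      torch.softmax(shifted, dim=-1))
print(same)                    # True: a constant shift cancels in the softmax
```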
How does attention focus change across layers?
Figure 4:
This figure examines how often heads in different layers focus their attention on the current token itself.
The y-axis[^1] plots the fraction of tokens where the target token with the highest attention was the token itself.
Figure 5:
This figure examines how often heads in different layers attend to the first token.
The y-axis[^1] plots the fraction of tokens where the target token with the highest attention was the first token.
Figure 6:
This figure examines how concentrated the attention is on a single token.
The y-axis[^1] plots the difference between the highest and second-highest attention probability (attention after softmax). Since the probabilities across all target tokens sum to 1, a value close to 1 indicates that the head was focused on a single token rather than spreading attention across multiple tokens.
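One way to compute the three per-head summaries used in Figures 4, 5, and 6, assuming `probs` holds one head's post-softmax attention matrix (again a sketch of the idea rather than the exact code):

```python
import torch

def head_focus_stats(probs):
    """Per-head summaries behind Figures 4-6 (a sketch, not the exact code).

    probs: [seq_len, seq_len] post-softmax attention for one head, with
    each row summing to 1 over the allowed (causal) targets.
    """
    seq_len = probs.shape[0]
    top = probs.argmax(dim=-1)                                   # top target per token
    frac_self = (top == torch.arange(seq_len)).float().mean()    # Figure 4
    frac_first = (top == 0).float().mean()                       # Figure 5
    top2 = probs.topk(2, dim=-1).values                          # two highest probs
    concentration = (top2[:, 0] - top2[:, 1]).mean()             # Figure 6 (averaged)
    return frac_self, frac_first, concentration
```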
What does the attention mechanism change?
To study the effect that attention has on the result, I examine how the hidden state is changed by the attention block. To do this, we can compute how much the hidden_state[^2] is changed by each attention head.
From this change to the hidden_state, we can compute two things (a sketch follows the list):
- Update Magnitude: The L2-norm of the context_layer, indicating how strongly each head is changing the hidden_state. We normalize this by the L2-norm of the hidden_state itself.
- Update Orientation: The dot product of the normalized hidden_state and the normalized context_layer, which measures the cosine of the angle between the two.
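A minimal sketch of these two quantities, assuming hidden_state and context_layer are both [seq_len, hidden] tensors for a single head (the head's contribution would need to be extracted from the model, e.g. with a hook):

```python
import torch

def update_stats(hidden_state, context_layer, eps=1e-8):
    """Per-token update magnitude and orientation for one head (a sketch).

    hidden_state: [seq_len, hidden] block input; context_layer: [seq_len, hidden]
    the head's contribution to the block output (names follow the text above).
    """
    h_norm = hidden_state.norm(dim=-1)
    c_norm = context_layer.norm(dim=-1)
    magnitude = c_norm / (h_norm + eps)              # Update Magnitude (Figure 7)
    cosine = (hidden_state * context_layer).sum(dim=-1) / (h_norm * c_norm + eps)
    return magnitude, cosine                         # Update Orientation (Figure 8)
```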
Figure 7: Effect of Attention Head on Hidden State
This histogram visualizes the distribution of update magnitude over tokens for each head in a layer.
Figure 8: Attention Impact Orientation
This histogram visualizes the distribution of update alignment over tokens for each head in a layer. The alignment is measured as the projection of the update onto the input.
Figure 9: Impact Orientation vs Impact Strength
This plot compares the alignment of the hidden_state update to its magnitude. Alignments near zero indicate that the update is orthogonal to the existing hidden_state. Negative alignment means the update is canceling out part of the hidden_state.
Figure 10:
This figure plots the token index (from the start of the sequence) vs the impact on the hidden state (mixing all heads together). For layers 4 and later, when attention focuses on the first token, that head has no impact on the hidden state.
Footnotes
[^1]: The lines in Figure 4, Figure 5, and Figure 6 do not correspond directly to specific heads. Instead, they have been sorted so that the lines do not cross. This makes the visualization easier to read, and nothing is lost since there is no relationship between heads in different layers. The actual head ids are available in the tooltips.
[^2]: The hidden_state is the term I'm using for the input to each block of the model. It is essentially the embedding as it gets updated by each layer.