Feed Forward Network Parameters
- Authors: Rene Claus
The feed forward network (FFN) has the mathematical structure $\mathrm{FFN}(x) = M_2\,\sigma(M_1^\top x + b)$, where $x$ is the input hidden state, $\sigma$ is the GELU activation function, $b$ is the bias, and $M_1$ and $M_2$ are the weight matrices, which I will call the "input" and "output" matrices.
This can be thought of as similar to a database lookup, with $M_1$ and $M_2$ being the database.
- $x$ is compared to each column of $M_1$ to produce a similarity score.
- $\sigma$ checks that this similarity score is at least some threshold set by the bias, $b$.
- For each column that meets the threshold, we add the corresponding column of $M_2$ to the output. This output is scaled by how much the score exceeded the threshold.
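The lookup steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the FFN is taken to compute `M2 @ gelu(M1.T @ x + b)`, the columns of both matrices are stored as array columns, and the toy sizes and random weights are hypothetical stand-ins for the real model.

```python
import math
import numpy as np

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

def ffn(x, M1, M2, b):
    """x: (d,), M1 and M2: (d, n_columns), b: (n_columns,)."""
    scores = M1.T @ x          # one similarity score per M1 column
    gate = gelu(scores + b)    # soft threshold set by the bias
    return M2 @ gate           # weighted sum of the matching M2 columns

# Toy sizes (hypothetical); the model in this post has 8192 columns per FFN.
rng = np.random.default_rng(0)
d, n_columns = 16, 64
M1 = rng.normal(size=(d, n_columns))
M2 = rng.normal(size=(d, n_columns))
b = rng.normal(size=n_columns)
x = rng.normal(size=d)

out = ffn(x, M1, M2, b)

# The same result written as an explicit loop over "database entries".
gate = gelu(M1.T @ x + b)
lookup = sum(gate[i] * M2[:, i] for i in range(n_columns))
```

Because GELU is smooth, the threshold is soft: columns slightly below it still contribute a small (negative) amount rather than exactly zero.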
The following figures examine the properties of the $M_1$ and $M_2$ columns. This model has 8192 columns in each FFN.
M1-M2 Anti-Alignment
Figure 1 explores the relationship between the $M_1$ columns, $M_2$ columns, and bias. Clear correlations can be seen in later layers.
The correlation between the magnitude of the $M_1$ columns and the bias is easy to explain: if you increase the magnitude of an $M_1$ column, the exact same input hidden_state will pass a higher threshold, because its similarity score scales with the column magnitude. If the magnitudes of the $M_1$ columns are not precisely controlled, then it makes sense that the bias adjusts to keep the likelihood of passing the threshold similar across columns.
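The scaling argument can be checked numerically. The sketch below is a toy illustration, not taken from the model: the vectors, the factor of 3, and the bias value are all hypothetical, and the activation input is assumed to be `score + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)        # stand-in for a hidden_state
column = rng.normal(size=16)   # stand-in for one M1 column
b = -1.5                       # hypothetical bias; activation input is score + b

score = column @ x
scaled_score = (3.0 * column) @ x   # same direction, 3x the magnitude

# The pass/fail decision is unchanged only if the bias scales by the same factor.
pre = score + b
pre_rescaled = (scaled_score + 3.0 * b) / 3.0
```

Here `pre` and `pre_rescaled` agree, which is one way to see why a column with a larger magnitude would pair with a bias of larger magnitude.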
I don't have a complete explanation of the correlation between the $M_2$ column magnitudes and bias, but I interpret the figures to suggest that the $M_2$ columns mostly have a particular magnitude, with some outliers.
The most interesting figure is the alignment between the $M_1$ and $M_2$ columns. If you examine this figure at different layers, you'll observe that there are initially two clusters: one with positive alignment and one with negative alignment. In later layers these two clusters merge into a single cluster with a slight negative alignment.
Figure 1: Correlation with Bias
This figure looks for correlations between the columns of $M_1$ and $M_2$ and the bias. In the 2D histograms, outliers are highlighted to dispel the illusion that the histogram fully captures the distribution.
M1_norm and M2_norm are the magnitudes of the columns of $M_1$ and $M_2$, respectively. M1-M2 Alignment is the cosine similarity of each $M_1$ column with its corresponding $M_2$ column.
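As a rough sketch of how these three quantities could be computed, assuming both matrices are NumPy arrays of shape `(d, n_columns)`; the function name `column_stats` and the random weights are hypothetical and stand in for whatever the original analysis used:

```python
import numpy as np

def column_stats(M1, M2):
    """Per-column norms and the cosine similarity of matching columns."""
    M1_norm = np.linalg.norm(M1, axis=0)
    M2_norm = np.linalg.norm(M2, axis=0)
    # Dot product of column i of M1 with column i of M2, then normalize.
    alignment = np.sum(M1 * M2, axis=0) / (M1_norm * M2_norm)
    return M1_norm, M2_norm, alignment

# Random stand-ins for one layer's weights and bias.
rng = np.random.default_rng(0)
M1 = rng.normal(size=(16, 64))
M2 = rng.normal(size=(16, 64))
b = rng.normal(size=64)

M1_norm, M2_norm, alignment = column_stats(M1, M2)

# One number summarizing a panel of the figure: Pearson correlation with the bias.
corr = np.corrcoef(M1_norm, b)[0, 1]
```

A correlation coefficient hides outliers just as a histogram can, so plotting the raw points alongside it is still worthwhile.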
Figure 2:
In Figure 1 there is an interesting correlation between the M1-M2 Alignment and the bias. This figure fits a line to that correlation and plots the slope, the average bias, and the average alignment as a function of layer.
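A per-layer fit of this kind might look like the sketch below. It assumes an ordinary least-squares line with alignment on the x-axis and bias on the y-axis, which the post does not specify; the helper name and the synthetic data are hypothetical.

```python
import numpy as np

def fit_bias_vs_alignment(alignment, bias):
    """Least-squares line bias ~ slope * alignment + intercept, plus the means."""
    slope, intercept = np.polyfit(alignment, bias, deg=1)
    return slope, intercept, bias.mean(), alignment.mean()

# Synthetic stand-in for one layer: bias depends linearly on alignment plus noise.
rng = np.random.default_rng(0)
alignment = rng.uniform(-0.5, 0.5, size=1000)
bias = 2.0 * alignment - 0.3 + rng.normal(scale=0.05, size=1000)

slope, intercept, mean_bias, mean_alignment = fit_bias_vs_alignment(alignment, bias)
```

Running this once per layer and collecting `slope`, `mean_bias`, and `mean_alignment` gives the three curves described above.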
Figure 3: Column Correlations
This figure shows the cosine similarity of each column with each other column of the same matrix. The bias (offset from zero) in the M1 plot indicates that columns of $M_1$ tend to point, at least partially, in the same direction, while the columns of $M_2$ are relatively uncorrelated.
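The within-matrix comparison can be sketched as follows, again assuming columns are stored as NumPy array columns. The helper `pairwise_cosine` and the synthetic matrices are hypothetical; the second matrix adds a shared component to every column to imitate the kind of offset described for $M_1$.

```python
import numpy as np

def pairwise_cosine(M):
    """Cosine similarity of every distinct pair of columns of M."""
    unit = M / np.linalg.norm(M, axis=0, keepdims=True)
    sims = unit.T @ unit
    i, j = np.triu_indices(M.shape[1], k=1)   # skip self-pairs and duplicates
    return sims[i, j]

rng = np.random.default_rng(0)
random_cols = rng.normal(size=(32, 100))    # uncorrelated columns
shared = rng.normal(size=(32, 1))
biased_cols = random_cols + 2.0 * shared    # every column gets a common component

print(pairwise_cosine(random_cols).mean())
print(pairwise_cosine(biased_cols).mean())
```

For 8192 columns the full similarity matrix has about 67 million entries, so a histogram of a random subset of pairs may be more practical than materializing all of them.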