Experiment Setup
Rene Claus
Experimental Philosophy
Over the past year, I've been interested in examining an already-trained model to find structure in the model weights that offers insight into how training or the model architecture could be improved. I expect the emergent structure to be similar across most LLMs, so which model I evaluate is relatively unimportant at this point.
I'm deliberately avoiding training because it requires infrastructure and expertise that I don't currently have. In my experience, model training tends to be slow and leads to a workflow heavy on trying things and light on introspection and building intuition. While there are many questions that are best answered through model training, I am focusing my attention on those questions I can answer without sweeping training parameters or modifying the model architecture.
Setup
I run all my analysis using streamlit and create visualizations using plotly. I use my personal desktop computer with an i5-13600K processor, a GTX 1080 graphics card, and 96GB of RAM.
I run on Windows using WSL. I have disabled the use of swap and allowed WSL to use most of the system memory. To keep Python memory usage from growing continuously until it OOMs, I use np.memmap to materialize and load model weight matrices and other large pieces of data.
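As a minimal sketch of that pattern (the file name and shape below are made up for illustration): write a large array to disk once, then reopen it read-only so the OS page cache, rather than the Python heap, holds the data across reruns.
```python
import numpy as np

# Hypothetical file name and shape, for illustration only.
shape, path = (8192, 2048), "mlp_weight_layer0.dat"

# Materialize the array to disk once.
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
mm[:] = np.random.randn(*shape).astype(np.float32)  # stand-in for real weights
mm.flush()
del mm

# Reload lazily; the OS pages data in only as it is touched, so repeated
# streamlit reruns don't grow the Python heap.
weights = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
print(weights[:2, :2])
```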
I sometimes use jax to accelerate model evaluation on the GPU. Most of the work doesn't require running the entire model, though; in those cases I typically stick with numpy since it interacts more naturally with streamlit. The model I'm using came as a pytorch model, which was also able to run on the GPU within WSL.
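For example, a jit-compiled piece of the computation can be offloaded to the GPU while everything around it stays in numpy. This is an illustrative sketch, not my actual analysis code; the shapes happen to match the model's MLP.
```python
import jax
import jax.numpy as jnp
import numpy as np

# Jit-compile one piece of the computation so the GPU does the heavy
# lifting; inputs and outputs stay numpy-friendly.
@jax.jit
def mlp_up(x, w):
    return jax.nn.gelu(x @ w)

x = jnp.asarray(np.random.randn(512, 2048).astype(np.float32))   # activations
w = jnp.asarray(np.random.randn(2048, 8192).astype(np.float32))  # weight matrix
out = np.asarray(mlp_up(x, w))  # back to numpy for streamlit/plotly
```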
Model
I'm using the tiiuae/falcon-rw-1b model from Hugging Face. I initially tried the LLAMA-2 model, but at 7B parameters I ran into memory problems. Now that I've worked with these models, I know how I'd handle those memory problems, but since I'm all set up to use the falcon-rw-1b model, I see no need to switch at this time.
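For reference, a minimal sketch of loading the model with Hugging Face transformers; recent versions support the Falcon architecture natively, while older ones may need trust_remote_code=True.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 1B-parameter model on the CPU; small enough to fit in RAM
# with room to spare for analysis.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-rw-1b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b", torch_dtype=torch.float32
)
model.eval()
```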
I got lucky because, in addition to being relatively small, the model makes some architectural choices that make it interesting to study. Some of the things I like about the model are:
- At only 1B parameters, it's small enough that I can keep it in memory and have memory to spare for the analysis.
- It produces relatively good text completions, so it should have some of the same emergent properties as larger models.
- It uses ALiBi as the positional encoding, which is much simpler than sinusoidal or rotary positional encodings (see the sketch after this list).
- It uses a GeLU activation function, which is simpler than the GLU-style activation used in LLAMA-2.
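For intuition, here is a minimal numpy sketch of the ALiBi bias (my own illustration, not the model's code): each head adds a fixed linear penalty proportional to query-key distance, so there are no learned positional parameters at all.
```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Additive attention bias: slope * (key_pos - query_pos) per head."""
    # Head slopes form a geometric sequence: 2 ** (-8 * i / num_heads)
    # for i = 1..num_heads (from the ALiBi paper).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]          # (query, key) offsets
    return slopes[:, None, None] * rel[None]   # (heads, query, key)

bias = alibi_bias(num_heads=32, seq_len=8)
print(bias[0])  # zero on the diagonal, increasingly negative into the past
```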
The model has these properties:
| Parameter | Value |
| --- | --- |
| embedding size | 2048 |
| number of layers | 24 |
| number of heads | 32 |
| head dimension | 64 |
| MLP size | 8192 |
| activation function | GeLU |
| positional encoding | ALiBi |
| layer normalization | LayerNorm |
| tokenizer | GPT-2 tokenizer |
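These numbers can be sanity-checked against the published config; the attribute names below assume the transformers Falcon implementation.
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-rw-1b")
print(config.hidden_size)          # embedding size: 2048
print(config.num_hidden_layers)    # number of layers: 24
print(config.num_attention_heads)  # number of heads: 32
print(config.hidden_size // config.num_attention_heads)  # head dimension: 64
```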
Dataset
For some of my examinations I want to look at the statistical properties of intermediate activations in the model. To do this I need to run many tokens through the model.
I arbitrarily selected the nampdn-ai/tiny-lessons dataset.
```python
from datasets import load_dataset

# Pull just the raw text snippets from the training split.
dataset = load_dataset('nampdn-ai/tiny-lessons')['train']['text']
```
This dataset has a few properties I like:
- Many relatively short text snippets, short enough that I rarely need to truncate them to keep a reasonable context window.
- A variety of topics, so that a wide range of the model's knowledge gets activated.
- Clean text with no markup or other weird tokens.
I am not using this dataset for training or to evaluate model performance; I only use it to trace different paths through the model.
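As an illustration of that tracing (not my exact pipeline), here is one way to collect intermediate activations over the dataset with pytorch forward hooks; the module path transformer.h[0].mlp is an assumption based on the transformers Falcon implementation.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-rw-1b")
model.eval()

texts = load_dataset('nampdn-ai/tiny-lessons')['train']['text']
captured = []

def save_activation(module, inputs, output):
    # Keep activations on the CPU as numpy arrays for later statistics.
    captured.append(output.detach().cpu().numpy())

# Hook the first layer's MLP output (module path is an assumption).
handle = model.transformer.h[0].mlp.register_forward_hook(save_activation)

with torch.no_grad():
    for snippet in texts[:100]:  # small slice, for illustration
        batch = tokenizer(snippet, return_tensors="pt",
                          truncation=True, max_length=512)
        model(**batch)

handle.remove()
print(sum(a.shape[1] for a in captured), "token activations captured")
```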