The Unreasonable Ineffectiveness of the Deeper Layers¶
These are my notes on this paper about layer pruning.
General Concept¶
Modern LLMs are big, 70 billion parameters big! As far as we're concerned, that's a little extravagant. So if your model feels a little light after these next tests, that's normal. We're gonna hit it with some layer pruning algorithms and see if we can get that down to maybe 20 or 30 billion.
Pruning¶
Pruning is different from distillation. Distillation uses a large model to "teach" a small model (by training the small model to replicate the large model's outputs). With pruning, we're interested in dropping layers of the model to reduce redundancy, or to make the model more skill-specific. Pruning is much less computationally intensive than distillation because you only have to run the pruning algorithm and some fine-tuning (instead of running a large model and then training a small one from its outputs).
First, where $l$ is the index of some layer, we can imagine some equation:
$$ x^{l+1} = x^{l} + f(x^{l}, \theta^{l}) $$
Where $x^l$ is the input tensor for layer $l$, and $\theta^l$ are the parameters for layer $l$. This is recursive, and can be unrolled into a simple sum:
$$x^L = x^0 + \sum_{l=0}^{L-1} f( x^l, \theta^l)$$
Where each term in the summation represents the effect of one layer. If we imagine that there are a lot of layers, and that those layers were independent of each other, then naturally we could remove one or two layers without affecting the final result very much. Of course, this isn't true: if we delete the third layer, then we have to find a way to connect the second layer to the fourth layer.
The exception, of course, is if as we go along our summation we find that the representations aren't changing much from layer to layer, that is to say
$$x^l \approx x^{l-1} + \epsilon$$
Where $\epsilon$ is very small. In that case, in theory, we could delete these near-identity layers without having too much effect on the layers that come after them.
Of course, because the biggest changes are made in the first few layers, deleting those would likely have a cascading effect on everything downstream (they aren't independent), so cutting them would be a big deal. The later layers, though, only make small changes and are often very similar to one another, so we can probably remove some of them. Through experimentation, the paper actually finds that the very last layer is often unique (it does something like taking a max), so we won't remove that one.
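To make the residual-stream picture concrete, here's a toy sketch (not the paper's code; the layer sizes, the tanh nonlinearity, and the layers I skip are all made-up stand-ins) showing that skipping a residual block just deletes its term from the sum:

```python
import torch
import torch.nn as nn

class ToyResidualStack(nn.Module):
    """Toy stand-in for a transformer's residual stream: x^{l+1} = x^l + f(x^l, theta^l)."""

    def __init__(self, num_layers: int = 8, dim: int = 16):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, skip: set = frozenset()) -> torch.Tensor:
        for l, layer in enumerate(self.layers):
            if l in skip:
                continue  # dropping layer l just removes one term from the sum
            x = x + torch.tanh(layer(x))
        return x

model = ToyResidualStack()
x0 = torch.randn(1, 16)
# If layers 5 and 6 happened to contribute almost nothing, these two outputs
# would be nearly identical; with random weights they won't be, of course.
print(torch.norm(model(x0) - model(x0, skip={5, 6})))
```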
Method¶
Angular Distance¶
Before we begin, a quick review that the angular distance between two tensors of length $T$ is given as follows:
$$ \operatorname{dist}(X_1, X_2) = \frac{1}{\pi} \arccos\!\left( \frac{X_1 \cdot X_2}{\lVert X_1 \rVert\, \lVert X_2 \rVert} \right) $$
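A minimal implementation of that distance for two 1-D hidden-state vectors might look like this (the paper computes it over hidden states and averages across tokens/examples, which I'm glossing over here):

```python
import math
import torch

def angular_distance(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """(1/pi) * arccos of the cosine similarity, giving a distance in [0, 1]."""
    cos_sim = torch.dot(x1, x2) / (torch.norm(x1) * torch.norm(x2))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return torch.arccos(cos_sim.clamp(-1.0, 1.0)) / math.pi
```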
Similarity Based Layer Pruning Algorithm¶
The first layer pruning algorithm presented goes as follows:
- Pick a number of layers to prune $n$.
- Compute the angular distance between the input to layer $l$ and the input to layer $l+n$, for each possible starting layer $l$.
- Find the starting layer that minimizes the distance described above; call it $l^*$.
- Drop layers $l^*$ through $l^*+n-1$, so the input that used to feed layer $l^*$ now feeds directly into layer $l^*+n$.
- Optionally do some fine-tuning to smooth things out a bit.
Essentially: find the block of $n$ consecutive layers whose combined effect is closest to doing nothing (the input to layer $l^*$ barely differs from the input to layer $l^*+n$), cut that block out, and splice the surrounding layers back together.
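Here's a rough sketch of the layer-selection step, reusing the angular_distance helper from above. It assumes you've already collected hidden_states from a forward pass on some calibration text, where hidden_states[l] is the (flattened) input to layer l; that list, and the attribute names in the usage comments, are my assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

def find_prune_start(hidden_states: list, n: int) -> int:
    """Return l*, the start of the block of n layers whose combined effect is smallest.

    hidden_states[l] is the input to layer l; hidden_states[-1] is the final output,
    so len(hidden_states) == L + 1 for an L-layer model.
    """
    L = len(hidden_states) - 1
    dists = [angular_distance(hidden_states[l], hidden_states[l + n])
             for l in range(L - n + 1)]
    return int(torch.stack(dists).argmin())

# Usage sketch (attribute names vary by model family and are assumptions):
# l_star = find_prune_start(hidden_states, n=8)
# kept = list(model.layers[:l_star]) + list(model.layers[l_star + n:])
# model.layers = nn.ModuleList(kept)
# ...optionally fine-tune to heal.
```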
Simple Layer Pruning Algorithm¶
- Let $L$ be the total number of layers in the model.
- Pick a number of layers to prune $n$.
- Take layers $L-n$ through $L-1$ (counting the final layer as layer $L$, which we keep, as discussed above) and drop them; see the sketch after this list.
- Perform some fine-tuning to "heal" the model. This is NOT OPTIONAL for this technique, and the model will perform very poorly if you skip it.
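A sketch of the chopping step for a HuggingFace-style decoder, where the blocks live in something like model.model.layers (that attribute path, and the exact config field, are assumptions that vary by model family; keeping the final layer follows the note above):

```python
import torch.nn as nn

def drop_deepest_layers(layers: nn.ModuleList, n: int) -> nn.ModuleList:
    """Remove n layers from the deep end of the stack, keeping the very last layer."""
    kept = list(layers[: len(layers) - n - 1]) + [layers[-1]]
    return nn.ModuleList(kept)

# Usage sketch (attribute path is an assumption):
# model.model.layers = drop_deepest_layers(model.model.layers, n=8)
# model.config.num_hidden_layers = len(model.model.layers)
# ...then fine-tune ("heal"); skipping this step wrecks performance.
```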
How much to cut out?¶
After you cut out a certain amount of a model, its performance starts to fall apart (go figure). Specifically, for the benchmarks used in the paper, they pruned increasing percentages of the following model families until accuracy dropped close to random guessing.
| Model Family | % You Can Cut And Stay Better Than Random |
|---|---|
| Llama 2 | 45%–55% |
| Mistral 7B | 35% |
| Phi-2 | 25% |
| Qwen | 20% |
Layer pruning seems to be more effective on larger models: you can cut out a larger percentage of a 70B-parameter model than a 7B one.
A comparison between these two techniques and the baseline score can be seen in the paper's figures (not reproduced in these notes).
Conclusion¶
Using this technique, models that previously required specialist equipment can run on consumer hardware. Better pruning and healing techniques likely exist, since the ones here are fairly crude. There's also the open question of how to avoid creating these redundancies when we train the model in the first place.
Re-Implementation¶
Sorry bud, maybe some other time. My computer won't run big models, so I can't really test this one out...