Background

Meta's release of the weights of the Llama language models has been one of the more interesting things to happen to the open-source machine learning community in recent memory. Having such a powerful language model available to tinker with locally, and in sizes well-suited to experiments with consumer hardware, has led to a lot of very interesting research coming out of dimly lit basements across the world.

The release of the second generation of Llama models was a delight. Trained on a dataset twice the size and with a 4096-token context length, Llama 2 was in many respects a perfect iterative improvement upon an already successful formula. However, there was an unfortunate omission.

Llama 1 was released in four sizes: 7b, 13b, 33b, and 65b parameters. All were (and are) useful in different niches. The 7b model can be off-puttingly capable when trained on specific tasks. 65b is comparable to gpt-3.5-turbo if you squint really hard. 13b is also a size. And much loved by many, the 33b parameter model was the perfect tradeoff between size and capability when quantized to 4 bits and run on a 24GB video card.

Llama 2 was trained in four sizes: 7b, 13b, 34b, and 70b parameters. Three of them were released. One was delayed for being too spicy. And of course it was the good one.

I'll Do It Myself

As we all know, if a company announces that a product that you want is delayed: it means that it's never going to happen and you should get really mad about it on the internet.

Where is the approximately thirty billion parameter Llama 2 model, Meta? Where is it?

After almost an hour of patient waiting I of course came to this rational conclusion and decided to do something about it. How hard could it be? We have a perfectly good Llama-2-13b model that just needs a little thickening up. Stuff a whole bunch more bees in there and we're golden. I've got this.

So what exactly is the difference between a 13b model and a 33b model? The architecture of the two is almost identical, with a few exceptions. Here's a neat little table.

                        13b      33b
Hidden layers           40       60
Hidden size             5120     6656
Num. attention heads    40       52
Intermediate size       13824    17920

We can summarize by saying that there are two major differences: 33b has a) more layers and b) wider latent spaces. This presents two obvious routes to pursue in our quest to embiggen. Adding more layers would work nicely, but surely just duplicating existing layers would never work[1]. Expanding the latent spaces is a much more straightforward task.

Parameter Sizes Aren't Real

The basic insight behind the approach is simple. The nonlinearities in Llama models that directly touch the output are (almost) all element-wise. This means that if we can guarantee that the first $n$ values of a hidden state vector are unchanged from a base model, then the first $n$ values of the output will be unchanged as well.

The hidden state vectors in a Llama model are interacted with in the following ways:

  • embed_tokens transforms tokens into initial hidden state vectors
  • layer.self_attn projects hidden state vectors into multiple lower-dimensional spaces, does some stuff, then does a linear transformation back
  • layer.mlp projects into an intermediate space, does stuff, then a linear transformation back
  • lm_head transforms final hidden state vectors into logits
  • various norms do their norming
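For concreteness, these map onto the following parameter tensors in the Hugging Face checkpoint layout, with shapes written as (out_features, in_features) and taken from the 13b numbers in the table above (a sketch from the config values, not a dump of the actual weights):

```python
# Weight tensors that touch the hidden state in a 13b-shaped Llama model,
# using Hugging Face module names ("N" stands for a layer index).
hidden, intermediate, vocab = 5120, 13824, 32000

shapes = {
    "model.embed_tokens.weight":                      (vocab, hidden),
    "model.layers.N.self_attn.q_proj.weight":         (hidden, hidden),
    "model.layers.N.self_attn.k_proj.weight":         (hidden, hidden),
    "model.layers.N.self_attn.v_proj.weight":         (hidden, hidden),
    "model.layers.N.self_attn.o_proj.weight":         (hidden, hidden),
    "model.layers.N.mlp.gate_proj.weight":            (intermediate, hidden),
    "model.layers.N.mlp.up_proj.weight":              (intermediate, hidden),
    "model.layers.N.mlp.down_proj.weight":            (hidden, intermediate),
    "model.layers.N.input_layernorm.weight":          (hidden,),
    "model.layers.N.post_attention_layernorm.weight": (hidden,),
    "model.norm.weight":                              (hidden,),
    "lm_head.weight":                                 (vocab, hidden),
}
for name, shape in shapes.items():
    print(f"{name:50}{shape}")
```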

If we can get these operations to have the same results as they do in an Official(tm) Llama model, then we can do whatever else we want with the parameters without any degradation in performance. Aside from the norms, these parameters are all matrices, which provides an easy way to meet this guarantee. If our original model's matrix $M$ is of size $(f_{out}, f_{in})$, then we can create an extended matrix $M^\prime$ of size $(f_{out}^\prime>f_{out}, f_{in}^\prime>f_{in})$ and shove whatever garbage we want in there so long as $M^\prime(i\leq f_{out}, j\leq f_{in}) = M(i, j)$ and $M^\prime(i\leq f_{out},j\gt f_{in}) = 0$. More visually, we can build any block matrix along these lines:

$$ \begin{pmatrix} M & 0 \\ A & B \end{pmatrix} $$
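As a quick sanity check of that claim, here's a small PyTorch sketch (the dimensions are the 13b and 33b hidden sizes, but any would do): extend a random matrix this way and the first $f_{out}$ outputs are unchanged, no matter what lives in the new rows and columns or in the extra input dimensions.

```python
import torch

torch.manual_seed(0)
f_in, f_out = 5120, 5120          # original dimensions
f_in2, f_out2 = 6656, 6656        # extended dimensions

M = torch.randn(f_out, f_in)

# Build M' = [[M, 0], [A, B]] with random garbage in A and B.
M_ext = torch.randn(f_out2, f_in2)
M_ext[:f_out, :f_in] = M
M_ext[:f_out, f_in:] = 0          # the zero block is what preserves the output

x = torch.randn(f_in)
x_ext = torch.cat([x, torch.randn(f_in2 - f_in)])   # extra input dims are arbitrary

assert torch.allclose(M @ x, (M_ext @ x_ext)[:f_out], atol=1e-3)
```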

But now what to put there? There is existing literature on simple zero-extension, with apparently good results. But it would be nice for the new extended matrix to contain potentially useful features if there is to be any hope of improving performance without ridiculously long continued training. And what's that I see over there in the corner? Is it perhaps original-flavor Llama 33b, chock full of succulent attention heads?

Open Tensor Surgery

I wrote a simple script to smash together matrices in this fashion and proceeded to build two different frankenmodels. Both were around 22b parameters, which comes from keeping the 13b model's number of layers but matching the latent dimensions of Llama-1 33b. They are available on my Hugging Face page here: $A=0$ (block diagonal), and $A\neq 0$ (block triangular).
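The script itself is mostly bookkeeping around an operation like the one sketched below for a single projection matrix (the function name and plain state-dict handling here are illustrative rather than the actual code, and embeddings and norms need their own treatment):

```python
import torch

def extend_matrix(base: torch.Tensor, donor: torch.Tensor,
                  block_diagonal: bool = False) -> torch.Tensor:
    """Embed a base-model weight matrix into the donor's larger shape.

    Top-left block: the base matrix, untouched.
    Top-right block: zeros, so the base model's behavior is preserved.
    Bottom blocks: the donor's values (A is zeroed too in the block-diagonal variant).
    """
    f_out, f_in = base.shape
    merged = donor.clone()
    merged[:f_out, :f_in] = base
    merged[:f_out, f_in:] = 0
    if block_diagonal:
        merged[f_out:, :f_in] = 0   # A = 0
    return merged

# Hypothetical usage, assuming 13b and 33b state dicts are already loaded:
# merged_q = extend_matrix(sd_13b["model.layers.0.self_attn.q_proj.weight"],
#                          sd_33b["model.layers.0.self_attn.q_proj.weight"])
```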

Both were given extensive training to integrate the donated parameters. By this I mean I spent like twelve dollars of RunPod GPU credits to QLoRA-train them on the barest sliver of the RedPajama dataset. And after all of this, was it the model I dreamed of? Does it compare to that sweet, heavenly 34b Llama-2 that Meta dangles over our heads, tormenting us as the food scraps on a table do a hungry dachshund?

Results

Nah.




Alright, it was actually pretty okay! The models are both entirely coherent, and the community produced a number of fine tunes and merges. People seemed to like them! But there are a few problems from my perspective.

The merge produced with $A=0$ is almost completely untrainable. I made a few attempts at instruction tuning it and would often see beautiful loss curves, then sample the output and see a single token repeated infinitely. The whole idea was to give the model useful, meaningful new features to quickly improve performance during training. For $A=0$, this turned out to be Extremely Not The Case. I suspect the sheer number of zeros makes it too easy to overfit.

The $A\neq 0$ merge has better training dynamics, and does seem to maybe converge a little faster than the base 13b model when given the same dataset. This could easily be wishful thinking, hyperparameter differences, or luck of the draw though. I don't have nearly the kind of compute budget needed to rigorously test that.

I did bungle the handling of norm parameters in this experiment: to properly match the output of the base model, they should have been zero-extended, but I just slapped the extra values on the end and figured it would be fine. It was! But not 100% correct.
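Roughly, the difference for a single norm weight looks like this (a toy sketch; that the tacked-on extra values came from the donor's norm weights is my assumption):

```python
import torch

d_old, d_new = 5120, 6656
base_norm = torch.rand(d_old)     # stand-in for a 13b RMSNorm weight
donor_norm = torch.rand(d_new)    # stand-in for the 33b donor's RMSNorm weight

# Zero-extension: the new dimensions contribute nothing after the norm.
zero_extended = torch.cat([base_norm, torch.zeros(d_new - d_old)])

# Slapping the extra values on the end instead (donor values assumed).
tacked_on = torch.cat([base_norm, donor_norm[d_old:]])
```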

And there's also the simple question of: is it better than the original 13b? From benchmarks, the answer is a resounding "ehhhh, maybe?"

model                        Average   ARC     HellaSwag   MMLU    TruthfulQA
llama2-22b                   58.9      58.53   82.55       54.68   38.84
llama2-22b-blocktriangular   58.77     58.53   82.59       54.64   39.3
llama-2-13b                  58.66     59.39   82.13       55.77   37.38

A small increase in HellaSwag and TruthfulQA, and a small decrease in ARC and MMLU, places both versions slightly ahead of Llama-2-13b on the HuggingFace leaderboard. Meaningfully? Not really. You could argue that this is a decent win given the small amount of compute used.

Conclusions

So what are our takeaways here?

Parameters are Not Inviolable

It's quite common to think of models as monolithic and unchangeable once they are trained. The frankenllama 22b model stands as a solid declaration that with a token nod to the mathematics of it, you can commit grievous crimes against man and nature alike and still have a completely workable model. Fine-tuning is not the end-all of experimentation, and there is lots of room to try things that sound absurd and get interesting results.

Give Us The 34b Model, Meta

C'mon, just chuck it out there. Who needs a chat-trained version? It's fine, it's beautiful just the way it is. Pretty please?

This Particular Approach Mostly Works but Probably Don't Use It

It works, but not to the point that you couldn't get better results with the same budget from progressive growing or fine tuning on a carefully curated dataset. There's probably something interesting to explore in selecting specific features to use from a donor model. Permuting the donor matrices to minimize the difference with the final block matrix could potentially be an interesting future experiment.

I Wish I Had a Larger GPU Budget

Yeah. But don't we all?

Anyway, thanks for coming to my TED talk.


[1] It totally does. Better, actually. Later article to come maybe? We'll see.