Mixtral
Mistral AI recently released a new model, Mixtral-8x7B-v0.1, and it's a delight. It uses an architecture based on their (also delightful) Mistral-7B, but where the original flavor uses a standard MLP section, Mixtral uses a set of eight "experts". Clearly the folks at Mistral are doing something right because Mixtral punches way above its weight class.
Thanks to this release Mixture of Experts has been the hot new buzzword around town. This type of model has historically been pretty tricky to train, but now there's a great base with a permissive license to work from. The open source community is busy at work doing fine tunes of Mixtral and exploring what it's capable of. There are some impressive early takes on it but I suspect it'll take a lot of iteration before we collectively figure out the formula for really extracting the best out of Mixtral.
There's more potential here than the obvious value in the model itself. Mixtral being released didn't just give us a model, it gave us a known-working Mixture of Experts architecture integrated into the transformers library as a first-class citizen. That's big! Now just about anyone can train a MoE model without having to come up with their own bespoke architecture.
What about the little guy though, huh? This is all great if you have enough GPUs lying around to train models, but most people don't. Well, I have great news for you, fellow non-billion-dollar-company-or-research-team! The Mixtral architecture has much to offer us. Let's dig in a bit more, because there are some fun things we can do with it.
Architecture
As I mentioned earlier, Mixtral is quite similar to the Mistral architecture. Aside from the token embedding and language model head, you can divide all of the parameters in a Mistral model into three groups: self attention, layernorm, and MLP[1]. A Mixtral model with eight experts consists mostly of the exact same parameters as a Mistral model. There are two major differences. The first, and most obvious, is that a Mixtral model has eight[2] different sets of Mistral MLP parameters - the eponymous experts in our mixture. The second is that each layer has a simple linear gate that assigns a score to each of the eight experts.
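If you'd like to see the shape of that difference in code, here's a rough sketch of the MoE block that replaces the plain MLP in each layer. This is simplified pseudo-Mixtral, not the actual transformers implementation - the names and routing details are approximations.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Stand-in for the standard Mistral feedforward block (SwiGLU-style)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


class SparseMoeBlock(nn.Module):
    """Mixtral-style replacement for the single MLP: several experts plus a gate."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The only genuinely new parameters: one small linear gate per layer.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # num_experts full copies of the Mistral MLP parameters.
        self.experts = nn.ModuleList(
            MLP(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # Single-token case to keep the sketch simple: hidden_state is (hidden_size,).
        scores = self.gate(hidden_state)                   # (num_experts,)
        weights, chosen = torch.topk(scores, self.top_k)   # keep only the top-k experts
        weights = torch.softmax(weights, dim=-1)
        return sum(w * self.experts[int(i)](hidden_state) for w, i in zip(weights, chosen))
```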
Looking at this setup, my immediate first thought was "neat! I'm going to clown-car a bunch of regular Mistral models together." If it weren't for the need for gate parameters this would be trivial. Select a model to use the self attention parameters from, then take the MLP parameters from whatever you fancy and cram 'em in there. Unfortunately we really do need gates. You could randomly initialize the gates and do some fine tuning, but that takes time and resources, which we've already established we don't have. And I like my crimes against common sense to be a bit more accessible.
MoE Gates Without Training
Instead of turning to the big hammer that is fine tuning, let's look a bit at what the gates are actually doing. They are quite simple and straightforward - each gate is a single matrix that takes in a hidden state vector and outputs the expert scores. If you have a hidden size of 4096 and 8 experts, as Mixtral-8x7B does, then each gate is an 8x4096 matrix. Thanks to our friend Linear Algebra™ we can say that each row of that matrix is a hidden state vector associated with one expert, and at inference time the experts whose vectors are closest[3] to the actual current hidden state will be selected.
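Here's a toy illustration of that view of the gate - the numbers are random and purely for show, but the mechanics are the same:

```python
import torch

hidden_size, num_experts, top_k = 4096, 8, 2

# The gate is just a matrix: one row per expert.
gate_weight = torch.randn(num_experts, hidden_size)

# Some hidden state arriving at this layer for the current token.
hidden_state = torch.randn(hidden_size)

# Scoring the experts is a single matrix-vector product...
scores = gate_weight @ hidden_state            # shape: (num_experts,)

# ...which is the same as taking the dot product of the hidden state with
# each expert's row. The rows pointing most in the same direction win.
assert torch.allclose(scores, torch.stack([row @ hidden_state for row in gate_weight]))

top_scores, top_experts = torch.topk(scores, top_k)
print("selected experts:", top_experts.tolist())
```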
With that in mind, there's a straightforward strategy that we can use to select the parameters for the gates. We can choose a set of prompts that we'd like to be associated with each expert, run them through a "base" model, then use the actual hidden state values of the prompts for our gates.
For example, say that we have a model that is good at math and a model that is good at storywriting. We can compute the hidden states for the prompt "As a Doctor of Math, here is some of my favorite math:" and use those for the gate parameters associated with the math model's MLP parameters. "Chapter 1: The Horse that was Also a Wizard" gives us our indisputably perfect vector for the storywriting model.
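A minimal sketch of that idea, using transformers directly. The real script combines positive and negative prompts and is smarter about which hidden states it uses - grabbing the final token's state at each layer is just my simplification here, not necessarily what mergekit does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # the "base" model whose hidden states we sample
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model.eval()

# One prompt per expert, in the order the experts will appear in the merged model.
expert_prompts = [
    "As a Doctor of Math, here is some of my favorite math:",
    "Chapter 1: The Horse that was Also a Wizard",
]

signatures = []
with torch.no_grad():
    for prompt in expert_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, hidden).
        # Take the final token's state at each layer as this expert's signature.
        signatures.append(torch.stack([h[0, -1, :] for h in out.hidden_states]))

# Shape: (num_experts, num_layers + 1, hidden_size). The gate weight for decoder
# layer i is then (roughly) the slice signatures[:, i, :], copied into that layer's
# router - so scoring a hidden state boils down to "which prompt does this most resemble?"
signatures = torch.stack(signatures)
```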
This almost certainly won't give the even distribution of expert use and diffusion of knowledge that training a MoE from scratch would, but it will work quite nicely for our purposes.
To be a bit more flexible I wrote a script that takes in a set of positive and negative prompts for each expert and combines their hidden state representations. For my first attempt at this kind of merge, I decided to try combining 5 Mistral models to make a 4-expert Mixtral. It worked pretty well! It's on huggingface here.
Like I do with most of my early tools, I gave Undi the first look at the script. (Thanks for testing all of my semi-broken junk, Undi.) As usual he cranked out a bunch of cool stuff that blows my initial attempt out of the water. If you're in the mood for trying out the weird, bleeding edge of experimental roleplay models then check out Mixtral-4x7B-DPO-RPChat, Mixtral-8x7B-MoE-RP-Story, or Undi95/Toppy-Mix-4x7B. You can also see some examples of config files with prompts for the gates.
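For reference, the config for one of these merges looks roughly like the sketch below. The model names are placeholders and the exact field names may not match the current state of the branch, so treat it as illustrative rather than copy-paste gospel:

```yaml
base_model: mistralai/Mistral-7B-v0.1   # donor for attention/layernorm parameters
gate_mode: hidden                       # build gates from prompt hidden states
dtype: bfloat16
experts:
  - source_model: some-org/math-mistral-7b          # placeholder name
    positive_prompts:
      - "As a Doctor of Math, here is some of my favorite math:"
  - source_model: some-org/storyteller-mistral-7b   # placeholder name
    positive_prompts:
      - "Chapter 1: The Horse that was Also a Wizard"
    negative_prompts:
      - "Solve for x:"
```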
Now, this is great and all, but we can go a step further. Right now we just have (parts of) Mistral models ensembled together in our pseudo-Mixtral. Mistral and Mixtral are practically the same word. There's only one letter difference - this isn't nearly weird or upsetting enough of a merge.
More Weird, More Upsetting
Why not a Llama MoE?
As you might know, Mistral has almost the exact same architecture as Llama. The only difference is in Mistral's use of sliding window attention. As long as you don't go beyond the initial context window you can actually slap the weights from a Mistral model into a program expecting Llama weights and it'll work no problem. This more-or-less works both ways.
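If you want to convince yourself of that, here's a quick sanity-check sketch using transformers - assuming you have the RAM to hold two 7B models at once, which is an assumption and a half:

```python
import torch
from transformers import AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

# Load a Mistral model the normal way.
mistral = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Build a Llama config with the same shapes. Mistral-7B uses grouped-query
# attention, which the Llama implementation also supports.
config = LlamaConfig(
    vocab_size=mistral.config.vocab_size,
    hidden_size=mistral.config.hidden_size,
    intermediate_size=mistral.config.intermediate_size,
    num_hidden_layers=mistral.config.num_hidden_layers,
    num_attention_heads=mistral.config.num_attention_heads,
    num_key_value_heads=mistral.config.num_key_value_heads,
    max_position_embeddings=mistral.config.max_position_embeddings,
    rms_norm_eps=mistral.config.rms_norm_eps,
    rope_theta=mistral.config.rope_theta,
)
llama = LlamaForCausalLM(config).to(torch.bfloat16)

# The parameter names line up one-to-one, so the weights drop straight in.
missing, unexpected = llama.load_state_dict(mistral.state_dict(), strict=False)
print("missing:", missing, "unexpected:", unexpected)  # expect these to be empty
```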
Thanks to the mostly-shared architecture, we can perform the same merge procedure with Llama-based models with minimal tweaking. For an example of one of these in action see Undi's first Llamix2 4x13B[5].
The script is available in mergekit on the mixtral branch here. Give it a shot if you like. Want a 4x13B model? No problem! Why not a 32x1B? Live a little. Treat yourself. There's surely a lot of potential in Mixture of Experts models, and there's no reason to let the fact that you don't own a semiconductor fab or a substantial stake in NVIDIA stop you from experimenting.
For those more inclined to fine tune, there's a mode for random gate initialization (and an option to add gaussian noise to the experts if you're feeling saucy). Common speculation right now is that Mistral trained Mixtral initialized with Mistral's parameters[4], so maybe that works? We'll find out.
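If you go that route, the knobs boil down to something like this - the scales here are my own guesses, not whatever the script actually defaults to:

```python
import torch

hidden_size, num_experts = 4096, 8

# Random gate initialization: small Gaussian weights, to be trained afterwards.
gate_weight = torch.randn(num_experts, hidden_size) * 0.02

# Optional Gaussian noise on the copied expert weights, so identical experts
# have a chance to diverge from one another during fine tuning.
def perturb(weight: torch.Tensor, scale: float = 1e-3) -> torch.Tensor:
    return weight + torch.randn_like(weight) * scale
```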
Future Work
As mentioned, this approach of using prompts to determine gate vectors almost certainly gives a very different distribution of expert usage from a trained-from-scratch mixture of experts. Recent literature shows that standard MoE models tend to have individual experts that focus on specific parts of speech or simple low-level concepts. This approach instead assigns experts according to a more human frame of reference - which is potentially both good and bad. Subjective results are pretty great so far, but some work definitely needs to be done to evaluate how well the resulting models use their added parameters.
I'm also investigating using more advanced merging techniques like otfusion to make MoE merges out of disparate base models. The recent glut of high-quality small foundation models being released with permissive licenses is potentially a great opportunity for making a powerful mixture of experts without spending much of your own money on compute.
That's all! End of post. Bye now.
[1] This terminology is really funny to me. MLP stands for Multi-Layer Perceptron. Perceptrons are one of the venerable distant ancestors of neural networks as we know them, and are very much not present in modern transformers. They were (are) used for binary classification and have the Heaviside step function as their activation. But we're used to talking about MLPs and "dense linear feedforward network with activation function of the month (vegan)" just doesn't have the same zip.
[2] Eight is just the number used by Mixtral-8x7B - the architecture supports an arbitrary number. Don't do two or fewer though. That breaks the transformers implementation for whatever reason.
[3] The dot product is the metric we care about here.
[4] Mistral mixtral mistral Mistral, Mixtral mistral.
[5] This model is a spicy boy. It's a cool world's-first public Llama MoE, but probably don't show it to your boss or anything. Maybe wait for the second or third one. Llamix2-MLewd-4x13B