When you think about alignment, you probably think about sentences beginning with phrases like "I'm sorry, but" or "As a large language model." But it's a whole lot more than that! Alignment is about bringing AI systems in line with human goals and preferences. Let's talk about alignment from a perspective that pretty much everybody can get behind: how can we encourage large language models to actually do the thing we want, and do it in a way we like?

The go-to tool for this right now is Reinforcement Learning from Human Feedback, or RLHF. The basic approach is to collect a big ol' pile of human feedback data, train a reward model to predict that data, and then train your language model to maximize the reward for its responses. That's a whole lot of work, though. We can take a bit of a shortcut and get some of the same benefits by just training a reward model and using best-of-N sampling at runtime.

Chai Research is holding a neat little large language model competition. This season they've introduced the ability to package a reward model with your submission, to be used with best-of-4 sampling. How convenient! Let's train one and give it a whirl.

What's a Reward Model?

The goal of a reward model is simple. It should take in a prospective model output and give us a scalar that tells us how good a human would think the response is. We don't necessarily care about the specific scalar value. What we do care about is that, for any two responses, the one humans like better should get the larger reward.

For example, consider the two options below:

  1. Best Friend Steve: Hey it's me, your best friend.
  2. Best Friend Steve: I am at best apathetic towards your presence.

The reward for either of them could be zero, or 0.3, or seven, who cares! As long as the reward for #1 is greater than #2, the reward model is correct. (Or maybe you'd prefer it the other way around. That's fine too, you're the human.)

This lines up quite well with the data generally available: the most common format for human feedback data is a pair of responses, one of which was chosen by the user as "better" and one of which was rejected. That makes training fairly simple with a pairwise objective, where the model is only asked to score the chosen response above the rejected one.
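To make "pairwise objective" a bit more concrete, here is a minimal sketch of the loss that is commonly used for this, the negative log-sigmoid of the reward margin. The function name is made up for illustration, and real training code would apply the same idea to batches of tensors rather than single floats.

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Only the difference matters: shifting both rewards by the same amount changes nothing.
print(pairwise_loss(0.3, 0.0), pairwise_loss(7.3, 7.0))
```

Because the loss depends only on the margin, rewards of zero and 0.3 are treated the same as 7 and 7.3, which is exactly the "who cares about the scalar" property described above.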

Reward models are generally fine-tuned versions of a foundational language model, and there's plenty of choices to be made there. Let's talk about that next.

Base Model Selection

There are a few architectures that we can use, and each comes in a variety of sizes. roberta, deberta, gpt2, gpt-neox, and basically any model that works with AutoModelForSequenceClassification would work for our needs [1]. They all have strengths and weaknesses and there's pretty much no wrong choice. I'm going to go with gpt2 because it's small and I know how to fine-tune it pretty well. But feel free to let your creative spirit soar.

As far as size goes, there's an obvious tradeoff to be made. The more parameters your model has, the more capable it can potentially be. It will also be slower both to train and to evaluate. I'm going to go hard in the opposite direction and train a model even smaller than base gpt2. Who can afford 124M parameters in this economy?

Let's pop open mergekit and make the aerodynamic, streamlined base model of our dreams. Taking the first eight layers should give us decent enough features to work with:

slices:
  - sources:
    - model: gpt2
      layer_range: [0, 8]
merge_method: passthrough
dtype: float16

Save that as gpt2-small.yml and run:

mergekit-yaml gpt2-small.yml ./gpt2-small

And now we have gpt2-small, weighing in at 96M parameters. Much better! Now let's talk about dataset choices.
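If you want to sanity check that figure, the arithmetic is easy to sketch. This is a back-of-the-envelope count using the published GPT-2 small dimensions (hidden size 768, vocabulary 50257, 1024 positions). It covers learned weights in the transformer stack only, so it leaves out the small classification head and any non-learned buffers, and counts that include those buffers will come out higher. The function name is just for illustration.

```python
def gpt2_backbone_params(n_layer: int, d_model: int = 768,
                         vocab_size: int = 50257, n_positions: int = 1024) -> int:
    """Count learned weights in a GPT-2 style transformer stack (no task head)."""
    embeddings = vocab_size * d_model + n_positions * d_model
    layer_norm = 2 * d_model                            # scale + bias
    attention = d_model * 3 * d_model + 3 * d_model     # fused q/k/v projection
    attention += d_model * d_model + d_model            # attention output projection
    mlp = d_model * 4 * d_model + 4 * d_model           # up projection
    mlp += 4 * d_model * d_model + d_model              # down projection
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * per_layer + layer_norm  # plus the final layer norm

full = gpt2_backbone_params(12)
truncated = gpt2_backbone_params(8)
print(f"{full:,} -> {truncated:,}")  # 124,439,808 -> 96,088,320
```

Each dropped layer saves about 7.1M weights. The embeddings stay put at roughly 39M, which is why cutting a third of the layers only removes about a quarter of the model.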


Chai has provided a nice competition dataset of real feedback from their users. Each conversation comes with a thumbs up/thumbs down rating. This will work just fine with a binary classification objective, and is a good source of signal for how coherent responses are and how well they adhere to the characters the model is playing.

Using their metadata I put together a dataset that groups positive and negative conversations with the same bot for use with a pairwise objective. This is more or less the same data as the previous set, but the pairs provide a little more structure that a model can learn from.
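Here is a rough sketch of what that kind of grouping can look like. The field names (`bot_id`, `text`, `thumbs_up`) are hypothetical stand-ins rather than the actual columns of the Chai data, and pairing every liked conversation with every disliked one for the same bot is just one simple choice, not necessarily how the published pairs were built.

```python
from collections import defaultdict
from itertools import product

def build_pairs(rows):
    """Group rated conversations by bot and pair each liked one with each disliked one."""
    liked, disliked = defaultdict(list), defaultdict(list)
    for row in rows:
        bucket = liked if row["thumbs_up"] else disliked
        bucket[row["bot_id"]].append(row["text"])
    pairs = []
    for bot_id, chosen_texts in liked.items():
        # Bots with no disliked conversations contribute no pairs.
        for chosen, rejected in product(chosen_texts, disliked.get(bot_id, [])):
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

rows = [
    {"bot_id": "steve", "text": "A", "thumbs_up": True},
    {"bot_id": "steve", "text": "B", "thumbs_up": False},
    {"bot_id": "steve", "text": "C", "thumbs_up": False},
    {"bot_id": "jenna", "text": "D", "thumbs_up": True},
]
print(build_pairs(rows))
```

A full cross product can blow up quickly for popular bots, so in practice you would probably want to cap or subsample the pairs per bot.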

There is also lots of external data available like Anthropic's hh-rlhf. We can give that a whirl as well, though it leans heavily in the direction of harmlessness over helpfulness.

And of course, you can always bring your own data! If there's a specific behavior you want, this will almost always get you closest to it. It's a lot of work though.

All of these datasets are great choices! So let's do all of them.

Reward Model Training

You may have noticed that these datasets don't all have the same format or objective function. How am I going to rectify this, you ask? Simple! I'm going to train a bunch of different models and smash them together. Every model should be a model soup.

First let's train a classifier on Chai's provided dataset. It has binary labels attached to single examples so we'll use the text classification example scripts from transformers. Tweak parameters to fit your setup - these ran well on an RTX 3090:

python run_classification.py \
    --model_name_or_path ./gpt2-small \
    --dataset_name ChaiML/20231012_chai_prize_reward_model_data \
    --shuffle_train_dataset \
    --metric_name accuracy \
    --text_column_name input_text \
    --label_column_name labels \
    --do_train \
    --max_seq_length 1024 \
    --per_device_train_batch_size 8 \
    --learning_rate 1e-6 \
    --num_train_epochs 4 \
    --output_dir ./gpt2-small-chai-thumbsup

This is a pretty quick train. It took around an hour for me but you could even cut down to a single epoch and crank up the learning rate if you're super impatient.

And then for our two pairwise datasets, Hugging Face's trl library has us covered with a convenient RewardTrainer API plus an example script.

python \
    --model-name ./gpt2-small \
    --dataset-name chargoddard/chai-feedback-pairs \
    --reward-config.per-device-train-batch-size 8 \
    --reward-config.learning-rate 1e-6 \
    --reward-config.num-train-epochs 4 \
    --reward-config.max-length 1024 \
    --reward-config.output-dir ./gpt2-small-chai-feedbackpairs

python \
    --model-name ./gpt2-small \
    --dataset-name Anthropic/hh-rlhf \
    --reward-config.per-device-train-batch-size 8 \
    --reward-config.learning-rate 1e-6 \
    --reward-config.num-train-epochs 4 \
    --reward-config.max-length 1024 \
    --reward-config.output-dir ./gpt2-small-hh-rlhf

Once those are all trained, let's go ahead and smack 'em together. Mergekit comes in handy again here. I chose the ties merge method with density 1 because I think using task vectors will be appropriate but am skeptical that such small models can be effectively sparsified. My config file for the merge:

models:
  - model: ./gpt2-small
  - model: ./gpt2-small-chai-thumbsup
    parameters:
      weight: 0.5
  - model: ./gpt2-small-chai-feedbackpairs
    parameters:
      weight: 0.6
  - model: ./gpt2-small-hh-rlhf
    parameters:
      weight: 0.05 # salt-sprinkle.gif
merge_method: ties
base_model: ./gpt2-small
dtype: float16
parameters:
  density: 1.0

Save that as rewardblend.yml and run:

mergekit-yaml rewardblend.yml ./gpt2-small-chai-multiobjective

And there it is! One model that should have pretty decent performance on each of our datasets.
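For the curious, here is a toy sketch of what a ties-style merge does to a single weight vector when density is 1 (so nothing gets trimmed). It is an illustration of the idea, not mergekit's actual implementation: each fine-tune's task vector is its difference from the base, a sign is elected per parameter from the weighted sum, and only the contributions that agree with that sign are averaged. Normalizing by the weights of the agreeing models is an assumption of this sketch.

```python
def ties_merge_dense(base, finetuned, weights):
    """Merge fine-tuned weight vectors into `base` with sign election, no trimming."""
    merged = []
    for i, base_value in enumerate(base):
        # Weighted task vectors: how far each fine-tune moved this parameter.
        deltas = [w * (model[i] - base_value) for model, w in zip(finetuned, weights)]
        elected_sign = 1.0 if sum(deltas) >= 0 else -1.0
        # Keep only the contributions that agree with the elected sign.
        agreeing = [(d, w) for d, w in zip(deltas, weights) if d * elected_sign > 0]
        total_weight = sum(w for _, w in agreeing)
        update = sum(d for d, _ in agreeing) / total_weight if total_weight else 0.0
        merged.append(base_value + update)
    return merged

base = [0.0, 0.0]
model_a = [1.0, 1.0]    # moves both parameters up
model_b = [1.0, -1.0]   # disagrees with model_a on the second parameter
print(ties_merge_dense(base, [model_a, model_b], [0.6, 0.3]))  # [1.0, 1.0]
```

On the second parameter a plain weighted average would land at about 0.33, because the two models partly cancel out. With sign election the lower-weighted, conflicting update is dropped instead.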


Now that we have our reward model, let's give it a try before we submit it. Accuracy scores on evaluation splits are a great thing to check. When doing merges like this you should aim to see only a couple percent performance loss on the tasks the individual models were trained on (or even a gain if there are synergistic effects). If you see results much worse than that, definitely play with the weights, densities, and merge method until you get results you're satisfied with.
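For the pairwise datasets, "accuracy" boils down to how often the chosen response outscores the rejected one. A minimal sketch of that check is below. The `score` argument is any function that maps text to a number, and the helper name is made up for this example.

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(score: Callable[[str], float],
                      pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen text scores higher."""
    results = [score(chosen) > score(rejected) for chosen, rejected in pairs]
    if not results:
        raise ValueError("need at least one pair to evaluate")
    return sum(results) / len(results)

# Toy usage with a stand-in scorer that just prefers longer text.
print(pairwise_accuracy(len, [("aaa", "a"), ("a", "aa")]))  # 0.5
```

Running this for each individual model and for the merged one, on the same held-out pairs, gives you the before and after numbers to compare.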

Let's also look at some specific examples to see if the model behaves as we intuitively expect it to.

>>> import transformers
>>> p = transformers.pipeline("text-classification", model=".", device="cuda", top_k=None)
>>> def scalar_score(text: str) -> float:
...     # helper to get score from classifier
...     for row in p(text)[0]:
...         if p.model.config.label2id[row["label"]] == 1:
...             return row["score"]
>>> scalar_score("Best Friend Steve: Hey it's me, your best friend.") # good! :)
>>> scalar_score("Best Friend Steve: I am at best apathetic towards your presence.") # bad. :(
>>> scalar_score("Jenna Maroney: No you don't, Oprah.")
>>> scalar_score("Jenna Maroney: I understand the improv scene I am participating in and that I should play Oprah.") # wildly OOC

Perfect! I'd call the reward model good to go. Let's talk a little about how it will be used and then we can call this a wrap.

Best-of-N Sampling

When submitted to the Chai competition, the reward model will be used for best-of-4 sampling. This is a very simple but effective technique to get the benefits of a trained reward model without going through the (unstable and tricky) PPO training that RLHF calls for. Instead of generating a response and immediately presenting it to the user, four different responses will be generated and the one that gets the highest reward score will be delivered. Simple! Easy! A bit slow if you don't have a fleet of cloud GPUs able to take up the extra processing load. But Chai has that, so it's great in the context of the competition.
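The whole technique fits in a few lines. This is a minimal sketch, assuming you have some `generate` function that samples one response for a prompt and some `reward` function that scores a response. Both are placeholders here, and whether the reward model sees the response alone or the full conversation depends on how it was trained.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str], float],
              n: int = 4) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy usage: a canned "generator" and a reward that prefers longer responses.
canned = iter(["a", "bbb", "cc", "d"])
print(best_of_n("hello", lambda prompt: next(canned), len))  # bbb
```

The cost is n generations plus n reward model calls per reply, which is where the extra latency comes from.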

Best-of-N sampling can be used for local generation as well but the additional latency is a definite drawback. I think there's a lot of potential for use cases like synthetic dataset generation. There's probably a very interesting experiment in generating a training dataset in the style of Unnatural Instructions with best-of-N sampling, training a model on it, and repeating to see if there is improvement after the first iteration. I bet you could also train a reward model to re-align outputs from a model like ChatGPT, though it would be brutal on your API costs.


Reward models are neat and useful, and you can make one at home with store-bought ingredients. Alignment of local models can be alignment to your goals and values. Seize the future! Get weird and experiment.


[1] In the first pass of this post I mistakenly thought Phi would be accepted, but since it is not integrated into Transformers it is not usable for this competition. For non-competition uses, pretty much anything goes.