June 5, 2024
Since founding Boson AI in 2023, we have dedicated ourselves to empowering enterprises with AI technologies, with a mission to transform how stories are told, knowledge is learned, and insights are gathered. We have helped customers build intelligent agents that interact with their users by playing various roles, including game characters, language tutors, insurance agents, and financial advisors.
Today, we are excited to share our open-source Higgs family of large language models with the community to foster innovation. The first model, Higgs-Llama-3-70B, is a powerful chat model based on Meta's Llama-3 base model. It is specially tuned for role playing while remaining competitive in general-domain instruction following and reasoning.
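As a quick start, here is a minimal sketch of chatting with the model through Hugging Face Transformers. The repository id, precision, and generation settings below are illustrative assumptions; please consult the model card for the exact usage.

```python
# Minimal sketch: load the model and run one chat turn with Transformers.
# The repo id below is assumed; check the model card for the exact usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bosonai/Higgs-Llama-3-70B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 70B parameters: bf16 still needs ~140 GB of GPU memory
    device_map="auto",           # shard the weights across available GPUs
)

messages = [
    {"role": "system", "content": "You are a patient language tutor."},
    {"role": "user", "content": "Explain the subjunctive mood with one short example."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```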
Role Playing
Our customers want models with personality and character. Beyond the general abilities of a "helpful AI assistant", models need to be able to act as a patient teacher, a competent financial advisor, a sympathetic companion, an evil villain, or an ambiguous protagonist straddling the line between good and evil.
To achieve this, models need the ability to follow and adapt to story and scene context, rather than merely emulate a known character. A hero may be tempted in a particular situation, while a villain may be perfectly willing to provide sound advice in poetry, given the right context. To accomplish specific missions and goals and to follow instructions, agents also require strong general-domain reasoning ability.
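To make this concrete, below is a hypothetical sketch of how a persona and scene context might be combined in a single system prompt, so the model reacts to the situation rather than merely imitating a fixed persona. The character card and its fields are invented for illustration and do not reflect our actual prompt format.

```python
# Hypothetical character card: fields and wording are invented for
# illustration and do not reflect Boson AI's actual prompt format.
character = {
    "name": "Captain Mara Voss",
    "persona": "a hardened smuggler who hides a strict moral code",
    "scene": (
        "A rival offers Mara a fortune to abandon a stranded crew. "
        "She needs the money, but the crew trusted her."
    ),
}

# Encode both who the character is and the situation they are in, so the
# model can adapt to the scene instead of playing a static role.
system_prompt = (
    f"You are {character['name']}, {character['persona']}. "
    f"Current scene: {character['scene']} "
    "Stay in character, and let the scene shape your choices."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Take the deal. The crew is not your problem."},
]
```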
We built both pretraining and post-training pipelines to produce models that excel in role playing. The current release showcases the effectiveness of our post-training pipeline. We chose Meta's Llama-3 base model as a strong starting point, then used our in-house teacher models and tooling to guide alignment, making the fine-tuned model strong in both general-domain tasks and role playing.
Benchmarks
All benchmarks eventually suffer from overfitting, including those for LLMs. Training on data that is particularly beneficial for benchmarks typically does not improve (and can even worsen) role-playing performance. We worked to exclude benchmark data, including their training examples, from our fine-tuning data.
We highlight our results on two new and challenging benchmarks: MMLU-Pro and Arena-Hard. MMLU-Pro extends the popular MMLU benchmark. Because it was released only recently (after our model finished training), we believe it also suffers less from overfitting by other released models.
| Model | MMLU-Pro |
|---|---|
| GPT-4o | 72.6 |
| Gemini-1.5-Pro | 69.0 |
| Claude-3-Opus | 68.5 |
| GPT-4-Turbo | 63.7 |
| Higgs-Llama-3-70B | 63.2 |
| Gemini-1.5-Flash | 59.1 |
| Claude-3-Sonnet | 56.8 |
| Llama-3-70B-Instruct | 56.2 |
Arena-Hard contains 500 challenging real user queries from the popular Chatbot Arena.
| Model | Arena-Hard |
|---|---|
| GPT-4o | 79.5 |
| Gemini-1.5-Pro | 72.0 |
| Claude-3-Opus | 60.4 |
| Higgs-Llama-3-70B | 49.6 |
| Gemini-1.5-Flash | 49.6 |
| Claude-3-Sonnet | 46.8 |
| Claude-3-Haiku | 41.5 |
| Llama-3-70B-Instruct | 41.1 |
| GPT-4-0613 | 37.9 |
| Mistral-Large | 37.7 |
With the same base model, Higgs-Llama-3-70B outperforms Meta's Llama-3-70B-Instruct on all six widely used benchmarks below.
| Model | MMLU-Pro | Arena-Hard | AlpacaEval 2.0 LC | MMLU | GPQA | DROP (F1, 3-shot) |
|---|---|---|---|---|---|---|
| GPT-4o | 72.6 | 82.6 | 57.5 | 87.2 | 49.9 | 83.7 |
| Higgs-Llama-3-70B | 63.2 | 49.6 | 38.6 | 80.8 | 42.1 | 81.6 |
| Llama-3-70B-Instruct | 56.2 | 41.1 | 34.4 | 80.2 | 41.3 | 81.4 |
What’s Next?
Higgs-Llama-3-70B is but an appetizer of what Boson AI offers. In future posts, we will dive into its role-playing performance, our post-training pipeline, and our experience building a data center from scratch, using GPUs in the cloud, and straddling multiple providers. We will also release more models from the Higgs family.
We would like to thank our customers for their constructive feedback, and our friends at NVIDIA, Arc Compute, eStruxture, Crusoe, AWS, and Scaleway for their excellent technical support. This wouldn't have been possible without them. There are more stories to tell in the future.