July 18, 2024

Announcing Higgs Llama V2

At Boson AI, we are working on intelligent agents that can serve as human companions and helpers. Today, we are excited to share Higgs-Llama-3-70B-v2, a new model that significantly improves upon its predecessor and narrows the gap to the very best proprietary models on benchmarks relevant to dialog, interaction, and understanding. Arena-Hard and AlpacaEval 2.0 measure the general capabilities of LLMs and correlate well with human preferences, while MMLU-Pro is a recent benchmark that measures an LLM's knowledge and reasoning ability.

Higgs v2

Partnering with the roleplay community, we collected 6.2M dialogues in a two-week A/B test, which allowed us to evaluate Higgs v2 directly against other models. Compared to Claude 3.5 Sonnet, Higgs v2 reduces the response regeneration rate¹ by 21.6%. This rate matters because it directly reflects the cases where users are unhappy with a generated response. Moreover, Higgs v2 increases the day-1 retention rate² by 5.3%.
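
As a rough illustration of how these two metrics can be computed, here is a minimal sketch over a hypothetical event log; the schema and field names below are our own illustration, not an actual logging pipeline.

```python
from datetime import timedelta

# Sketch of the two footnoted metrics over a hypothetical event log.
# Each event is a dict with keys: "user_id", "ts" (a datetime), and
# "type" ("response", "regenerate", "signup", or "visit").
# The schema is illustrative only.

def regeneration_rate(events) -> float:
    """Fraction of model responses the user asked to regenerate."""
    responses = sum(1 for e in events if e["type"] == "response")
    regens = sum(1 for e in events if e["type"] == "regenerate")
    return regens / responses if responses else 0.0

def day1_retention(events) -> float:
    """Fraction of new users who return the day after signing up."""
    signup_day = {e["user_id"]: e["ts"].date()
                  for e in events if e["type"] == "signup"}
    returned = {e["user_id"] for e in events
                if e["type"] == "visit"
                and e["user_id"] in signup_day
                and e["ts"].date() == signup_day[e["user_id"]] + timedelta(days=1)}
    return len(returned) / len(signup_day) if signup_day else 0.0
```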

Higgs Judger

Much of the performance boost in Higgs v2 comes from an improved judging system, which guides model alignment through synthetic feedback signals. We built an in-house LLM reward model, named Higgs Judger, to evaluate model outputs. On Reward Bench, Higgs Judger ties with Google's Gemini 1.5 Pro, the best generative judger on the leaderboard.
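
To give a sense of how a reward model can supply synthetic feedback for alignment, one common pattern is best-of-n sampling: score several candidate responses and keep the best and worst as a preference pair. The sketch below illustrates that pattern only; `generate` and `score` are hypothetical stand-ins, not our actual pipeline.

```python
import random

# Best-of-n sampling with a reward model: one common way to turn a
# judger's scores into synthetic preference signals for alignment.
# `generate` and `score` are hypothetical stand-ins for a policy model
# and a reward model.

def generate(prompt: str, n: int) -> list[str]:
    # Placeholder: in practice, sample n responses from the policy LLM.
    return [f"candidate {i} for {prompt!r}" for i in range(n)]

def score(prompt: str, response: str) -> float:
    # Placeholder: in practice, the reward model (judger) scores the pair.
    return random.random()

def preference_pair(prompt: str, n: int = 8) -> tuple[str, str]:
    """Return (chosen, rejected): the best- and worst-scored candidates.
    Such pairs are the raw material for preference-based alignment."""
    ranked = sorted(generate(prompt, n),
                    key=lambda r: score(prompt, r), reverse=True)
    return ranked[0], ranked[-1]

chosen, rejected = preference_pair("Introduce yourself as a pirate.")
```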

In addition, the judger model learns players' preferences during roleplay from the feedback that users provide.
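
Preference feedback of this kind is typically folded into a reward model with a pairwise objective. The sketch below shows the standard Bradley-Terry loss, which trains a judger to score the response a user preferred above the one they rejected; this is the textbook formulation, not necessarily our exact recipe.

```python
import torch
import torch.nn.functional as F

# Standard Bradley-Terry pairwise loss for training a reward model on
# preference pairs: the judger learns to score the preferred (chosen)
# response above the rejected one. Shown only as an illustration.

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with scores for a batch of four preference pairs.
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                                torch.tensor([0.4, -0.1, 1.5, -1.0]))
```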

What’s Next?

We are conducting more evaluations before the final release. If you would like early access to Higgs v2 or a customized version, please contact us at api@boson.ai.

Acknowledgments

Model: Xingjian Shi, Rand Xie, Weisu Yin

Serving: Yizhi Liu, Zach Zheng

Data / Evaluation: Yi Zhu, Jaewon Lee, Weisu Yin, Canwen Xu

Training Infrastructure: Shuai Zheng, Rand Xie

Hardware: Sergii Tiugaiev, Kells Kearney, Alex Shylo

We would like to thank our customers for their constructive feedback, and our friends at NVIDIA, Arc Compute, eStruxture, Crusoe, AWS, and Scaleway for their excellent technical support.

Performance on Reward Bench

Model                     Reward Bench score
Higgs Judger              88.1
Gemini 1.5 Pro (05/14)    88.1
GPT-4 Turbo (04/09)       85.1
GPT-4o                    84.7
Claude 3.5 Sonnet         83.8
Claude 3 Opus             80.7

Performance on Arena-Hard

Model                     Arena-Hard
Claude 3.5 Sonnet         79.3
GPT-4o                    79.2
Higgs Llama 3 70B v2      78.6
GPT-4 Turbo (01/25)       78.0
Gemini 1.5 Pro            72.0
Claude 3 Opus             60.4
Higgs Llama 3 70B         49.6
Claude 3 Sonnet           46.8
Llama 3 70B Instruct      41.1
Mistral Large             37.7

Performance on AlpacaEval 2.0

Model                     AlpacaEval 2.0
GPT-4o                    57.5
Higgs Llama 3 70B v2      56.7
GPT-4 Turbo (04/09)       55.0
Claude 3.5 Sonnet         52.4
Claude 3 Opus             40.5
Higgs Llama 3 70B         38.6
Claude 3 Sonnet           34.9
Llama 3 70B Instruct      34.4
Mistral Large             32.7

Performance on MMLU-Pro

Model                     MMLU-Pro
GPT-4o                    72.6
Gemini 1.5 Pro            69.0
Claude 3 Opus             68.5
GPT-4 Turbo               63.7
Higgs Llama 3 70B         63.2
Higgs Llama 3 70B v2      62.8
Gemini 1.5 Flash          59.1
Claude 3 Sonnet           56.8
Llama 3 70B Instruct      56.2

Footnotes

  1. The rate at which users regenerate the model's response.

  2. The percentage of new users who return the next day.