At Boson AI, we are working on intelligent agents that can serve as human companions and helpers. Today, we are excited to share Higgs-Llama-3-70B-v2, a new model that significantly improves upon its predecessor. It narrows the gap to the very best proprietary models on benchmarks relevant for dialog, interaction, and understanding. Arena-Hard and AlpacaEval 2.0 measure the general intelligence of LLMs and correlate well with human preference. MMLU-Pro is a recent benchmark that measures an LLM's knowledge and reasoning capability.
Partnering with the roleplay community, we collected 6.2M dialogues in a 2-week A/B test. This allowed us to evaluate Higgs v2 directly against other models. Compared to Claude 3.5 Sonnet, Higgs v2 reduces the response regeneration rate¹ by 21.6%. This rate matters because it directly reflects the cases where users are unhappy with a generated response. Moreover, Higgs v2 increases the day-1 retention rate² by 5.3%.
Higgs Judger
Much of the performance boost in Higgs v2 comes from an improved judging system, which guides model alignment through synthetic feedback signals. We built an in-house LLM reward model, named Higgs Judger, to evaluate model outputs. On Reward Bench, Higgs Judger ties with the best generative judger, Google's Gemini 1.5 Pro, on the leaderboard.
In addition, this judger model learns the preferences of players during roleplay, using the feedback that users provide.
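To make the idea of synthetic feedback signals concrete, here is a minimal, hypothetical sketch of the general pattern: a judger model scores candidate responses, and the scores are turned into chosen/rejected pairs for preference-based fine-tuning. The `judge.generate` interface, the rubric, and the helper names below are illustrative assumptions, not Boson AI's actual pipeline.

```python
# Hypothetical sketch: using an LLM judge to turn model outputs into preference
# pairs for alignment. The judge object, its generate() method, and all helper
# names are placeholders, not a real Boson AI API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def score_response(judge, prompt: str, response: str) -> float:
    """Ask the judge model to rate a response on a 1-10 scale and parse the number."""
    rubric = (
        "Rate the assistant reply for helpfulness, coherence, and staying in "
        "character on a scale of 1 to 10. Answer with a single number.\n\n"
        f"User: {prompt}\nAssistant: {response}\nScore:"
    )
    raw = judge.generate(rubric, max_new_tokens=4)  # assumed judge interface
    try:
        return float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparseable judgments contribute no signal


def build_preference_pair(judge, prompt: str, resp_a: str, resp_b: str) -> Optional[PreferencePair]:
    """Compare two candidate responses and emit a chosen/rejected pair, or None on a tie."""
    score_a = score_response(judge, prompt, resp_a)
    score_b = score_response(judge, prompt, resp_b)
    if score_a == score_b:
        return None
    chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)
```

Pairs produced this way can feed a standard preference-optimization step; in-production feedback, such as a user regenerating a response, can play the same role as the judge's score.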
What’s Next?
We are conducting more evaluations before the final release. If you would like early access to Higgs v2 or need customization, please contact us at api@boson.ai.