April 3, 2025

Introducing Higgs Audio — Advanced Audio Understanding and Generation

At Boson AI, we work on making communication with AI as easy, natural, and fun as talking to a human. Today, we are excited to introduce Higgs Audio Understanding and Higgs Audio Generation — two powerful tools for building customized AI agents that serve diverse audio understanding and generation needs.

Higgs Audio Generation

To communicate with humans in a delightful and natural manner, we need to be able to generate realistic, emotionally competent, and well-accentuated speech. We need a system capable of pronouncing words correctly even when they come from a foreign language, as is often the case for people's names and places. We need a system that can generate conversations between multiple speakers, particularly when multiple characters in games are involved, or when reading books or screenplays aloud.

Pure TTS (text-to-speech) systems struggle with these tasks, since they typically do not understand the meaning of what they're generating, nor any sense of urgency, hesitation, or other intonations that would be plainly obvious to a human speaker. They also struggle to adopt the natural character of a speaker, e.g. whether they're naturally enthusiastic or more deliberate and thoughtful.

The way to address this problem is to build a TTS system using a Large Language Model (LLM) as a backbone. This endows the TTS system with the understanding needed to generate competent speech. Higgs Audio Generation enhances the underlying LLM to process audio by treating raw audio as tokens. This approach enables the model to be trained end-to-end on extensive text-audio datasets.
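To make the audio-as-tokens idea concrete, here is a minimal sketch of how discrete audio codes can share a vocabulary with text tokens so that a single LLM models both. The function name, vocabulary offset, and token ids are illustrative assumptions, not the actual Higgs Audio implementation:

```python
def build_sequence(text_ids, audio_codes, audio_offset=50000):
    """Interleave text tokens and quantized audio codes into one sequence.

    A neural codec quantizes raw audio into discrete codebook indices.
    Shifting those indices into a reserved region of the LLM vocabulary
    keeps them from colliding with ordinary text token ids, so the model
    can be trained end-to-end to predict p(audio tokens | text tokens).
    """
    shifted = [c + audio_offset for c in audio_codes]
    return text_ids + shifted

# Text prompt followed by the audio the model should learn to emit.
seq = build_sequence(text_ids=[101, 2054, 2003], audio_codes=[12, 7, 950])
```

In practice the codec emits many codes per second of audio (often several parallel codebooks), but the principle is the same: once audio is discrete, standard next-token training applies unchanged.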

The base model we are introducing today demonstrates impressive performance on benchmark tests. It also showcases emerging capabilities, including generating speech with an emotional tone matched to the text's semantics and producing multi-speaker dialogues from written transcripts, thanks to its improved understanding. Before diving into technical details, let's listen to two examples of audio generated by our model.

Text: A profound sense of realization washed over Beal as he whispered, "You've been there for me all along, haven't you? I never truly appreciated you until now."
Text: Overwhelmed with confusion and despair, David Darlan cried out, "What do you want from me? Why can't you just tell me what's wrong? Leave me alone!"

Audio Generation Benchmarks

Of course, beautiful-sounding audio is only part of the story. We also need to verify objectively that it works better. For that purpose we evaluate the performance of Higgs Audio against CosyVoice2, Qwen2.5-omni, and ElevenLabs on two widely used audio generation benchmarks: Seed-TTS Eval and the Emotional Speech Dataset (ESD). In this comparison, each model is given a reference (text, audio) pair and must generate audio for the target text while matching the style of the reference audio. As the table shows, Higgs Audio is meaningfully better at generation than the reference models, including ElevenLabs.

               Seed-TTS Eval        ESD
               WER ↓   SIM ↑    WER ↓   SIM ↑
CosyVoice2      2.28   65.49     2.71   80.48
Qwen2.5-omni†   2.33   64.10      -      -
ElevenLabs      1.43   50.00     1.66   65.87
Higgs Audio     2.18   66.27     1.49   82.84
† Qwen2.5-omni's performance is taken from the official report.
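For reference, the WER column measures the word-level edit distance between a transcript of the generated audio and the target text, normalized by the reference length (lower is better). A minimal sketch of the metric, independent of any particular ASR system:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

In the benchmarks above, the hypothesis comes from transcribing the generated audio with an ASR model, so WER also captures mispronunciations, not just text mismatches.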

Audio Generation Judger

Over the past decade, TTS systems have improved dramatically, and assessing their quality has become correspondingly difficult. Traditional metrics such as word error rate (WER) or Mean Opinion Score (MOS) provide only a rough estimate of speech quality. In particular, these metrics fail to capture crucial elements like the naturalness of tone, pitch, energy, pauses, and non-verbal cues such as sighs.

This problem is reminiscent of problems in natural language processing, where mere character- or word-level agreement is no longer an accurate measure of quality. For instance, 'I saw the cat' is semantically closer to 'I saw the kitten' than to 'I saw the rat', even though the latter differs from the first sentence by only a single character.
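The cat/kitten example is easy to verify with a surface-similarity measure. The sketch below uses Python's standard-library `difflib` to show that character-level matching ranks 'rat' as closer than 'kitten', the opposite of the semantic ordering:

```python
import difflib

def char_similarity(a: str, b: str) -> float:
    # Surface similarity in [0, 1] from character-level matching blocks.
    return difflib.SequenceMatcher(None, a, b).ratio()

s_rat = char_similarity("I saw the cat", "I saw the rat")
s_kitten = char_similarity("I saw the cat", "I saw the kitten")

# Character overlap prefers 'rat', even though 'kitten' is the closer
# meaning -- exactly why surface metrics mislead for quality evaluation.
assert s_rat > s_kitten
```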

Drawing inspiration from the concept of LLM-as-a-judge, we leverage advanced audio understanding models to assess the quality of generated audio. Specifically, we selected 120 text prompts from BASE TTS categories, including “Compound Nouns”, “Emotions”, “Foreign Words”, “Paralinguistics”, “Questions”, and “Syntactic Complexities”. We then use Gemini-2.0 Flash to evaluate whether the generated audio outperforms the industry standard, ElevenLabs. We refer to this evaluation as the EmergentTTS-Eval benchmark. Let’s listen to how this works in practice:

Text: His face lit up with pure delight as he exclaimed, "We did it! We won the championship! I knew we could do it together!"
Higgs Audio (System 1)
ElevenLabs (System 2)
Judger output: Both systems successfully synthesized the text. However, system 1 better captured the emotion of excitement in the phrase "We did it! We won the championship!" by using a more enthusiastic tone and varying the pitch to convey the speaker's delight. System 2, while clear and understandable, sounded less emotionally expressive, making system 1 the winner.

We compare the model’s performance against Qwen2.5-omni, ElevenLabs, and GPT-4o-mini-TTS using the EmergentTTS-Eval benchmark.

                  WER ↓   Win-rate ↑
Qwen2.5-omni       6.74     50.83
GPT-4o-mini-TTS    3.14     58.33
ElevenLabs         1.31     50.00
Higgs Audio        1.82     61.67

This illustrates that the model is capable of producing natural and emotionally expressive speech that aligns with the semantic context. In terms of benchmark numbers, our model performs well relative to other models in a paired comparison. ElevenLabs has a win rate of exactly 50% by construction, since we used it as the baseline comparator for our benchmark.
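The win-rate scoring can be sketched as follows. This is an illustrative reconstruction of the standard pairwise-comparison bookkeeping, not the exact EmergentTTS-Eval scoring code; note how ties count as half a win, so a system compared against itself lands at 50%:

```python
def win_rate(outcomes):
    """Win rate (%) of a system judged pairwise against the baseline.

    outcomes: list of "win" / "tie" / "loss" verdicts from the judge,
    one per evaluation prompt. A tie contributes half a point, which is
    why the baseline system scores exactly 50% against itself.
    """
    score = sum(
        1.0 if o == "win" else 0.5 if o == "tie" else 0.0
        for o in outcomes
    )
    return 100 * score / len(outcomes)

# Example: 2 wins, 1 tie, 1 loss over 4 prompts.
rate = win_rate(["win", "win", "tie", "loss"])
```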

Emergent Capability of Generating Multi-speaker Dialogues

We noticed that Higgs Audio can produce realistic multi-speaker dialogues from a transcript. This ability highlights the model’s strong semantic understanding of the text, enabling it to uncover the underlying story and respond accordingly. Below are some audio examples created by directly feeding the raw transcript (generated by ChatGPT) into the model. You’ll observe that the model successfully role-plays multiple characters and generates natural interruptions and filler words.

Training Higgs Audio
SPEAKER 0: You're training Higgs Audio again? Aren't you tired of staring at it all day?
SPEAKER 1: Ha! This time, I'm trying to get it to generate multi-speaker dialogues.
SPEAKER 0: Oh, so you want it to sound like a real conversation with multiple people? That sounds… tricky.
SPEAKER 1: It is. The biggest challenge is making sure it understands who's speaking and when. We need a solid dataset with real conversations, including interruptions and natural flow.
SPEAKER 0: Right, because real conversations aren't just people taking turns like robots. There are overlaps, hesitations, and sudden topic changes.
SPEAKER 1: Exactly! That's why we need speaker diarization—so the model knows when one speaker stops and another starts, even if they overlap.
Conversation about music
SPEAKER 0: You know, every time I listen to that song, it feels like I'm hearing it for the first time.
SPEAKER 1: Yeah, it's like the music just hits differently each time. The way the beat drops in the chorus gives me chills.
SPEAKER 0: Exactly! And the lyrics... it's almost like they're speaking to me. Like the artist was in my head when they wrote it.
SPEAKER 1: Same! That line about "finding peace in the chaos" always gets me. It's like they knew exactly what I needed to hear.
SPEAKER 2: I agree with you two. I love how music can do that. It has this way of connecting with you, even when you're not sure why. That's the reason I like that song.
Argument (Warning - loud audio)
SPEAKER 0: I can't believe you did that without even asking me first!
SPEAKER 1: Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
SPEAKER 0: Overreact? You made a decision that affects both of us without even considering my opinion!
SPEAKER 1: Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act.
Higgs Boson particle (with interruptions)
SPEAKER 0: So the Higgs boson is basically the particle that—
SPEAKER 1: Wait, wait—before you go full science mode, is it really a particle or just part of a field?
SPEAKER 0: Good question. Technically, it’s an excitation in the Higgs field—
SPEAKER 1: Let me stop you there. You’re saying mass comes from... moving through this field?
SPEAKER 0: Kind of, yeah! Particles feel resistance, and that’s their mass.
SPEAKER 1: That’s wild.
SPEAKER 0: That’s physics.
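The transcripts above follow a simple `SPEAKER N: utterance` convention. As a hypothetical sketch (the `SPEAKER` tag format matches the examples, but the parsing helper is illustrative, not the actual Higgs Audio API), preparing such a raw transcript for the model might look like:

```python
def parse_turns(transcript: str):
    """Split a raw multi-speaker transcript into (speaker, utterance) turns."""
    turns = []
    for line in transcript.splitlines():
        if not line.strip():
            continue
        # Split only on the first ": " so utterances may contain colons.
        tag, _, utterance = line.partition(": ")
        turns.append((tag, utterance))
    return turns

transcript = "\n".join([
    "SPEAKER 0: You're training Higgs Audio again?",
    "SPEAKER 1: Ha! This time, I'm trying to get it to generate "
    "multi-speaker dialogues.",
])
turns = parse_turns(transcript)
```

Because the model consumes the transcript directly, it must infer each speaker's voice, timing, and interruptions from the text alone, which is what makes this an emergent capability rather than an engineered feature.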

Higgs Audio Understanding

Speaking is only half of the story; listening is the other half. To build a competent system for human-machine interaction, such as a sales agent, we need an audio understanding model. Again, this goes beyond mere speech recognition: emotion, context, and background noise all matter for a full understanding.

Similar to Higgs Audio Generation, we start with a pretrained LLM. To obtain the understanding model, we feed raw audio into the LLM and train the resulting end-to-end model on large-scale text-audio understanding datasets. The pretrained base model demonstrates impressive performance across both audio understanding and reasoning benchmarks.

Understanding Benchmarks

We evaluated Higgs Audio on several audio understanding benchmarks, comparing it to Gemini-2.0-flash and GPT-4o-audio. Higgs Audio shows strong performance across the board.

[Chart: audio understanding benchmark results]

Reasoning Benchmarks

Next, we evaluated the model’s audio reasoning capabilities on MMAU. Higgs Audio performs well on sound and speech tasks. It lags behind on music tasks due to limited music coverage in our datasets. Nonetheless, by utilizing the Chain-of-Thought (CoT) capacity of the base LLM, its performance on music tasks is significantly enhanced.

[Chart: MMAU audio reasoning benchmark results]

As demonstrated in the examples below, the CoT reasoning process not only improves interpretability but also boosts performance. This is particularly true for audio understanding tasks that involve counting, mathematical reasoning, complex semantic comprehension, external knowledge extraction, and sound event analysis.
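The CoT transcripts below all follow the same four-stage structure. As a sketch, the prompt driving them might look like the template below; the stage names match the transcripts, but the exact prompt wording is an assumption:

```python
# Structured CoT prompt template; the four stage names mirror the
# PLANNING / CAPTIONING / REASONING / SUMMARY sections in the examples.
COT_TEMPLATE = """Answer the question about the audio in four stages:
PLANNING: outline the steps needed to answer the question.
CAPTIONING: describe what you hear in the audio.
REASONING: work through the question step by step.
SUMMARY: state the final answer.

Question: {question}
Choices: {choices}"""

prompt = COT_TEMPLATE.format(
    question="How many times does the word 'boat' appear in the sentence?",
    choices="three, two, five, zero",
)
```

Forcing the model to caption the audio before reasoning gives it an explicit textual representation to reason over, which is where most of the accuracy gain on counting and knowledge-heavy questions comes from.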

Chain of Thought Example
CoT example 1 - Speech
Audio:

Question: How many times does the word 'boat' appear in the sentence?

Choices:
  • three
  • two
  • five
  • zero

Correct Answer: two

Chain of Thought:

PLANNING

The user wants to know how many times the word 'boat' appears in the given sentence. To answer this, I need to carefully listen to the audio and count the occurrences of the word 'boat'. I will break down the process into the following steps:

1. Listen to the entire audio clip to get an overview.

2. Identify and isolate the word 'boat' in the sentence.

3. Count the number of times 'boat' appears.

4. Summarize the findings.

CAPTIONING

The audio clip contains a single sentence spoken by a male voice. The sentence is: "Both blue boats brought bait, but the boatmen borrowed Braden's rods." The speaker's voice is clear and the sentence is spoken at a moderate pace. There are no background noises or interruptions, making it easy to focus on the spoken words.

REASONING

Okay, let's see. The user wants to know how many times the word 'boat' appears in the sentence. So, I need to carefully listen to the audio and count each occurrence. The sentence is: "Both blue boats brought bait, but the boatmen borrowed Braden's rods."

First, I'll break down the sentence into parts to make it easier. The first part is "Both blue boats brought bait." Here, I hear the word 'boat' twice. It's clear and distinct. Then, in the second part, "but the boatmen borrowed Braden's rods," the word 'boat' doesn't appear again. The only related mention is 'boatmen', which is a different term.

Wait, 'boatmen' is not 'boat'. So, that's just one mention of a related term, but not the word 'boat' itself. Therefore, the total count should be two. Let me double-check by listening to the entire sentence again. "Both blue boats brought bait, but the boatmen borrowed Braden's rods." Yes, the word 'boat' is used twice, and no other instances.

So, the answer should be 'two'.

SUMMARY

The word 'boat' appears twice in the sentence. The first instance is in the phrase "Both blue boats brought bait," and the second instance is in the phrase "but the boatmen borrowed Braden's rods." There are no other occurrences of the word 'boat' in the sentence.

Model Predictions:

Higgs Audio + CoT: two

Gemini Model: three

OpenAI Model: three

CoT example 2 - Sound
Audio:

Question: Where could the loud and deep purring be originating from?

Choices:
  • A zoo
  • A playground
  • A veterinary clinic
  • A forest

Correct Answer: A veterinary clinic

Chain of Thought:

PLANNING

The user wants to determine the source of a loud and deep purring sound heard in the audio clip. To answer this, I need to carefully listen to the audio and identify the characteristics of the purring sound. I will then consider the possible environments where such a sound could originate from and match it with the given choices.

CAPTIONING

After listening to the audio clip, I hear a continuous, deep, and resonant purring sound. The purring is consistent and has a low frequency, which gives it a rumbling quality. There are no other sounds present in the background, making the purring the dominant and only audible element in the clip.

REASONING

Okay, so the question is asking where the loud and deep purring sound could be coming from. The choices are a zoo, a playground, a veterinary clinic, or a forest. Let me think through this step by step.

First, I need to consider what kind of animal could produce a loud and deep purring sound. From what I know, cats are known for purring, but typically, their purring is not described as loud or deep. It's usually a softer, more consistent sound. However, big cats like lions or tigers are said to have deeper purring sounds. But are these sounds typically described as loud? Maybe in a zoo setting, where such large cats might be present, but it's not common to hear them purring loudly all the time.

Then there's the veterinary clinic. If there's a large cat, like a tiger or lion, in a clinic, especially one that deals with exotic animals, they might produce a deep purring sound, especially if they're relaxed or content. A veterinary clinic that handles big cats would be a plausible place. But would the purring be described as loud? It could be, depending on the size of the cat and the environment.

A playground is an unlikely place. There are usually children, equipment like swings and slides, and maybe some birds or squirrels, but not big cats. So that can be ruled out.

A forest... Big cats like lions and tigers are not typically found in forests. They have specific habitats like savannas or jungles. So a forest is not a likely source.

So between a zoo and a veterinary clinic, which one is more likely? Zoos do have big cats, but they are often spread out, and the sounds might not be as prominent unless you're very close. A veterinary clinic that specializes in exotic animals would have these cats in more controlled environments, possibly closer to the recording device, making the purring more audible and deeper. Also, in a clinic, the focus is on the animals, so the sound might be more noticeable.

Therefore, the most logical answer is a veterinary clinic.

SUMMARY

The loud and deep purring sound in the audio clip is most likely originating from a veterinary clinic. This is because such a sound could be produced by a large cat like a tiger or lion, which is more likely to be found in a controlled environment like a clinic that specializes in exotic animals. The purring is consistent and deep, suggesting a relaxed and content state of the animal, which is more probable in a veterinary setting where the animal is being cared for. In contrast, a zoo or a forest is less likely to produce such a prominent and deep purring sound, and a playground is an unlikely environment for such an animal to be present.

Model Predictions:

Higgs Audio + CoT: A veterinary clinic

Gemini Model: A zoo

OpenAI Model: A zoo

CoT example 3 - Music
Audio:

Question: What is the tempo of the song in the audio?

Choices:
  • 90.0 beats per minute
  • 100.0 beats per minute
  • 121.0 beats per minute
  • 140.0 beats per minute

Correct Answer: 121.0 beats per minute

Chain of Thought:

PLANNING

Okay, the user wants to determine the tempo of the song from the audio clip. The tempo is the speed or pace of the music, typically measured in beats per minute (BPM). To find this, I need to focus on the rhythmic elements of the audio. The audio is described as a 10-second segment of a song with a prominent bass guitar. Since the bass guitar is a key instrument in establishing the rhythm, I'll pay close attention to its pattern.

First, I need to identify the beat. The beat is the main pulse of the music, usually the downbeat in each measure. For a bass guitar, the downstrokes or the notes played on the strong beats are most indicative of the tempo. I'll listen for the consistent rhythmic pattern and count the beats in a 10-second segment to extrapolate to BPM.

To do this accurately, I'll start by isolating the bass guitar's sound. It's likely the most prominent instrument in this clip, so it should be relatively easy to focus on. I'll count the number of beats in the 10-second clip. Once I have that count, I can multiply it by 6 (since there are 60 seconds in a minute). For example, if I count 12 beats in 10 seconds, the tempo would be 12 * 6 = 72 BPM. But I need to make sure the beats are consistent and not just a few isolated notes.

Another thing to consider is the overall feel of the music. If it feels fast, the BPM should be higher. If it feels slow, it's lower. However, since the question provides specific options, I should stick to the count method for accuracy.

Wait, the provided options are 90, 100, 121, and 140 BPM. These are all within a reasonable range for popular music. Let me think: if the tempo is around 121 BPM, that's about 2 beats per second. So in 10 seconds, there would be 20 beats. That seems a bit fast for the options given. 140 BPM would be 2.33 beats per second, which is even faster. 90 BPM is 1.5 beats per second, which is slow. 100 BPM is 2 beats per second. So if the clip has a steady, moderate pace, 100 BPM is the most likely.

But I need to verify. Let me listen again and count carefully. If I count the beats in the 10-second segment, let's say I get 12 beats. That's 12 * 6 = 72 BPM, which is not in the options. So that's not correct. Maybe I miscounted. Let me try again. If I count 14 beats in 10 seconds, that's 14 * 6 = 84 BPM. Still not in the options. It seems I need to listen more closely.

Alternatively, maybe the tempo is faster. If I count 17 beats in 10 seconds, that's 102 BPM. Close to 100. If I count 18 beats, that's 108. Not in the options. 19 beats is 114. 20 beats is 120. So 120 BPM is closest to the options provided, which is 121 BPM. So maybe the correct answer is 121 BPM.

Alternatively, if the music has a driving, energetic feel, it might be slightly faster. But since 121 BPM is the closest to the calculated 120, I'll choose 121 BPM.

CAPTIONING

After listening to the audio clip, I noticed a prominent bass guitar playing throughout the 10-second segment. The bass guitar provides a strong, rhythmic foundation, with a consistent pattern that repeats. The notes are played in a steady, driving manner, creating a clear and recognizable beat. There are no other instruments or vocals present, allowing the bass guitar to stand out and dominate the soundscape.

REASONING

Okay, so I need to figure out the tempo of the song from the audio clip. The tempo is measured in beats per minute (BPM), and I have to count the beats in a 10-second segment to calculate it. The provided options are 90, 100, 121, and 140 BPM.

First, I'll focus on the bass guitar, which is the most prominent instrument. It has a strong, rhythmic pattern that repeats. I need to identify the beats. The bass guitar plays a consistent sequence, and each note in the sequence is a beat. Let me start counting the beats in the 10-second clip.

Starting from the beginning, I count: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20. That's 20 beats in 10 seconds. To find BPM, I multiply 20 by 6 (since 60 seconds divided by 10 seconds is 6). 20 * 6 = 120 BPM. However, 120 is not one of the options. The closest option is 121 BPM.

Alternatively, maybe I miscounted. Let me listen again more carefully. This time, I'll count more precisely. Starting again from the beginning of the clip, I count: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20. That's still 20 beats. So 20 * 6 = 120. The options given are 90, 100, 121, 140. Since 120 is not listed, but 121 is close, maybe the correct answer is 121 BPM.

Alternatively, perhaps the tempo is slightly faster. If I count 17 beats in 10 seconds, that's 102 BPM. If I count 18, it's 108. 19 beats would be 114. 20 beats is 120. 121 is the closest to 120. So the most likely answer is 121 BPM.

Wait, but the provided options include 121. So even though my initial count was 20 beats, which would be 120, the correct answer from the options is 121. Maybe there's a slight variation in the beat timing. Alternatively, the music might have a slight acceleration or deceleration, but since the question is a multiple-choice, and 121 is the closest, I should select that.

But let me verify once more. If the tempo is 121 BPM, that's approximately 2.0167 beats per second. In 10 seconds, that's 20.167 beats. So counting 20 beats would be very close to the actual tempo. Given the options, 121 BPM is the best fit.

Therefore, the correct answer is 121 BPM.

SUMMARY

The tempo of the song in the audio is 121 BPM. After carefully listening to the prominent bass guitar, I counted 20 beats in a 10-second segment, which extrapolates to 120 BPM. Since 121 BPM is the closest option provided, it is the most accurate answer.

Model Predictions:

Higgs Audio + CoT: 121.0 beats per minute

Gemini Model: 140.0 beats per minute

OpenAI Model: 121.0 beats per minute

Try Higgs Audio

To explore more about Higgs Audio, feel free to experiment with the generation playground or engage with the live voice chat demo. If you’re interested in integrating advanced speech recognition, natural voice synthesis, or both into your applications, don’t hesitate to reach out to our sales team.


Acknowledgments

Lead: Alex Smola, Mu Li, Xingjian Shi

Model – Understanding: Jielin Qiu, Dongming Shen, Silin Meng, Rand Xie

Model – Generation: Xingjian Shi, Martin Ma, Ke Bai, Ruskin Raj Manku

Audio Tokenizer: Martin Ma, Ke Bai, Xingjian Shi

Evaluation: Ruskin Raj Manku, Jaewon Lee

Data - Pretrain: Mu Li, Ke Bai, Jaewon Lee, Geeyang Tay, Yizhi Liu, Yi Zhu

Data - Synthetic: Dongming Shen, Silin Meng

Distributed Training: Shuai Zheng, Sergii Tiugaiev

Serving: Zach Zheng

Playground: Yizhi Liu, Rand Xie

We would like to thank our customers for their constructive feedback and the excellent technical support from our friends at NVIDIA, Arc Compute, eStruxture, Crusoe, AWS and Scaleway.