Nvidia's new AI audio model can synthesize sounds that never existed

    At this point, anyone who follows AI research is long familiar with generative models that can synthesize speech or melodic music from nothing more than a text prompt. Nvidia's recently unveiled “Fugatto” model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices and sounds,” including the synthesis of sounds that never existed.

    While Fugatto isn't yet available for public testing, an example-filled website shows how it can be used to dial a number of different audio attributes and descriptions up or down, resulting in everything from the sound of saxophones barking to people talking underwater to ambulance sirens singing in a kind of choir. While the results on display can be a bit hit-or-miss, the sheer range of capabilities supports Nvidia's description of Fugatto as “a Swiss army knife for sound.”

    You are only as good as your data

    In an explanatory research paper, more than a dozen Nvidia researchers explain the difficulty of assembling a training dataset that can “reveal meaningful relationships between audio and language.” While standard language models can often infer how to handle various instructions from the text-based data itself, it can be difficult to generalize descriptions and features from audio without more explicit guidance.

    To this end, the researchers begin by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio personas (e.g., “standard, young audience, 30-somethings, professional”). They then generate a series of both absolute (e.g., “synthesize a happy voice”) and relative (e.g., “increase the happiness of this voice”) instructions that can be applied to those personas.
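
    To make that instruction-generation step concrete, here is a minimal sketch of how such a script might expand persona descriptions into absolute and relative instructions. It is not Nvidia's actual code; the persona list, attribute pairs, and templates are illustrative assumptions built from the examples above.

    import itertools
    import random

    # Persona descriptions quoted in the article; attribute pairs give an
    # adjective form (for absolute instructions) and a noun form (for relative
    # ones). Both lists are assumptions for this sketch.
    PERSONAS = ["standard", "young audience", "30-somethings", "professional"]
    ATTRIBUTES = [("happy", "happiness"), ("angry", "anger"), ("calm", "calmness")]

    def generate_instructions(seed: int = 0) -> list[str]:
        """Expand each persona/attribute pair into one absolute and one relative instruction."""
        rng = random.Random(seed)
        instructions = []
        for persona, (adjective, noun) in itertools.product(PERSONAS, ATTRIBUTES):
            # Absolute instruction, e.g. "synthesize a happy voice"
            instructions.append(f"synthesize a {adjective} voice in a {persona} style")
            # Relative instruction, e.g. "increase the happiness of this voice"
            instructions.append(f"increase the {noun} of this {persona} voice")
        rng.shuffle(instructions)  # interleave templates so no single phrasing dominates
        return instructions

    if __name__ == "__main__":
        for line in generate_instructions()[:6]:
            print(line)

    In practice the free-form variants would come from the LLM itself rather than fixed templates; the point of the sketch is only to show how a small set of personas and attributes fans out into a large instruction set.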

    The wide range of open source audio datasets used as a basis for Fugatto generally doesn't come with these kinds of property measurements built in. So the researchers use existing audio understanding models to create “synthetic captions” for their training clips, generating natural language descriptions that automatically quantify properties such as gender, emotion, and speech quality. Audio processing tools are also used to describe and quantify training clips at a more acoustic level (e.g., “fundamental frequency variance” or “reverb”).
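
    As a rough illustration of that last step, the sketch below uses the open source librosa library to measure a few acoustic properties of a clip, including the fundamental frequency variance the paper mentions, and to turn them into a caption-style fragment. The thresholds, wording, and file path are assumptions, not details taken from Nvidia's pipeline.

    import librosa
    import numpy as np

    def describe_clip(path: str) -> dict:
        """Quantify a small, illustrative subset of acoustic properties for one clip."""
        y, sr = librosa.load(path, sr=16000, mono=True)

        # Fundamental frequency track via the pYIN estimator; unvoiced frames are NaN.
        f0, voiced_flag, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
        )

        # Frame-level energy as a rough proxy for loudness dynamics.
        rms = librosa.feature.rms(y=y)[0]

        return {
            "f0_mean_hz": float(np.nanmean(f0)),
            "f0_variance": float(np.nanvar(f0)),  # the "fundamental frequency variance" descriptor
            "voiced_ratio": float(np.mean(voiced_flag)),
            "rms_mean": float(np.mean(rms)),
        }

    def caption_from_stats(stats: dict) -> str:
        """Turn the raw numbers into a natural-language fragment a synthetic caption could use."""
        pitch = "highly varied pitch" if stats["f0_variance"] > 2000 else "steady pitch"
        energy = "an energetic delivery" if stats["rms_mean"] > 0.05 else "a quiet delivery"
        return f"a voice with {pitch} and {energy}"

    if __name__ == "__main__":
        stats = describe_clip("example_clip.wav")  # hypothetical file path
        print(stats)
        print(caption_from_stats(stats))

    A full captioning pipeline would fold many more descriptors (reverb, emotion, speaker traits) into richer sentences, but the same pattern applies: measure a property numerically, then translate it into language the model can be instructed with.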