Eerily realistic AI voice demo sparks surprise and discomfort online

    An example argument with Sesame's CSM made by Gavin Purcell.

    Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit in which the human pretends to be an embezzler and argues with a boss. The exchange is so dynamic that it is difficult to tell which speaker is the human and which is the AI model. Judging by our own demo, the model is fully capable of what you see in the video.

    “Near-human quality”

    Under the hood, Sesame's CSM achieves its realism by using two AI models that work together (a backbone and a decoder), based on Meta's Llama architecture, to process interleaved text and audio. Sesame trained three AI model sizes, with the largest using 8.3 billion parameters (an 8 billion parameter backbone model plus a 300 million parameter decoder) on about 1 million hours of mainly English audio.

    Sesame's CSM does not follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame's CSM integrates both into a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI's voice model uses a similar multimodal approach.
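
    To make the idea of an interleaved sequence concrete, here is a minimal Python sketch of how text and audio tokens from several conversational turns could be flattened into one sequence for a single model to consume. The `Token` class, the `interleave_turns` function, the token IDs, and the "text first, then audio, per turn" layout are all assumptions made for illustration; the article does not describe Sesame's actual token layout or tokenizers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """A single token tagged with its modality (hypothetical structure)."""
    modality: str  # "text" or "audio"
    value: int     # placeholder token ID, not a real vocabulary entry


def interleave_turns(turns: List[Tuple[List[int], List[int]]]) -> List[Token]:
    """Flatten (text_ids, audio_ids) turns into one mixed-modality sequence.

    Assumed layout: within each turn, text tokens come first,
    followed by that turn's audio tokens.
    """
    sequence: List[Token] = []
    for text_ids, audio_ids in turns:
        sequence.extend(Token("text", t) for t in text_ids)
        sequence.extend(Token("audio", a) for a in audio_ids)
    return sequence


# Two made-up turns: token IDs are arbitrary placeholders.
turns = [([1, 2], [101, 102, 103]), ([3], [104, 105])]
sequence = interleave_turns(turns)
modalities = [tok.modality for tok in sequence]
```

    In a two-stage system, the text-derived semantic tokens and the acoustic tokens would be handled by separate passes; in this sketch, a single sequence carries both, which is the property the paragraph above attributes to the single-stage design.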

    In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting that the model achieves near-human quality for isolated speech samples. However, when given conversational context, evaluators still consistently preferred real human speech, indicating that a gap remains in fully contextual speech generation.
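
    As a rough illustration of what "no clear preference" can mean in a blind A/B comparison, the Python sketch below computes a win rate and an exact two-sided binomial p-value against a 50/50 split. The function names and the vote counts are hypothetical and are not Sesame's data or evaluation method; the article does not report the underlying numbers.

```python
from math import comb


def win_rate(votes_for_model: int, total_votes: int) -> float:
    """Fraction of comparisons in which evaluators picked the model's audio."""
    return votes_for_model / total_votes


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5.

    Sums the probability of every outcome at least as far from n/2 as k.
    """
    distance = abs(k - n / 2)
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return extreme / 2 ** n


# Hypothetical counts, for illustration only.
no_context_p = two_sided_binomial_p(52, 100)    # close to an even split
with_context_p = two_sided_binomial_p(30, 100)  # evaluators favor the human
```

    Under these made-up counts, a 52-of-100 split is statistically indistinguishable from chance, while a 30-of-100 split is not, which mirrors the contrast between the two test conditions described above.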

    Sesame co-founder Brendan Iribe acknowledged the current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has problems with interruptions, timing, and conversation flow. “Today we are firmly in the valley, but we are optimistic that we can climb out,” he wrote.