SAMPLED
Tech

The Atlantic just named the four datasets training the AI music boom

Millions of copyrighted recordings — Taylor Swift, Bad Bunny, the whole canon — are circulating among AI developers in four specific datasets. Here is what that means for the lawsuits already in motion.

By the Sampled desk
Illustration by Sampled.

On Sunday, a four-part investigation named the datasets quietly powering most of the AI music generation industry. They are not obscure. They are not legally licensed. And they contain millions of recordings from artists who have never been asked, paid, or even told.

By Monday the findings were everywhere, landing at the worst possible moment for Suno and Udio and the best possible moment for the labels suing them.

[Image: vinyl record dissolving into binary code and data points]
The Atlantic's reporting puts names on the datasets quietly fueling generative music. Illustration: Sampled.

What the datasets actually are

The Atlantic identified four specific corpora being shared among AI developers, each holding millions of tracks scraped or aggregated without licensing. Taken together, they cover most of the modern recorded canon — pop, hip-hop, country, Latin, K-pop — including, by name, Taylor Swift and Bad Bunny, two of the most commercially active artists on Earth.

These are not "internet-scraped audio of unclear provenance," as defendants have characterized their training data in court filings. These are organized, labeled, redistributable datasets that name the artist, the track, and in many cases the rights holder. The defense argument that AI companies cannot reasonably know what was in their training data gets harder to make when there is a manifest.

Why this lands now

Three weeks ago, UMG and Sony asked a federal court for permission to add more than 61,000 sound recordings to their copyright suit against Suno. The motion is public and cites discovery showing Suno trained on "millions" of their recordings. Sony filed a parallel motion against Udio, adding more than 30,000 recordings.

Last week, a federal judge vacated the order sealing Udio's "confidential" data in the Sony case. So in the span of one month: the discovery record opened, the recording counts in active lawsuits ballooned, and now a major-magazine investigation has put names on the source corpora. The detection layer is no longer the problem: Pex's AI Song Detector and its work on ACR (automatic content recognition) for AI training data show that audio identification has already moved into this lane. The legal layer is what just changed.

The other shoe

Suno and Udio have spent the last six months signaling they want to license, not litigate, joining the industry rather than fighting it from the outside. The Atlantic's reporting makes that pivot more expensive. Once the datasets are named and the trained-on counts are documented, any settlement number gets bigger and any safe-harbor argument gets weaker.

The interesting question is not whether Suno and Udio settle. It is what happens to every other generative-audio startup that quietly pulled from the same four datasets and assumed nobody would ever produce a list.

The list now exists.