Illustration by Alex Castro / The Verge
More than 170,000 YouTube videos are part of a massive dataset that was used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News that was copublished with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech firms that used the “YouTube Subtitles” data, which was ripped from the video platform without permission. The training dataset is a collection of subtitles taken from YouTube videos belonging to more than 48,000 channels; it does not include imagery from the videos.
Videos from popular creators like MrBeast and Marques Brownlee appear in the dataset, as do clips from news outlets like ABC News, the BBC, and The New York Times. More than 100 videos from The Verge a…