The massive open-source AI dataset LAION-5B has come under scrutiny for its inclusion of child sexual abuse material, according to a new report published by the Stanford Internet Observatory. The dataset, which has been used to train popular AI text-to-image generators, contains at least 1,008 instances of such material, with additional instances suspected. The report warns that AI products built on this data could generate new and potentially realistic child abuse content. LAION, the organization behind the dataset, has responded by temporarily taking its datasets offline to verify their safety before republishing them.
LAION’s image datasets have faced criticism before. In a paper published in October 2021, cognitive scientist Abeba Birhane highlighted problematic and explicit content in an earlier LAION image dataset, LAION-400M. That dataset contained image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other highly problematic content.
Furthermore, private medical photographs, taken by a doctor in 2013, have been found to be referenced in the LAION-5B dataset. The artist Lapine discovered these photos through the Have I Been Trained website, which lets individuals search popular AI training datasets for their own work.
A class-action lawsuit, Andersen et al. v. Stability AI LTD et al., was filed by visual artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt. Although LAION was not directly sued, it was named in the complaint. The suit alleges that Stability AI downloaded or acquired copies of billions of copyrighted images from the internet without permission, including through LAION’s dataset.
LAION-5B has also raised concerns about privacy and intellectual property rights. Speaking at a virtual panel organized by the FTC, Karla Ortiz, an award-winning artist who has worked for prominent film studios, described the dataset’s controversial contents, which include private medical records, non-consensual pornography, images of children, and even social media pictures of individuals’ faces.