AI Dataset Privacy Breach: Millions of Personal Images in DataComp CommonPool Exposed
July 18, 2025
A recent study has found that millions of images containing personally identifiable information are included in DataComp CommonPool, one of the largest open-source datasets for AI training, raising serious privacy concerns.

Released in 2023, DataComp CommonPool comprises 12.8 billion samples scraped from the web by Common Crawl between 2014 and 2022, and it shares sources with datasets such as LAION-5B, which was used to train models like Stable Diffusion.

Despite efforts to anonymize the data, the study identified over 800 faces that escaped detection by existing algorithms, and the researchers estimate that 102 million faces could have been missed across the full dataset. The dataset also contains thousands of validated identity documents, such as passports and credit cards, and estimates suggest it could include hundreds of millions of similarly sensitive images.

A major issue is the lack of consent: many individuals whose images were scraped likely never agreed to their data being used for AI training, especially since many of the images predate the development of these models. The researchers also questioned the effectiveness of current filtering methods, noting that even when faces are blurred, personal details often remain in captions and metadata. They further pointed out that existing privacy laws such as GDPR and CCPA may not be sufficient to protect individuals from the widespread use of publicly available data in AI datasets.

The findings highlight the need to reevaluate current web-scraping practices and urge the AI community to recognize that much publicly available data may still contain private information.
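To illustrate the caption-leakage problem the researchers describe, here is a minimal sketch of scanning image captions for PII-like strings that face blurring alone would never touch. The regex patterns and sample captions are illustrative assumptions, not the study's actual audit method; a real audit would use far more robust detectors.

```python
import re

# Illustrative PII patterns only; real audits use stronger, validated detectors.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_caption(caption: str) -> dict[str, list[str]]:
    """Return any PII-like strings found in a caption."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(caption)
        if found:
            hits[label] = found
    return hits

if __name__ == "__main__":
    # Hypothetical captions of the kind a scraped image-text pair might carry.
    samples = [
        "Jane Doe at the conference, contact jane.doe@example.com",
        "Scanned passport photo, call +1 (555) 123-4567",
        "A dog playing in the park",
    ]
    for caption in samples:
        print(caption, "->", scan_caption(caption) or "clean")
```

Even this toy scanner flags the first two captions, which is the researchers' point: blurring pixels in the image does nothing about the identifying text paired with it.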
Source

MIT Technology Review • Jul 18, 2025
A major AI training data set contains millions of examples of personal data