AI Dataset Privacy Breach: Millions of Personal Images in DataComp CommonPool Exposed

July 18, 2025
  • A recent study has uncovered that millions of images containing personally identifiable information are included in DataComp CommonPool, one of the largest open-source datasets for AI training, raising serious privacy concerns.

  • Released in 2023, DataComp CommonPool comprises 12.8 billion samples drawn from Common Crawl web scrapes conducted between 2014 and 2022, and shares sources with datasets such as LAION-5B, which was used to train models like Stable Diffusion.

  • Despite efforts to anonymize the data, the study identified over 800 faces that escaped detection by existing algorithms, and estimated that as many as 102 million faces across the full dataset could have been missed.

  • The dataset also contains thousands of validated images of sensitive documents such as passports and credit cards, with estimates suggesting it could include hundreds of millions of similar images.

  • The researchers questioned the effectiveness of current filtering methods, noting that even when faces are blurred, personal details often remain in captions and metadata.

  • A further concern is the lack of consent: many individuals whose images were scraped likely never agreed to their data being used for AI training, especially since many of the images predate the development of these models.

  • The study also pointed out that existing privacy laws such as the GDPR and CCPA may not be sufficient to protect individuals from the widespread use of publicly available data in AI datasets.

  • The findings highlight the need to reevaluate current web-scraping practices and urge the AI community to recognize that much publicly available data may still contain private information.

Summary based on 1 source


