Harvard Releases 1 Million Historic Books to Boost AI Training, Supported by Microsoft and OpenAI

June 12, 2025
  • Burton Davis of Microsoft emphasized that using public domain data is less controversial than using copyrighted material, a practice that has led to legal disputes.

  • OpenAI has committed $50 million to support digitization efforts at research institutions, ensuring that the resulting content remains publicly accessible.

  • Libraries, such as the Boston Public Library, are collaborating with companies like OpenAI to digitize public domain materials while maintaining public access to these resources.

  • While the dataset offers valuable historical insights, it also contains outdated and potentially harmful content, prompting discussions on responsible AI usage.

  • The initiative seeks to shift some power back to libraries, which have traditionally been stewards of information, according to Aristana Scourtas from Harvard's Library Innovation Lab.

  • The digitization process is costly and labor-intensive, but it aligns with libraries' missions to preserve and share knowledge.

  • The new dataset is linguistically diverse, with fewer than half of its volumes in English, though its historic texts also raise concerns about outdated or harmful content.

  • On June 12, 2025, Harvard University announced the release of nearly one million public domain books, dating back to the 15th century, to assist AI researchers and enhance chatbot learning.

  • This collection, dubbed Institutional Books 1.0, includes over 394 million scanned pages in 254 languages, providing a richer cultural and historical context for AI training.

  • Supported by unrestricted donations from Microsoft and OpenAI, this initiative aims to digitize historic library collections, benefiting both libraries and the communities they serve.

  • Tech companies are increasingly looking to libraries for data to train AI chatbots, moving beyond the internet to access historic collections of books and documents.

  • The Harvard dataset is expected to improve the accuracy and reliability of AI systems by drawing on original texts rather than secondary sources like Wikipedia.

Summary based on 8 sources

