Harvard Releases 1 Million Historic Books to Boost AI Training, Supported by Microsoft and OpenAI
June 12, 2025
Burton Davis from Microsoft emphasized that using public domain data is less controversial than using copyrighted material, the use of which has led to legal disputes.
OpenAI has committed $50 million to support digitization efforts at research institutions, ensuring that the resulting content remains publicly accessible.
Libraries, such as the Boston Public Library, are collaborating with companies like OpenAI to digitize public domain materials while maintaining public access to these resources.
While the dataset offers valuable historical insights, it also contains outdated and potentially harmful content, prompting discussions on responsible AI usage.
The initiative seeks to shift some power back to libraries, which have traditionally been stewards of information, according to Aristana Scourtas from Harvard's Library Innovation Lab.
The digitization process is costly and labor-intensive, but it aligns with libraries' missions to preserve and share knowledge.
The new dataset is notably diverse linguistically, with fewer than half of its volumes in English.
On June 12, 2025, Harvard University announced the release of nearly one million public domain books, dating back to the 15th century, to assist AI researchers and enhance chatbot learning.
This collection, dubbed Institutional Books 1.0, includes over 394 million scanned pages in 254 languages, providing a richer cultural and historical context for AI training.
Supported by unrestricted donations from Microsoft and OpenAI, this initiative aims to digitize historic library collections, benefiting both libraries and the communities they serve.
Tech companies are increasingly looking to libraries for data to train AI chatbots, moving beyond the internet to access historic collections of books and documents.
The datasets from Harvard are expected to improve the accuracy and reliability of AI systems by utilizing original texts rather than secondary sources like Wikipedia.
Summary based on 8 sources
Sources

The Washington Post • Jun 12, 2025
AI chatbots need more books to learn from. These libraries are opening their stacks
ABC News • Jun 12, 2025
AI chatbots need more books to learn from. These libraries are opening their stacks
Economic Times • Jun 12, 2025
AI chatbots need more books to learn from; These libraries are opening their stacks