AI scraping is unintentionally hurting the Wayback Machine

Most internet users have probably encountered a dead link at some point. A news article disappears, a company removes an old announcement, or a website shuts down entirely. In many cases, the Wayback Machine is where people go to find what was originally there.

Operated by the Internet Archive, the Wayback Machine captures snapshots of webpages and stores them for future reference. Users can enter a URL and view older versions of websites, sometimes dating back decades. The archive has become a common tool for journalists verifying edits to articles, researchers tracking policy changes, and investigators documenting statements that were later removed from public view.

Several major publishers have started blocking the Wayback Machine from archiving their content. The move comes as news organizations look for ways to prevent AI companies from scraping articles, images, and other content for use in training artificial intelligence models. The Internet Archive has already started a petition to help keep the Wayback Machine alive and to support its effort to build a preserved memory of the internet.


Producing journalism requires significant investment in reporting, editing, photography, and distribution. At the same time, AI developers have built models using vast amounts of publicly available online content. News organizations increasingly want more control over how that content is accessed and reused.


Many of the technical tools used to block AI crawlers can also affect web archiving services. Websites often rely on robots.txt files and other access controls to determine which automated systems can access their content. As those restrictions expand, the Wayback Machine can lose access alongside AI bots.
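The overlap described above can be illustrated with a short sketch using Python's standard `urllib.robotparser`. It compares a blanket robots.txt rule with one that names a single AI crawler. The two robots.txt files, the example URL, and the choice of `GPTBot` and `archive.org_bot` as user-agent strings are assumptions made for illustration, not rules taken from any publisher, and the sketch only applies to crawlers that honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical article URL, used only for illustration.
EXAMPLE_URL = "https://example.com/news/story.html"

# Assumed policy 1: block every automated crawler.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Assumed policy 2: block one named AI crawler, allow everything else.
TARGETED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative user-agent strings: one AI crawler, one archiving crawler.
AGENTS = ["GPTBot", "archive.org_bot"]


def allowed(robots_txt: str, user_agent: str, url: str = EXAMPLE_URL) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def report(robots_txt: str) -> dict:
    """Map each illustrative agent to whether it may fetch the example URL."""
    return {agent: allowed(robots_txt, agent) for agent in AGENTS}


if __name__ == "__main__":
    print("blanket: ", report(BLANKET))
    print("targeted:", report(TARGETED))
```

Under the blanket rule both agents are refused, which is how an archiving crawler can lose access alongside AI bots. Under the targeted rule only the named AI crawler is refused. Whether a targeted rule is practical depends on crawlers identifying themselves accurately, which this sketch does not address.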

The two activities serve different purposes. AI companies collect information to train models and generate new outputs, while the Internet Archive preserves copies of webpages as they existed at a specific point in time.

The archive has played a role in everything from investigative reporting to academic research. Archived webpages have been used to verify changes to political statements, track revisions to government guidance, document corporate announcements, and preserve local news coverage after publications closed. When websites shut down or remove older content, archived copies are often the only publicly accessible record that remains.


Much of modern public life now exists primarily online. Government agencies publish reports on websites. Companies announce policy changes through blog posts. Product launches, public statements, election materials, and regulatory updates are frequently distributed through webpages that can later be edited or removed.

The Internet Archive says the Wayback Machine contains hundreds of billions of archived webpages collected over nearly three decades. No other public archive operates at a comparable scale.

Publishers have legitimate concerns about AI companies using their work without compensation or permission. The challenge is determining how to limit AI scraping without also restricting services that preserve the public web. As more websites tighten access to automated crawlers, the debate is expanding beyond copyright and AI training data. It now includes a separate question: who is responsible for preserving a record of the internet itself if access to free, reliable information is being restricted?


View original source — Philippine Daily Inquirer ↗
