What is GPTBot? OpenAI's New Web Crawler Explained


What is GPTBot?

GPTBot is a web crawler released by OpenAI in August 2023. Its primary purpose is to crawl websites and gather content for training OpenAI's proprietary large language models (LLMs), such as GPT-4 and GPT-5.

What is GPTBot Used for?

GPTBot gives OpenAI more data with which to train its proprietary AI systems. This means that when a user enters a prompt into ChatGPT or another tool, the chatbot should be able to respond with more relevant information.

It is essentially a new take on the traditional web crawler, which scans each page of a website in order to index sites across the web. The more data OpenAI collects, the more signals it can train its AI models on, which should improve their accuracy over time.

Website owners can opt in or out, which lets each organization choose whether to contribute its data to improving OpenAI's proprietary models.
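In practice, the opt-out is expressed as a robots.txt rule addressed to the GPTBot user agent. The sketch below is a minimal illustration, not an official tool: it uses Python's standard urllib.robotparser to check how a given robots.txt would treat GPTBot. The robots.txt content, the example.com URL, and the is_allowed helper are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it blocks GPTBot site-wide
# and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a helper name invented for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

A site that wants to opt in only partially could replace the blanket Disallow rule with Allow and Disallow lines for specific directories. Note that robots.txt is a request that a well-behaved crawler honors, not an access control mechanism.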

Early Controversy

Shortly after GPTBot's release, OpenAI drew heavy criticism for scraping publicly available data to train its own AI systems. This prompted numerous content providers, including Disney, Bloomberg, CNN, The New York Times, Reuters, The Washington Post, The Atlantic, Axios, Insider, ABC News, ESPN, and Vox Media, to block the crawler from accessing their websites entirely.

At the heart of these concerns is whether it is ethical and legal for OpenAI to use GPTBot to scrape intellectual property and copyrighted material from websites in order to develop its own AI products.

While OpenAI has attempted to allay these concerns by letting organizations block the crawler, there is no transparency about how the data obtained from sites that permit GPTBot access will be used.