What is GPTBot?
GPTBot is a web crawler released by OpenAI in August 2023. Its primary purpose is to crawl websites and gather content for training OpenAI's proprietary large language models (LLMs), such as GPT-4 and GPT-5.
According to OpenAI’s GPTBot page:
“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
Website owners can opt out of GPTBot by adding a rule to their robots.txt file that disallows it. Organizations that want to allow partial access can customize the robots.txt file to specify which directories GPTBot may and may not scrape.
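The robots.txt rules described above can be sketched and checked locally. The example below is a minimal illustration, not OpenAI's tooling: it assumes the crawler is matched by the user-agent token GPTBot, and the directory names (/blog/, /private/) and the example.com URLs are placeholders. It uses Python's standard urllib.robotparser module to show how a rule set would be interpreted by a crawler that honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot from the whole site.
FULL_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical robots.txt that grants partial access.
# The directory names are placeholders, not taken from any real site.
PARTIAL_ACCESS = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for path in ("/blog/post", "/private/report"):
        url = "https://example.com" + path
        print(path, "full block:", gptbot_allowed(FULL_BLOCK, url))
        print(path, "partial access:", gptbot_allowed(PARTIAL_ACCESS, url))
```

Checking rules this way before publishing them can catch a mistyped path, but it only shows how a compliant parser reads the file. Whether a given crawler actually follows the rules is up to the crawler.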
What is GPTBot Used for?
The data GPTBot gathers gives OpenAI more material for training its proprietary AI systems. The aim is that when a user enters a prompt into ChatGPT or another tool, the chatbot can respond with more relevant information.
It is essentially a new version of a traditional web crawler, which scans each webpage on a website to index sites across the web. The more data OpenAI collects, the more signals it has to train its AI models on, which can increase their accuracy over time.
The ability to opt in or out lets organizations choose whether to contribute their data to improving OpenAI’s proprietary models.
Shortly after GPTBot’s release, OpenAI drew heavy criticism for scraping publicly available data to train its own AI systems. This prompted numerous content providers, including Disney, Bloomberg, CNN, The New York Times, Reuters, The Washington Post, The Atlantic, Axios, Insider, ABC News, ESPN, and Vox Media, to block the crawler from accessing their websites entirely.
At the heart of these concerns is whether it’s ethical and legal for OpenAI to use GPTBot to scrape intellectual property and copyrighted materials from websites in order to develop its own AI products.
While OpenAI has attempted to allay these concerns by letting organizations block the crawler, there’s no transparency regarding how the data obtained from sites that permit GPTBot access will be used.