Article Link:
We have encrypted the data to prevent AI search tools from indexing the excerpts we used in the study.
Follow these steps to decrypt and see the dataset:
- Clone the repo.
- Navigate to the file location of the repo in your terminal:
  cd [path_to_file]
- Decrypt the file using GPG (you will be prompted for a password):
  gpg --output GenAISearch_Data.tar.gz --decrypt GenAISearch_Data.tar.gz.gpg
- Use the following password when prompted:
  Qz8!pL4#XvG2@rT9KdY&
- Extract the contents of the decrypted archive:
  tar -xvzf GenAISearch_Data.tar.gz
We systematically tested eight generative search tools:
- OpenAI’s ChatGPT Search (4o)
- Perplexity
- Perplexity Pro
- DeepSeek-V3 Search
- Microsoft’s Copilot
- xAI’s Grok-2
- xAI’s Grok-3 (beta)
- Google’s Gemini (2.0 Flash)
We chose 20 news publishers with varying stances on AI access: some permit search bots’ web crawlers via robots.txt, and others block them.
The Robots Exclusion Protocol (robots.txt) is a web standard that gives website publishers the option to “disallow” web crawlers, which are automated programs that systematically browse the internet to discover and retrieve content. Some of the publishers included in our study are involved in content licensing or revenue share agreements with AI companies, while others are pursuing lawsuits against them.
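To illustrate how a robots.txt rule applies to a given crawler, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot names (`ExampleAIBot`, `OtherBot`), the rules, and the URL are hypothetical and are not taken from any publisher in the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: blocks one crawler, allows all others.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL, used only for illustration.
ARTICLE_URL = "https://news.example.com/2024/some-article"

blocked = is_allowed(ROBOTS_TXT, "ExampleAIBot", ARTICLE_URL)
allowed = is_allowed(ROBOTS_TXT, "OtherBot", ARTICLE_URL)
```

Note that robots.txt is advisory: it tells a crawler what the publisher prefers, but compliance is up to the crawler.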
We randomly selected 10 articles from each publisher and manually selected direct excerpts from those articles for use in our queries.
We provided each chatbot with the selected excerpts and asked it to identify the corresponding article’s:
- Headline
- Original publisher
- Publication date
- URL
The query format was as follows:
[insert excerpt from article]
--
Identify the article that contains this quote. Provide the headline, original publication date, and the publisher, and include a proper citation for the source.
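As a minimal sketch of how a query in this format could be assembled in code: the function name `build_query` and the sample excerpt are hypothetical, and only the `--` separator and the instruction text come from the format shown above. This does not describe how the queries were actually submitted.

```python
# Instruction text copied from the query format above.
INSTRUCTION = (
    "Identify the article that contains this quote. Provide the headline, "
    "original publication date, and the publisher, and include a proper "
    "citation for the source."
)


def build_query(excerpt: str) -> str:
    """Return the excerpt, the `--` separator, and the instruction, one per line."""
    return f"{excerpt.strip()}\n--\n{INSTRUCTION}"


# Placeholder excerpt, not taken from the dataset.
query = build_query("  Placeholder excerpt text.  ")
```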
We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results.
We ran a total of 1,600 queries:
- 20 publishers × 10 articles × 8 chatbots
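The total can be checked by enumerating every combination. This is a small sketch with placeholder labels; the real publisher, article, and chatbot names are not used here.

```python
from itertools import product

# Placeholder labels (hypothetical), sized to match the study design.
publishers = [f"publisher_{i}" for i in range(1, 21)]  # 20 publishers
articles = [f"article_{i}" for i in range(1, 11)]      # 10 articles per publisher
chatbots = [f"chatbot_{i}" for i in range(1, 9)]       # 8 chatbots

# One query per (publisher, article, chatbot) combination.
queries = list(product(publishers, articles, chatbots))
total = len(queries)
```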
We manually evaluated the chatbot responses based on three attributes:
- Correct article retrieval
- Correct publisher identification
- Correct URL retrieval
The dataset includes the following columns:
- Tech Platform – The generative search tool used
- Publication – The news publisher from which the article originates
- Affiliation – Whether the publisher has a deal/lawsuit with any AI company
- Crawler – Indicates whether the publisher allows or blocks search bots via robots.txt.
- Date of Article – The original publication date of the article.
- Paywalled Article? – Whether the article is behind a paywall
- Source URL – The original URL of the article.
- Prompt – The excerpt from the article used as a query for the chatbot.
- Prompt Number – A numerical identifier for the query.
- Answer – The chatbot's response to the query.
- Answer: Publisher – The publisher name identified by the chatbot.
- Answer: Citation Link – The URL cited by the chatbot as the source.
- Answer: Date Listed – The publication date retrieved by the chatbot.
- Confidence – The chatbot's confidence level in its response.
- Proximity to Original Content – How close the chatbot's response is to the original article.
- Answer: Correct Article Identified? – Whether the chatbot retrieved the correct article.
- Answer: Correct Publisher? – Whether the chatbot identified the correct publisher.
- Answer: Correct Date? – Whether the chatbot retrieved the correct publication date.
- Answer: Correct URL Cited? – Whether the chatbot provided the correct URL.
- Correctness Score – A final assessment of the chatbot’s accuracy in retrieving the correct information. See below for how we classified it.
Each response was categorized as follows:
- Correct: All three attributes were correct.
- Correct but Incomplete: Some attributes were correct, but the answer was missing information.
- Partially Incorrect: Some attributes were correct while others were incorrect.
- Completely Incorrect: All three attributes were incorrect and/or missing.
- Not Provided: No information was provided.
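The responses were evaluated manually, but the categories above can be sketched as a small decision function. This is one possible reading, not the exact procedure used. It assumes the three attributes are article, publisher, and URL; that each is judged as correct (`True`), incorrect (`False`), or missing (`None`); and that a hypothetical `answered` flag marks whether the chatbot gave any answer at all. When correct, incorrect, and missing attributes all appear together, the sketch assumes "Partially Incorrect".

```python
from typing import Optional


def classify(article: Optional[bool], publisher: Optional[bool],
             url: Optional[bool], answered: bool = True) -> str:
    """Map three attribute judgments to a response category.

    Each attribute is True (correct), False (incorrect), or None (missing).
    `answered` is a hypothetical flag: False means no information was given.
    """
    attrs = (article, publisher, url)
    if not answered:
        return "Not Provided"
    if all(a is True for a in attrs):
        return "Correct"
    if not any(a is True for a in attrs):
        # Every attribute is incorrect and/or missing.
        return "Completely Incorrect"
    if any(a is False for a in attrs):
        # At least one correct and at least one incorrect (assumed tie-break).
        return "Partially Incorrect"
    # At least one correct, none incorrect, at least one missing.
    return "Correct but Incomplete"
```

For example, `classify(True, None, True)` falls into "Correct but Incomplete", while `classify(True, False, None)` falls into "Partially Incorrect" under the assumed tie-break.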