Methodology

Article Link:

How to Access the Data File

We have encrypted the data to prevent AI search tools from indexing the excerpts we used in the study.

Follow these steps to decrypt and see the dataset:

Clone the repo
Navigate to the file location of the repo in your terminal:
```
cd [path_to_file]
```

Decrypt the file using GPG (you will be prompted for a password):

gpg --output GenAISearch_Data.tar.gz --decrypt GenAISearch_Data.tar.gz.gpg

Use the following password when prompted:
```
Qz8!pL4#XvG2@rT9KdY&
```
Extract the contents of the decrypted archive:
```
tar -xvzf GenAISearch_Data.tar.gz
```

Methodology

We systematically tested eight generative search tools:

OpenAI’s ChatGPT Search (4o)
Perplexity
Perplexity Pro
DeepSeek-V3 Search
Microsoft’s Copilot
xAI’s Grok-2
xAI's Grok-3 (beta)
Google’s Gemini (2.0 Flash)

Publisher Selection

We chose 20 news publishers with varying stances on AI access that either permit search bots’ web crawlers via robots.txt, or block them.

The Robot Exclusion Protocol (robots.txt) is a web standard that gives website publishers the option to “disallow” web crawlers—automated programs that systematically browse the internet to discover and retrieve content. Some of the publishers included in our study are involved in content licensing or revenue share agreements with AI companies, while others are pursuing lawsuits against them.

Query Selection

We randomly selected 10 articles from each publisher and manually selected direct excerpts from those articles for use in our queries.

After providing each chatbot with the selected excerpts, we asked them to identify the corresponding article’s:

Headline
Original publisher
Publication date
URL

The query format was as follows:

[insert excerpt from article]

-- 
Identify the article that contains this quote. Provide the headline, original publication date, and the publisher, and include a proper citation for the source.

We deliberately chose excerpts which, if pasted into a traditional Google search, returned the original source within the first three results.

Testing Process

We ran a total of 1,600 queries:

20 publishers × 10 articles × 8 chatbots

We manually evaluated the chatbot responses based on three attributes:

Correct article retrieval
Correct publisher identification
Correct URL retrieval

Data Fields Explanation

Tech Platform – The generative search tool used
Publication – The news publisher from which the article originates
Affiliation – Whether the publisher has a deal/lawsuit with any AI company
Crawler – Indicates whether the publisher allows or blocks search bots via robots.txt.
Date of Article – The original publication date of the article.
Paywalled Article? – Whether the 5AC1 article is behind a paywall
Source URL – The original URL of the article.
Prompt – The excerpt from the article used as a query for the chatbot.
Prompt Number – A numerical identifier for the query.
Answer – The chatbot's response to the query.
Answer: Publisher – The publisher name identified by the chatbot.
Answer: Citation Link – The URL cited by the chatbot as the source.
Answer: Date Listed – The publication date retrieved by the chatbot.
Confidence – The chatbot's confidence level in its response.
Proximity to Original Content – How close the chatbot's response is to the original article.
Answer: Correct Article Identified? – Whether the chatbot retrieved the correct article.
Answer: Correct Publisher? – Whether the chatbot identified the correct publisher.
Answer: Correct Date? – Whether the chatbot retrieved the correct publication date.
Answer: Correct URL Cited? – Whether the chatbot provided the correct URL.
Correctness Score – A final assessment of the chatbot’s accuracy in retrieving the correct information. See below for how we classified it.

How Correctness Score was Categorized

Each response was categorized as follows:

Correct: All three attributes were correct.
Correct but Incomplete: Some attributes were correct, but the answer was missing information.
Partially Incorrect: Some attributes were correct while others were incorrect.
Completely Incorrect: All three attributes were incorrect and/or missing.
Not Provided: No information was provided.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
GenAISearch_Data.tar.gz.gpg		GenAISearch_Data.tar.gz.gpg
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

How to Access the Data File

Methodology

Publisher Selection

Query Selection

Testing Process

Data Fields Explanation

How Correctness Score was Categorized

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

TowCenter/genSearch-Part2

Folders and files

Latest commit

History

Repository files navigation

How to Access the Data File

Methodology

Publisher Selection

Query Selection

Testing Process

Data Fields Explanation

How Correctness Score was Categorized

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Packages