This project consists of several Python scripts that extract domain information, crawl news articles through the Google News API, create article embeddings, and perform structured information extraction using LLM prompt engineering.
To run the scripts in this project, you need to have the following:
- Python 3.x
- Required Python packages (can be installed using pip install -r requirements.txt)
- Environment variables set for the following:
  - GOOGLE_SEARCH_API_KEY: API key for the Google Custom Search API
  - GOOGLE_SEARCH_ENGINE_ID: Custom search engine ID for the Google News API
  - OPENAI_API_KEY: API key for OpenAI
- Clone the repository to your local machine:
git clone <repository_url>
cd domain-news-analysis
- Install the required packages:
pip install -r requirements.txt
- Set the necessary environment variables with your API keys.
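Before running anything, it can help to confirm that the three variables are actually visible to Python. The following is a minimal sketch, not part of the project: the helper name missing_env_vars is hypothetical, and only the variable names come from the prerequisites list.

```python
import os

# Variable names taken from the prerequisites list in this README.
REQUIRED_VARS = (
    "GOOGLE_SEARCH_API_KEY",
    "GOOGLE_SEARCH_ENGINE_ID",
    "OPENAI_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments checks the current process environment; an empty list means all three keys are set.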
The index.py script is the main entry point. It drives the entire process: domain information extraction, news crawling, embedding creation, and structured information extraction.
python index.py
The script will prompt you to enter the domain name of a company. It will then check if the domain is registered, and if it is, it will fetch domain information and crawl news articles related to the domain and the company name (if available). The crawled articles will be used to create embeddings and then indexed using FAISS.
The domain.py script contains the Domain class, which provides functionality to check if a domain is registered and fetch domain information.
The google_news_api.py script contains the GoogleNewsAPI class, which interacts with the Google News API to fetch news articles related to a given query.
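The internals of GoogleNewsAPI are not shown here. Since the environment variables refer to the Google Custom Search API, a request is assumed to look roughly like the sketch below, which targets the public Custom Search JSON API endpoint. The helper name build_search_url and its defaults are hypothetical, not the project's own code.

```python
import os
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key=None, engine_id=None, num=10):
    """Build a Custom Search request URL (hypothetical helper).

    Falls back to the environment variables named in this README
    when no explicit key or engine ID is passed.
    """
    params = {
        "key": api_key or os.environ.get("GOOGLE_SEARCH_API_KEY", ""),
        "cx": engine_id or os.environ.get("GOOGLE_SEARCH_ENGINE_ID", ""),
        "q": query,
        "num": num,  # the API accepts values from 1 to 10 per request
    }
    return CUSTOM_SEARCH_ENDPOINT + "?" + urlencode(params)
```

Sending a GET request to the resulting URL should return a JSON document whose items list holds the individual results (title, link, snippet); how the project parses that response is not documented here.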
The embeddings.py script contains the Embeddings class, which handles OpenAI embeddings and creates an article index using FAISS.
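To make the idea of an article index concrete, here is a brute-force stand-in written in plain Python. It is not the FAISS API and not the project's Embeddings class: it only illustrates the kind of nearest-neighbour lookup such an index performs, using cosine similarity as an assumed metric and toy vectors in place of real OpenAI embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, article_vectors, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(article_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real article embeddings.
articles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
results = top_k([1.0, 0.0, 0.0], articles, k=2)
```

Here results ranks article 0 first and article 2 second. A library such as FAISS answers the same question far more efficiently for large collections.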
The prompts.py script contains the Prompts class, which performs structured information extraction using LLM prompt engineering.
The script saves the structured information for each event type to a JSON file named output.json, in the following format:
{
"event_type_1": {
"question_1": "answer_1",
"question_2": "answer_2",
...
},
"event_type_2": {
"question_1": "answer_1",
"question_2": "answer_2",
...
},
...
}
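Given that structure, the results can be consumed with the standard json module. The sketch below is illustrative only: summarize_output is a hypothetical helper, and the sample data simply mirrors the placeholder keys shown above.

```python
import json


def summarize_output(data):
    """Return one 'event_type / question: answer' line per entry."""
    lines = []
    for event_type, answers in data.items():
        for question, answer in answers.items():
            lines.append(f"{event_type} / {question}: {answer}")
    return lines


# Sample data mirroring the documented structure; real keys will differ.
sample = json.loads('{"event_type_1": {"question_1": "answer_1"}}')
summary = summarize_output(sample)
```

To process a real run, load the file with json.load on an open handle to output.json and pass the resulting dictionary to the same function.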
This project is licensed under the MIT License.